Closed: vitobotta closed this issue 4 years ago.
I see the same (but chart v0.9.3) - it will just freeze and the system load will increase and increase. Running `iostat -x 1` you will see 100% util on the RBD devices, but not actually any I/O.
I have tried with a new single-node cluster from scratch, this time with 4 cores from UpCloud, which also happens to have the fastest disks (by far) I've seen among the cloud providers I have tried, so it's unlikely to be a problem with disks. Well, exactly the same problem. :( After a little while downloading many largish files like videos, the server became totally unresponsive and I couldn't even SSH into it again. Like I said earlier, with the previous version of Rook I could do exactly the same operation (basically I am testing the migration of around 25 GB of Nextcloud data from an old pre-Kubernetes server) even with servers with just 2 cores using Rook v0.9.3. I am going to try again with this version...
Also, since I couldn't SSH into the servers I checked the web console from UpCloud and saw this:
Not sure if it's helpful... I was also wondering whether there are issues using Rook v1.0 with K3S, since I've used K3S with these clusters (but also with v0.9.3, which was OK). Perhaps I should also try with standard Kubernetes just to see if there's a problem there. I'll do this now...
@vitobotta , I've seen this hung-task message when something like RBD or Cephfs is unresponsive and a VM thinks that the I/O subsystem is hung. So the question then becomes why is Ceph unresponsive? Is the Ceph cluster healthy when this happens? ceph health detail. Can you get a dump of your ceph parameters using the admin socket, something like "ceph daemon osd.5 config show". Does K8S show any Ceph pods in bad state?
You may want to pay attention to memory utilization by OSDs. What is the CGroup memory limit for rook.io OSD pods and what is the ceph.conf-defined osd_memory_target set to? Default for osd_memory_target is 4 GiB, much higher than default for OSD pod "resources": "limits". This can cause OSDs to exceed the CGroup limit. Can you do a "kubectl describe nodes" and look at what the memory limits for the different Ceph pods actually are? You may want to raise limits in cluster.yaml and/or lower osd_memory_target. Let me know if this helps. See this article on osd_memory_target
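To make those two knobs concrete, here is a sketch of what raising the OSD pod limits in `cluster.yaml` might look like; the values are illustrative, not recommendations, and the key point is only that the cgroup limit should sit comfortably above `osd_memory_target` (4 GiB by default):

```yaml
# cluster.yaml fragment (illustrative values): give each OSD pod a
# memory limit comfortably above osd_memory_target, which defaults
# to 4 GiB, so the kernel doesn't OOM-kill the OSD under load.
spec:
  resources:
    osd:
      requests:
        memory: "4Gi"
      limits:
        memory: "6Gi"
```

Alternatively, lower `osd_memory_target` itself via your ceph.conf override so it fits under whatever limit you already have.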
Hi @bengland2, yes the clusters (I have tried with several) were healthy etc when I was doing these tests.
In the meantime I have recreated the whole thing again but this time with OpenEBS instead of Rook just to test, and while OpenEBS was slower I didn't have any issues at all, with load never above 4.
With Rook, same test on same specs it reached 40 or even more until I had to forcefully reboot, and this happened more than once. I am going to try once again with OpenEBS to see if I was just lucky...
@vitobotta Sounds like you are copying files to an RBD volume. Try lowering your kernel dirty pages way down (e.g. sysctl -w vm.dirty_ratio=3 vm.dirty_background_ratio=1) on your RBD client and see if that makes write response times more reasonable. Also, maybe you need to give your OSDs more RAM, in rook this is done with resources: parameter. A Bluestore OSD expects to have > 4 GiB of RAM by default. Older rook.io may not be doing this by default. Ask me if you need more details.
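For reference, the `sysctl -w` form above is transient; if the lower dirty-page thresholds help, they can be made persistent with a drop-in file (the filename here is illustrative):

```
# /etc/sysctl.d/90-rbd-writeback.conf  (filename illustrative)
# Start background writeback at 1% of RAM and throttle writers at 3%,
# so large copies to /dev/rbdX flush early instead of building up
# gigabytes of dirty page cache and then bursting.
vm.dirty_background_ratio = 1
vm.dirty_ratio = 3
```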
The weird thing is that I didn't seem to have these issues with the previous version, using the same specs and config. Not sure what RBD client you mean, I just mounted the /dev/rbdX device into a directory :)
@vitobotta by "RBD client" I meant the host where you mounted /dev/rbdX. Also I expect you are using Bluestore not Filestore OSDs.
I think Filestore since I was using a directory on the main disk rather than additional disks.
Filestore is basically in maintenance mode at this point, you should be using Bluestore, which has much more predictable write latency. Let us know if Bluestore is giving you trouble.
Hi @bengland2, I didn't read anywhere that Filestore (and thus directories support?) is not in active development, I must have missed it... I will try with additional disks instead of directories so I can test with Bluestore when I have time.
Today I had a chance to repeat the test with that 25GB of mixed data with a new 3-node cluster with Rook 1.0 installed. The test started well until it was extracting/copying videos, at which point once again the load average climbed quite quickly to over 70 on a node and 40 on another, so I had to forcefully reboot the two nodes. I uninstalled/cleaned up Rook completely, and repeated the test with OpenEBS first, and Longhorn after that. OpenEBS was again very very slow but worked, while Longhorn reached a load of max 12 when processing videos but then it completed the task and I was able to move on.
Also this time I am running standard Kubernetes 1.13.5, not K3S, so I have excluded both that it could be a problem with K3S, and that it could be a problem with the provider I was using before (Hetzner Cloud).
I don't know what to say... I hoped I could use Rook because it's faster and I have heard good things, but for me from these tests it looks almost unusable when dealing with large files. At least that's the impression I have unfortunately :(
I will try with disks instead of directories when I have a chance. Thanks
I can't believe it! :D
I decided to try Bluestore now because I want to understand what's going on, so I set up a new cluster this time with DigitalOcean (3x 4 cores, 8GB ram) and added volumes to the droplets, so to use these disks with Ceph instead of a directory on the main disk. I was able to complete the usual test and the load never went above 5 when extracting videos!
I don't think it's because of DigitalOcean vs Hetzner Cloud/UpCloud, I guess the problem was as you suggested Filestore with directories. But out of curiosity why is there such a big difference in performance and CPU usage between Filestore and Bluestore? Thanks! I'm gonna try the experiment once again just in case, and if it works I will be very happy! :)
Tried again and had the same problem. :(
I believe this may be an issue with Ceph itself. It's my understanding that the Ceph OSDs with Bluestore can use a lot of CPU in some circumstances. I think this is especially true for clusters with many OSDs and clusters with very fast OSDs.
Bluestore will generally result in better performance compared to Filestore, but the performance also comes with more CPU overhead. It's also my understanding that in today's hardware landscape, Ceph performance is often bottlenecked by CPU.
Update: I created a Ceph issue here https://tracker.ceph.com/issues/40068
@BlaineEXE To see what they are doing about it, see project crimson. Ceph was designed in a world of HDDs, with 3 orders of magnitude less random IOPS per device. So yes it needs an overhaul, and they are doing that. Ceph is not the only application that is dealing with this.
An update... as suggested by @BlaineEXE I did my usual test but using the latest Mimic image instead of Nautilus. It worked just fine with two clusters and managed to finish copying the test data with a very low CPU usage. I repeated this twice with two different clusters, successfully both times. For the third test, I just updated Ceph to Nautilus on the second cluster, and surprisingly the test finished ok again. But then I created a new cluster with Nautilus from the start and boom, usual problem. Started OK until I had to forcefully reboot the server. This is a single node cluster (4 cores, 16 GB of ram) with Rook 1.0.1 on Kubernetes 1.13.5 deployed with Rancher. There's a problem somewhere, I just wish I knew where.
Is @sp98 still working on this issue? It would be great to see if there are any noticeable differences between how Rook starts up a Mimic cluster compared to how it starts up a Nautilus cluster to determine if Rook is the cause. We should also pay close attention to the behavior of ceph-volume, as the initial OSD prep using Mimic's c-v could be different than the prep using Nautilus' c-v.
@BlaineEXE Yes, but I had to move to 2696. Will jump back to this one in a few days' time. Thanks for those updates above. I'll try that and update my findings here.
Just tried once again with a new cluster, again with the latest version from the start, same problem. As of now I am still unable to actually use Rook/Ceph :( It's not like there are things that I could do wrong because it's so easy to install etc... so I don't know where to look. This time the problem occurred very quickly after I started copying data into a volume.
I was wondering, could it be something related to using a volume directly bypassing kubernetes?
Not sure if it's helpful, but what I am trying to do is download some data from an existing server into a volume so that I can use that data with Nextcloud. In order to do this, because there are timeouts etc if I try to do it from inside a pod, this is what I do to use the volume directly:
```
rbd map <pvc> -p replicapool    # prints the device name, e.g. /dev/rbd3
mkdir nextcloud-data
mount /dev/rbd3 nextcloud-data
mkdir -p nextcloud-data/_migration
cd nextcloud-data/_migration/
ssh old-server "cat /data/containers/nextcloud.tar" | tar -xvv
```
It starts downloading the data and then at some random point, sooner or later load will climb very very quickly up to 70-80 or more until I have to forcefully reboot the server. Since as said I don't know where to look, I am really confused by this problem and I even thought it may have something to do with the fact that I am extracting the archive while downloading it (I know, it doesn't make sense), but the problem occurs also if I just download the archive somewhere first and then extract it into the volume.
I am new to all of this so I wish I had more knowledge on how to further investigate :( I don't have any of these issues when using Longhorn or OpenEBS but I would prefer using Rook at this point for the performance and because Ceph is an established solution, while the others are very new and have their own issues.
@vitobotta Can you confirm if you have had this issue when deploying mimic (v13) or only with nautilus (v14)? If you haven't tried mimic with rook 1.0, could you try that combination? It would be helpful to confirm if this is a nautilus issue, or if it's rook 1.0 that causes the issue and it happens on both mimic and nautilus.
@markhpc @bengland2 Any other secrets up your sleeve for tracking down perf issues besides what has been mentioned? Thanks!
Hi @travisn , I did a couple of tests with Mimic the other day and didn't have any problems with it. I just tried again with Mimic (v13.2.5-20190410) right now and all was good. Since I was always using Rook 1.0.1, it seems like it may be an issue with Nautilus? I am using Ubuntu 18.04 with 4.15.0-50-generic, if that helps somehow. Once I did a test with Fedora 29 (I think?) as suggested by @galexrt and it worked fine, I don't know if I was just lucky.... perhaps I can try again... to see if it happens only with Ubuntu.
Hi all, I have done some more tests with interesting results. By "tests" I don't mean anything scientific since I lack deeper understanding of how this stuff works. I mean the usual download of data into a volume as described earlier.
I have repeated the same test with multiple operating systems and these are the results:
- Ubuntu 18.04: the download/copy fails each time. It starts OK but then at some point (sometimes right away, sometimes even towards the end) it stops and I have to forcefully reboot the server because it becomes unresponsive;
- Fedora 29: I have tried 3 times, no problems;
- CentOS 7: I have tried just once and had no problems;
- RancherOS 1.5.2: I tried twice, once using the Ubuntu console and once using the Fedora console. The test failed both times, but my understanding is that RancherOS uses the Ubuntu kernel, although I am not 100% sure.
Finally, I tried three times with Ubuntu but upgrading the kernel to 5.0.0.15 before setting things up/doing the test. Each time the test worked fine without any problems.
I don't know enough about this stuff to jump to conclusions but is it possible that there is a problem with Nautilus and the default Ubuntu 18.04 kernel? To exclude the possibility that it might be a problem with the customised kernel used by the provider, I have tried on Hetzner Cloud, UpCloud and DigitalOcean with the same result: the problem occurs with the default kernel but not with 5.0.0.15.
Is there anyone so kind as to try and reproduce this? Please note that, as far as I remember, I haven't seen the problem when copying small amounts of data. It always happens when I copy that 24-25 GB of data that I am trying to migrate, or sometimes when I run a benchmark on a volume with fio. Thanks a lot in advance if someone can reproduce this / look into it. :)
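For anyone trying to reproduce this, a minimal fio jobfile along the lines of the benchmark mentioned above might look like the following; sizes and the directory path are illustrative, and `directory` should point at a filesystem mounted from an RBD device:

```
; large-write.fio (illustrative): sequential buffered writes of large
; files, roughly like extracting a tar of videos onto the volume.
[global]
directory=/mnt/nextcloud-data   ; a mounted /dev/rbdX filesystem
size=4g
bs=4m
ioengine=psync
end_fsync=1

[seq-write]
rw=write
numjobs=2
```

Run with `fio large-write.fio` while watching load average and `iostat -x 1` on the node.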
Guys... I tried again with the 5.0.0.15 kernel and it happened again :( The first test copying the data into the volume was fine, but then I did a backup with Velero followed by a restore and the system became unresponsive during the restore, as usual...
I'm not a Ceph expert but I do deal with lots of storage systems, architectures, and technologies. My first recommendation would be to consult the hardware recommendations and make sure the software is operating within the boundaries of the tested matrix:
http://docs.ceph.com/docs/jewel/start/hardware-recommendations/
I would not recommend just going with the minimal requirements. Double the numbers, and verify that your test environment conforms to that.
I'm running into this problem as well. It's causing my Kubernetes node to flap back and forth between NotReady and Ready; containers fail to start up, and even a web browser or the system monitor locks up. The system ends up with over 1,000 processes eventually, and I think it's also causing VirtualBox to fail to start.
Currently on a bare-metal single-node master: k8s 1.14.1, rook deployed from the release-1.0 branch with storageclass-test.yml and cluster-test.yml (except that databaseSizeMB, journalSizeMB, and osdsPerDevice were commented out).
Host is running Ubuntu 18.04.2 (currently 4.18.0-20-generic kernel) and has 2x 10-core Xeon (20 cores, 40 threads total) with 96 GB of registered DDR4 running at 2133. 1 TB 970 EVO Plus NVMe drive. Suffice it to say, it should have plenty of CPU, RAM, and I/O speed...
edit: `iostat -x 1` shows utilization going very high on the NVMe device most of the time, but almost no utilization (0-1%) on the RBD devices.
Has anyone had a chance to try and reproduce this issue?
> I would not recommend to just go with minimal requirements. Double the numbers. Verify that your test environment conforms with that.
I doubt it is a problem with the specs. First, I have tried with servers with 8 dedicated cores and 32 GB of ram, which I think is more than sufficient for my test workloads; second, the problem appears to occur only with Ubuntu so far. I have just tested twice again with CentOS (same config and same tests as usual, apart from the OS) and everything has worked fine each time. I am going to try once more.
@vitobotta I'll try it out soon (I was stuck with another ticket, which is complete now). Just getting the environment ready for this (as my local dev environment won't be sufficient to test this out).
Thanks... I think I spoke too soon :( I was doing a restore with Velero/Restic on the CentOS server, and it got stuck at 15 of 24 GB with IO wait at 40+ constantly. I rebooted the server, started the restore again and it's proceeding slowly although the IO wait is almost 0. Load is ~12 now.
It completed the second time. I don't know what happened earlier with the iowait, perhaps it was the disk? At least I didn't have the problem as with Ubuntu - i.e. the system didn't become unresponsive. I am going to try with another CentOS server/cluster.
Had the same problem with high iowait again with CentOS. The system doesn't become unresponsive like with Ubuntu, so I don't have to reboot it etc., but backups and restores get basically stalled because of the iowait.
Is there anyone here using Rook/Ceph with clusters deployed with Rancher? Could it be something related to how Rancher deploys Kubernetes? What do you use?
@vitobotta this sounds like a situation where the client workloads and rbd (OSDs) are co-located in the same kernel space?
There is apparently a 'commonly known' issue with ceph when you run clients doing heavy IO in the same kernel space as ceph (OSD) itself. Here is some additional information:
From the ceph documentation:
> DO NOT mount kernel clients directly on the same node as your Ceph Storage Cluster, because kernel conflicts can arise.
Also see this post from the ceph mailing list:
> The classic case is when under some memory pressure, the kernel tries to free memory by flushing the client's page cache, but doing the flush means allocating more memory on the server, making the memory pressure worse, until the whole thing just seizes up.
If you want to do a lot of reading, you can see my frustrating progression in figuring out this issue in my local rook/ceph setup.
Assuming this is the same type of situation, I suggest separating the ceph OSD runtimes from the client (applications) leveraging the storage backed by ceph to see if this helps solve your problem.
@vitobotta It is not clear to me why there would be such conflicts with Bluestore, since Bluestore is not competing for memory with kernel clients (except in TCP stack perhaps). Perhaps the documentation referenced above is now obsolete for bluestore, will take a look. But look out for CGroup memory and CPU limits - you can control these for Ceph daemons in your cluster.yaml, kubectl describe will tell you what they are.
Related to #2074 ?
Hi @billimek, I am not sure what you mean, but if I understood correctly, yes: both the storage with Rook and the workloads were running on the same servers. How would I go about separating the OSD runtimes from the applications? My goal is to use small clusters where all nodes also run the storage.
In the meantime since nobody reproduced my issue I had basically given up and have been using OpenEBS for a few weeks without any problems at all, doing exactly the same things with the same data and setup as I did with Rook. It remains a mystery for me because when testing with my data I had either the system become unresponsive with load going crazy, or ridiculously high IO wait depending on the combination of operating system and kernel. In either case I was always forced to reboot. I don't know if there's something "particular" about my data, it's just mixed stuff - documents, photos, videos, code. And like I said no problems whatsoever with OpenEBS so yeah, a mystery for me... I was looking to use Rook because I could use block storage/fs/object storage with one solution, but it looks like it doesn't like me :)
I don't understand why it's this bad. I was using an older version of rook (I believe 0.7) with the applications and OSDs happily coexisting on the same boxes.
I have had this issue multiple times, rebuilt my cluster with v13 and v14 bluestore because of it multiple times, it's actually beyond tolerable. I just get PVCs working nicely, everything running, get to working and then any large data transfer kills the workers.
Latest test: v13 OSDs on two worker nodes but not on the master. Create a pod that mounts the CephFS on the master. rsync using the host machine to the mountpoint created by the pod, with a low bandwidth cap, and still boom whilst rsyncing a large file. I also confirmed the IO overload just by using WebDAV to sync photos from my phone, so it doesn't appear to be related to file size (the average photo on my phone is 2.5 MB).
It's getting tiresome now because every time the worker IO gets that high, only a hard restart resolves it, and that upsets the operator, and I get closer to a broken cluster every time.
@iMartyn - are you using host network with Ceph?
Hi Folks,
The way I would approach this now is to try and get some more information about the root cause for the slowdown on an OSD showing high CPU utilization. It might be a bit harder to do this on an OSD inside a container, but here are some ideas:
1) To start out with, just look at all of your OSD and system logs to see if there's anything obvious like timeouts, a tainted kernel, driver issues, etc.
2) Given that the problem may only manifest on certain Linux distributions, you might want to try a tool I wrote that shows you the differences between multiple sysctl outputs:
https://github.com/ceph/cbt/blob/master/tools/compare_sysctl.py
The idea here is that you run something like `sudo sysctl -a > ubuntu_1804.sysctl.out` on one host and `sudo sysctl -a > centos7.sysctl.out` on another host. Then you use this tool to show the differences in the configuration settings:

```
~/src/cbt/tools/compare_sysctl.py centos7.sysctl.out ubuntu_1804.sysctl.out
```
On my Ubuntu and CentOS test boxes this showed a large number of different settings, so it may be difficult to narrow down whether one setting is causing a problem, but at least it may be a place to start.
3) You might want to try running "ceph daemon osd.<id> dump_ops_in_flight" to see whether requests are getting stuck inside the OSDs.
4) If you have iostat or collectl, it might be worth looking at the service and wait times of the devices underneath filestore/bluestore. I.e., this might help pinpoint if there's something going on underneath the OSD, like a slow or failing device.
5) On bare metal I often utilize my wallclock profiler (https://github.com/markhpc/gdbpmp/) or perf against the OSD to try and get an idea of what it's spending its time doing. The idea behind the wallclock profiler is that it periodically gathers stacktraces with GDB and then formats them into a callgraph. You'll need access to gdb/python and the Ceph debug binaries inside the container though, which may be difficult.
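Back on point 2: if the cbt script isn't handy, a rough coreutils-only equivalent of the same comparison is possible (filenames illustrative):

```shell
# Coreutils-only version of the sysctl comparison (filenames illustrative).
# On each host, dump and sort the settings:
#   sudo sysctl -a 2>/dev/null | sort > centos7.sysctl.out
#   sudo sysctl -a 2>/dev/null | sort > ubuntu_1804.sysctl.out
# Then copy both files to one machine and keep only the differing lines
# ("<" = value on the first host, ">" = value on the second):
diff centos7.sysctl.out ubuntu_1804.sysctl.out | grep '^[<>]'
```

This loses the side-by-side formatting of compare_sysctl.py but is enough to spot candidate settings to investigate.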
@markhpc,
Do you know if it is acceptable or proper to run ceph client workloads (e.g. pods with PVCs residing in ceph/rbd storage) in the same kernel space (same k8s node) where the ceph OSDs are also running?
Is the same true if running heavy IO workloads?
@billimek From what I recall, the problem with doing that in the past was that in low memory scenarios you could end up using memory to create client requests and then not have enough memory to actually perform the OSD side writes leading to a deadlock. BlueStore may be more resilient to it vs filestore but it's possible it could still happen if you are very memory constrained.
I doubt it's causing these issues, though.
@mykaul I can confirm the same happens with hostnetwork and without.
@markhpc it's kinda impossible to do any debugging in these scenarios. The load goes into the 10s and then beyond. I can't login with kb+mouse on the physical machine. I have checked the logs and the only errors are those such as are shown in @vitobotta's screenshot above. I'm running on exactly the same distro on all machines (debian 9). I was on a mixed debian 9/10 at one point and the errors were happening on the 10 worker node, but downgrading that hasn't helped. So all my devices have the same sysctl config.
I'm installing collectl now to try and gather more info.
One minor point - I note you say high CPU but that's not the reported case for any of us afaict, it's high load, low cpu.
Agreed @iMartyn, my observations were extremely high load (mostly caused by blocked IO, it seemed) but relatively low CPU utilization in comparison. I actually captured screenshots of netdata and glances during the times when it was happening.
Further data point: I just restarted my node `worker`, which was dead due to load, and as it came back up, the load on the second worker machine `fuji` skyrocketed. Just restarting it now, and I'll see what collectl shows. For info, ceph status at this point shows:
```
  cluster:
    id:     98a13418-a11d-4728-b5d3-d3a8afc876b0
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            1 osds down
            Reduced data availability: 105 pgs inactive
            Degraded data redundancy: 46400/199668 objects degraded (23.239%), 93 pgs degraded, 93 pgs undersized
            12 slow ops, oldest one blocked for 906 sec, osd.1 has slow ops

  services:
    mon: 1 daemons, quorum a
    mgr: a(active)
    mds: myfs-1/1/1 up {0=myfs-b=up:replay}, 1 up:standby
    osd: 4 osds: 3 up, 4 in

  data:
    pools:   3 pools, 300 pgs
    objects: 99.83 k objects, 304 GiB
    usage:   551 GiB used, 12 TiB / 13 TiB avail
    pgs:     35.000% pgs unknown
             46400/199668 objects degraded (23.239%)
             105 unknown
             102 active+clean
             93  active+undersized+degraded
```
and `ceph osd tree`:

```
ID CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       25.46875 root default
-5       12.73438     host fuji
 2   hdd  9.09569         osd.2       up  1.00000 1.00000
 3   hdd  3.63869         osd.3     down  1.00000 1.00000
-3       12.73438     host worker
 0   hdd  3.63869         osd.0       up  1.00000 1.00000
 1   hdd  9.09569         osd.1       up  1.00000 1.00000
```
It seems your commands above don't work in the toolbox though:

```
sh-4.2# ceph daemon osd. dump_ops_in_flight
admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
```
Hi @iMartyn,
> One minor point - I note you say high CPU but that's not the reported case for any of us afaict, it's high load, low cpu.
I was just basing that off this line from the original issue:
> I am not sure where the problem is but I am seeing very high CPU usage since I started using v1.0.0.
If it's high load with associated high iowait, then I agree we are probably more interested in what's happening on the disks.
```
sh-4.2# ceph daemon osd. dump_ops_in_flight
```
I think you are missing the id after "osd.". Try something like "osd.0"?
The collectl data should be useful. Is there any chance these nodes are hitting swap? (sorry if the answer is in the graphs/system data, I haven't gotten a chance to look closely yet)
Collectl raw log : https://nextcloud.martyn.berlin/s/GN7gAzK7szk59Ar - I'm not very sure how to read this, and I hope you can download it before my cluster goes down due to high load again!
Yeah, I tried with numbers and still get:

```
sh-4.2# ceph daemon osd.3 dump_ops_in_flight
admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
```
And no, swap is disabled on these nodes as k8s and swap are mostly incompatible (kubeadm refuses to install on swap-enabled machines)
To get disk stats:
```
collectl -sD -oT -p fuji-20190725-211504.raw.gz
```
A couple of interesting bits. sda is showing high util with lots of large read IO that tapered off toward the end of the collection period:
```
collectl -sD -oT -p fuji-20190725-211504.raw.gz | grep -E "Pct|Util|sda"

# <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size Wait KBytes Merged IOs Size Wait RWSize QLen Wait SvcTim Util
21:16:10 sda 82652 2 167 495 4 11 0 5 2 31 481 1 5 4 81
21:16:20 sda 75601 4 189 400 53 8 0 4 2 45 390 11 53 4 88
21:16:30 sda 81272 18 176 461 17 7 0 4 2 27 449 3 17 4 85
21:16:40 sda 83425 2 168 496 4 8 0 5 2 17 482 1 5 4 81
21:16:50 sda 91644 4 186 494 4 7 0 4 2 24 482 1 5 4 88
21:17:00 sda 92236 4 186 495 4 7 0 4 2 29 483 1 5 4 89
21:17:10 sda 91548 2 183 499 4 8 0 5 2 28 485 1 5 4 89
21:17:20 sda 94050 2 189 498 4 8 0 5 2 24 485 1 5 4 90
21:17:30 sda 64800 2 132 492 4 11 0 5 2 22 472 1 5 4 65
21:17:40 sda 819 0 12 67 3 12 0 5 2 5 47 1 4 4 7
21:17:50 sda 1 0 0 16 16 8 0 4 2 5 2 1 6 6 2
21:18:00 sda 0 0 0 0 0 8 0 4 2 5 2 1 5 5 2
21:18:10 sda 0 0 0 0 0 8 0 4 2 5 2 1 5 5 2
```
I also noticed that sdc had a high wait time spike toward the beginning of the collection period and then got better over time. Still those disks are pretty busy for the last 30 seconds with the CPUs spending a lot of time waiting on IO.
```
collectl -sD -oT -p fuji-20190725-211504.raw.gz | grep -E "Pct|Util|sdc"

# <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size Wait KBytes Merged IOs Size Wait RWSize QLen Wait SvcTim Util
21:16:10 sdc 2056 0 512 4 0 2305 47 38 61 59 7 4 4 1 58
21:16:20 sdc 4238 0 77 55 710 714 31 17 43 64 52 92 594 7 67
21:16:30 sdc 5182 0 212 24 185 50 8 4 13 47 24 49 182 2 64
21:16:40 sdc 21 0 1 15 1 423 37 15 28 7 26 3 7 2 3
21:16:50 sdc 0 0 0 0 0 10 0 2 5 0 4 1 0 0 0
21:17:00 sdc 0 0 0 0 0 11 0 2 5 0 5 1 0 0 0
21:17:10 sdc 114 0 29 4 0 31 4 4 9 1 4 1 0 0 2
21:17:20 sdc 936 0 234 4 0 8 0 2 4 2 4 1 0 0 13
21:17:30 sdc 438 0 108 4 0 538 16 18 29 12 7 2 2 1 14
21:17:40 sdc 17094 0 189 90 10 37 5 4 9 25 88 3 11 3 62
21:17:50 sdc 27376 0 276 99 11 19 1 3 7 56 98 3 12 3 98
21:18:00 sdc 27771 1 282 99 11 32 5 3 10 48 97 3 11 3 97
21:18:10 sdc 27392 0 276 99 11 30 3 3 10 49 98 3 12 3 97
```
sdb looked relatively idle in comparison. In terms of the ceph-osd processes:

```
collectl -sZ -oT -p fuji-20190725-211504.raw.gz | grep -E "Command|ceph-osd"

#Time PID User PR PPID THRD S VSZ RSS CP SysT UsrT Pct AccuTime RKB WKB MajF MinF Command
21:17:00 3544 root 20 3481 53 S 1G 397M 3 0.79 1.89 4 00:07.27 1894 567 0 866 ceph-osd
21:17:00 3545 root 20 3480 53 S 994M 241M 2 0.75 2.43 5 00:07.55 81K 8 0 356 ceph-osd
21:18:00 3544 root 20 3481 53 S 1G 428M 3 0.60 0.95 2 00:08.82 12K 112 0 172 ceph-osd
21:18:00 3545 root 20 3480 53 S 994M 260M 2 0.38 1.40 2 00:09.33 40K 10 0 131 ceph-osd
```
Nothing really dramatic there. Based on the RKB/WKB numbers I'd guess that 3544 is the OSD on sdc and 3545 is the OSD on sda. Curious about the low memory usage for the OSDs here, though. Do you have a lower osd_memory_target than the default?
Nope, all default Rook settings. That machine isn't very well endowed memory-wise: 8 GB in total. I would expect both disks to be fairly slow (sda is a WD Red 4 TB and sdc is a 10 TB WD MyBook disk). sdb is an SSD boot drive and therefore not in the Ceph cluster.
I am not sure where the problem is but I am seeing very high CPU usage since I started using v1.0.0. With three small clusters load average skyrockets to the 10s quite quickly making the nodes unusable. This happens while copying quite a bit of data to a volume mapped on the host bypassing k8s (to restore data from an existing non-k8s server). Nothing else is happening with the clusters at all. I am using low specs servers (2 cores, 8 GB of RAM) but I didn't see any of these high load issues with 0.9.3 on same-specs servers. Has something changed about Ceph or else that might explain this? I've also tried with two providers, Hetzner Cloud and UpCloud. Same issue when actually using a volume.
Is it just me or is it happening to others as well? Thanks!