rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

Very high CPU usage on Ceph OSDs (v1.0, v1.1) #3132

Closed: vitobotta closed this issue 4 years ago

vitobotta commented 5 years ago

I am not sure where the problem is, but I am seeing very high CPU usage since I started using v1.0.0. With three small clusters, the load average skyrockets into the tens quite quickly, making the nodes unusable. This happens while copying quite a bit of data to a volume mapped on the host, bypassing k8s (to restore data from an existing non-k8s server). Nothing else is happening on the clusters at all. I am using low-spec servers (2 cores, 8 GB of RAM), but I didn't see any of these high-load issues with 0.9.3 on servers with the same specs. Has something changed in Ceph or elsewhere that might explain this? I've also tried with two providers, Hetzner Cloud and UpCloud, and see the same issue as soon as I actually use a volume.

Is it just me or is it happening to others as well? Thanks!

davidkarlsen commented 5 years ago

I see the same (but with chart v0.9.3): the system just freezes and the load keeps increasing.

Running iostat -x 1 shows 100% util on the RBD devices, but no actual I/O.

vitobotta commented 5 years ago

I have tried with a new single-node cluster from scratch, this time with 4 cores from UpCloud, which also happens to have the fastest disks (by far) I've seen among the cloud providers I have tried, so it's unlikely to be a problem with the disks. Well, exactly the same problem. :( After a little while downloading many largish files like videos, the server became totally unresponsive and I couldn't even SSH into it again. Like I said earlier, with the previous version of Rook (v0.9.3) I could do exactly the same operation (basically I am testing the migration of around 25 GB of Nextcloud data from an old pre-Kubernetes server) even on servers with just 2 cores. I am going to try again with this version...

vitobotta commented 5 years ago

Also, since I couldn't SSH into the servers I checked the web console from UpCloud and saw this:

[Screenshot from the UpCloud web console, 2019-05-08 13:54, showing hung-task kernel messages]

Not sure if it's helpful... I was also wondering whether there are issues using Rook v1.0 with K3S, since I've used K3S with these clusters (but also with v0.9.3, which was OK). Perhaps I should also try with standard Kubernetes just to see if there's a problem there. I'll do this now.

bengland2 commented 5 years ago

@vitobotta , I've seen this hung-task message when something like RBD or CephFS is unresponsive and a VM thinks that the I/O subsystem is hung. So the question then becomes: why is Ceph unresponsive? Is the Ceph cluster healthy when this happens (ceph health detail)? Can you get a dump of your Ceph parameters using the admin socket, something like "ceph daemon osd.5 config show"? Does K8S show any Ceph pods in a bad state?
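
For anyone following along, those checks boil down to something like the commands below (the rook-ceph namespace and osd.5 are just examples; the ceph commands are normally run from the toolbox pod, though the admin-socket one may need to be run inside the OSD container instead):

# overall cluster state and any slow/blocked requests
ceph health detail
ceph status

# dump the running config of one OSD via its admin socket
ceph daemon osd.5 config show | grep osd_memory_target

# check whether any Ceph pods are pending or crash-looping
kubectl -n rook-ceph get pods -o wide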

You may want to pay attention to memory utilization by the OSDs. What is the CGroup memory limit for the rook.io OSD pods, and what is the ceph.conf-defined osd_memory_target set to? The default for osd_memory_target is 4 GiB, much higher than the default for the OSD pod "resources": "limits", and this can cause OSDs to exceed the CGroup limit. Can you do a "kubectl describe nodes" and look at what the memory limits for the different Ceph pods actually are? You may want to raise the limits in cluster.yaml and/or lower osd_memory_target. Let me know if this helps. See this article on osd_memory_target.
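
For reference, a minimal sketch of what raising the OSD limits in cluster.yaml might look like (the numbers are only illustrative, not a recommendation; the idea is to keep osd_memory_target comfortably below the pod memory limit):

spec:
  resources:
    osd:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu: "2"
        memory: "4Gi"

osd_memory_target itself can be lowered through a ceph.conf override (Rook's rook-config-override ConfigMap) if raising the pod limit is not an option.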

vitobotta commented 5 years ago

Hi @bengland2, yes, the clusters (I have tried several) were healthy when I was doing these tests.

In the meantime I have recreated the whole thing again, but this time with OpenEBS instead of Rook just to compare, and while OpenEBS was slower I didn't have any issues at all, with the load never going above 4.

With Rook, in the same test on the same specs, the load reached 40 or more until I had to forcefully reboot, and this happened more than once. I am going to try once again with OpenEBS to see if I was just lucky...

bengland2 commented 5 years ago

@vitobotta Sounds like you are copying files to an RBD volume. Try lowering your kernel dirty-page thresholds way down (e.g. sysctl -w vm.dirty_ratio=3 vm.dirty_background_ratio=1) on your RBD client and see if that makes write response times more reasonable. Also, maybe you need to give your OSDs more RAM; in Rook this is done with the resources: parameter. A Bluestore OSD expects to have > 4 GiB of RAM by default, and older rook.io may not provide this by default. Ask me if you need more details.
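
For reference, applying and persisting that sysctl change might look roughly like this on the host that has the RBD device mapped (the file name under /etc/sysctl.d is arbitrary):

# apply immediately
sysctl -w vm.dirty_ratio=3 vm.dirty_background_ratio=1

# verify the current values
sysctl vm.dirty_ratio vm.dirty_background_ratio

# persist across reboots
printf 'vm.dirty_ratio = 3\nvm.dirty_background_ratio = 1\n' > /etc/sysctl.d/90-rbd-dirty.conf
sysctl --system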

vitobotta commented 5 years ago

The weird thing is that I didn't seem to have these issues with the previous version, using the same specs and config. Not sure which RBD client you mean; I just mounted the /dev/rbdX device onto a directory :)

bengland2 commented 5 years ago

@vitobotta by "RBD client" I meant the host where you mounted /dev/rbdX. Also I expect you are using Bluestore not Filestore OSDs.

vitobotta commented 5 years ago

I think Filestore since I was using a directory on the main disk rather than additional disks.

bengland2 commented 5 years ago

Filestore is basically in maintenance mode at this point, you should be using Bluestore, which has much more predictable write latency. Let us know if Bluestore is giving you trouble.

vitobotta commented 5 years ago

Hi @bengland2, I hadn't read anywhere that Filestore (and thus directory support?) is not in active development; I must have missed it... I will try with additional disks instead of directories so I can test with Bluestore when I have time.

Today I had a chance to repeat the test with that 25 GB of mixed data on a new 3-node cluster with Rook 1.0 installed. The test started well until it got to extracting/copying videos, at which point the load average once again climbed quite quickly to over 70 on one node and 40 on another, so I had to forcefully reboot the two nodes. I uninstalled/cleaned up Rook completely and repeated the test with OpenEBS first and Longhorn after that. OpenEBS was again very, very slow but worked, while Longhorn reached a load of at most 12 when processing videos, completed the task, and let me move on.

Also, this time I am running standard Kubernetes 1.13.5, not K3S, so I have ruled out both that it could be a problem with K3S and that it could be a problem with the provider I was using before (Hetzner Cloud).

I don't know what to say... I hoped I could use Rook because it's faster and I have heard good things, but from these tests it looks almost unusable when dealing with large files. At least that's the impression I have, unfortunately :(

I will try with disks instead of directories when I have a chance. Thanks

vitobotta commented 5 years ago

I can't believe it! :D

I decided to try Bluestore now because I want to understand what's going on, so I set up a new cluster, this time with DigitalOcean (3x 4 cores, 8 GB RAM), and added volumes to the droplets so that Ceph would use those disks instead of a directory on the main disk. I was able to complete the usual test and the load never went above 5 when extracting videos!

I don't think it's because of DigitalOcean vs Hetzner Cloud/UpCloud; I guess the problem was, as you suggested, Filestore with directories. But out of curiosity, why is there such a big difference in performance and CPU usage between Filestore and Bluestore? Thanks! I'm going to run the experiment once again just in case, and if it works I will be very happy! :)

vitobotta commented 5 years ago

Tried again and had the same problem. :(

BlaineEXE commented 5 years ago

I believe this may be an issue with Ceph itself. It's my understanding that the Ceph OSDs with Bluestore can use a lot of CPU in some circumstances. I think this is especially true for clusters with many OSDs and clusters with very fast OSDs.

Bluestore will generally result in better performance compared to Filestore, but the performance also comes with more CPU overhead. It's also my understanding that in today's hardware landscape, Ceph performance is often bottlenecked by CPU.

Update: I created a Ceph issue here https://tracker.ceph.com/issues/40068

bengland2 commented 5 years ago

@BlaineEXE To see what they are doing about it, see project Crimson. Ceph was designed in a world of HDDs, with three orders of magnitude fewer random IOPS per device. So yes, it needs an overhaul, and they are doing that. Ceph is not the only application dealing with this.

vitobotta commented 5 years ago

An update... as suggested by @BlaineEXE I did my usual test but using the latest Mimic image instead of Nautilus. It worked just fine with two clusters and managed to finish copying the test data with a very low CPU usage. I repeated this twice with two different clusters, successfully both times. For the third test, I just updated Ceph to Nautilus on the second cluster, and surprisingly the test finished ok again. But then I created a new cluster with Nautilus from the start and boom, usual problem. Started OK until I had to forcefully reboot the server. This is a single node cluster (4 cores, 16 GB of ram) with Rook 1.0.1 on Kubernetes 1.13.5 deployed with Rancher. There's a problem somewhere, I just wish I knew where.

BlaineEXE commented 5 years ago

Is @sp98 still working on this issue? It would be great to see if there are any noticeable differences between how Rook starts up a Mimic cluster and how it starts up a Nautilus cluster, to determine whether Rook is the cause. We should also pay close attention to the behavior of ceph-volume, as the initial OSD prep using Mimic's c-v could be different from the prep using Nautilus' c-v.
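
For anyone who wants to compare the two, a rough sketch of where to look (the label and namespace below assume a default Rook 1.0 install and may need adjusting):

# ceph-volume output from the OSD prepare jobs
kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare --all-containers=true

# startup logs from one of the running OSD pods
kubectl -n rook-ceph logs <rook-ceph-osd pod name>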

sp98 commented 5 years ago

@BlaineEXE Yes, but I had to move to 2696. I will jump back to this one in a few days' time. Thanks for those updates above. I'll try that and update my findings here.

vitobotta commented 5 years ago

Just tried once again with a new cluster, again with the latest version from the start: same problem. As of now I am still unable to actually use Rook/Ceph :( It's not like there is much I could be doing wrong, because it's so easy to install, so I don't know where to look. This time the problem occurred very quickly after I started copying data into a volume.

I was wondering: could it be something related to using a volume directly, bypassing Kubernetes?

Not sure if it's helpful, but what I am trying to do is download some data from an existing server into a volume so that I can use that data with Nextcloud. Because there are timeouts etc. if I try to do it from inside a pod, this is what I do to use the volume directly:

rbd map <pvc> -p replicapool

which gives me the device name, e.g. /dev/rbd3

mkdir nextcloud-data
mount /dev/rbd3 nextcloud-data
mkdir -p nextcloud-data/_migration
cd nextcloud-data/_migration/
ssh old-server "cat /data/containers/nextcloud.tar" | tar -xvv

It starts downloading the data and then, at some random point, the load climbs very quickly up to 70-80 or more until I have to forcefully reboot the server. Since, as I said, I don't know where to look, I am really confused by this problem. I even thought it might have something to do with the fact that I am extracting the archive while downloading it (I know, it doesn't make much sense), but the problem also occurs if I just download the archive somewhere first and then extract it into the volume.

I am new to all of this, so I wish I had more knowledge of how to investigate further :( I don't have any of these issues when using Longhorn or OpenEBS, but at this point I would prefer to use Rook for the performance and because Ceph is an established solution, while the others are very new and have their own issues.

travisn commented 5 years ago

@vitobotta Can you confirm whether you have had this issue when deploying mimic (v13), or only with nautilus (v14)? If you haven't tried mimic with rook 1.0, could you try that combination? It would help confirm whether this is a nautilus issue, or whether rook 1.0 causes the issue on both mimic and nautilus.
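
For reference, the Ceph version Rook deploys is pinned in the cephVersion section of cluster.yaml, so switching the same Rook 1.0 cluster definition to mimic would look roughly like this (the image tag is the mimic build mentioned later in this thread):

spec:
  cephVersion:
    image: ceph/ceph:v13.2.5-20190410
    allowUnsupported: false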

@markhpc @bengland2 Any other secrets up your sleeve for tracking down perf issues besides what has been mentioned? Thanks!

vitobotta commented 5 years ago

Hi @travisn , I did a couple of tests with Mimic the other day and didn't have any problems with it. I just tried again with Mimic (v13.2.5-20190410) right now and all was good. Since I was always using Rook 1.0.1, it seems like it may be an issue with Nautilus? I am using Ubuntu 18.04 with 4.15.0-50-generic, if that helps somehow. Once I did a test with Fedora 29 (I think?) as suggested by @galexrt and it worked fine; I don't know if I was just lucky... perhaps I can try again to see if it happens only with Ubuntu.

vitobotta commented 5 years ago

Hi all, I have done some more tests with interesting results. By "tests" I don't mean anything scientific since I lack deeper understanding of how this stuff works. I mean the usual download of data into a volume as described earlier.

I have repeated the same test with multiple operating systems and these are the results:

I don't know enough about this stuff to jump to conclusions, but is it possible that there is a problem with Nautilus and the default Ubuntu 18.04 kernel? To exclude the possibility that it might be a problem with a customised kernel used by the provider, I have tried on Hetzner Cloud, UpCloud and DigitalOcean with the same result: the problem occurs with the default kernel but not with 5.0.0.15.

Would anyone be so kind as to try to reproduce this? Please note that, as far as I remember, I haven't seen the problem when copying small amounts of data. It always happens when I copy the 24-25 GB of data that I am trying to migrate, or sometimes when I run a benchmark on a volume with fio. Thanks a lot in advance if someone can reproduce this / look into it. :)

vitobotta commented 5 years ago

Guys... I tried again with the 5.0.0.15 kernel and it happened again :( The first test copying the data into the volume was fine, but then I did a backup with Velero followed by a restore and the system became unresponsive during the restore, as usual...

dyusupov commented 5 years ago

I'm not a Ceph expert, but I do deal with lots of storage systems, architectures, and technologies. My first recommendation would be to consult the optimal hardware recommendations to ensure that the software operates within the boundaries of the tested matrix:

http://docs.ceph.com/docs/jewel/start/hardware-recommendations/

I would not recommend just going with the minimal requirements. Double the numbers. Verify that your test environment conforms to that.

ftab commented 5 years ago

I'm running into this problem as well. It's causing my Kubernetes node to flap back and forth between NotReady and Ready; containers fail to start up, and even a web browser or the system monitor locks up. The system eventually ends up with over 1,000 processes, and I think it's also preventing my VirtualBox from starting.

Currently on a bare-metal single-node master: k8s 1.14.1, rook deployed from the release-1.0 branch with storageclass-test.yml and cluster-test.yml (except that databaseSizeMB, journalSizeMB, and osdsPerDevice were commented out).

The host is running Ubuntu 18.04.2 (currently the 4.18.0-20-generic kernel) and has 2x 10-core Xeons (20 cores, 40 threads total) with 96 GB of registered DDR4 running at 2133 MT/s, plus a 1 TB 970 EVO Plus NVMe drive. Suffice it to say, it should have plenty of CPU, RAM, and I/O speed...

edit: iostat -x 1 shows utilization going very high on the NVMe device most of the time, but almost no utilization (0-1%) on the rbd devices

vitobotta commented 5 years ago

Has anyone had a chance to try and reproduce this issue?

I would not recommend just going with the minimal requirements. Double the numbers. Verify that your test environment conforms to that.

I doubt it is a problem with the specs. First, I have tried with servers with 8 dedicated cores and 32 GB of RAM, which I think is more than sufficient for my test workloads; second, the problem appears to occur only with Ubuntu so far. I have just tested twice more with CentOS (same config and same tests as usual, apart from the OS) and everything worked fine each time. I am going to try once more.

sp98 commented 5 years ago

@vitobotta I'll try it out soon (I was stuck on another ticket, which is complete now). Just getting the environment ready for this (as my local dev environment won't be sufficient to test this out).

vitobotta commented 5 years ago

@vitobotta I'll try it out soon (I was stuck on another ticket, which is complete now). Just getting the environment ready for this (as my local dev environment won't be sufficient to test this out).

Thanks... I think I spoke too soon :( I was doing a restore with Velero/Restic on the CentOS server, and it got stuck at 15 of 24 GB with IO wait constantly at 40+. I rebooted the server and started the restore again; it's proceeding slowly, although the IO wait is almost 0. Load is ~12 now.

vitobotta commented 5 years ago

It completed the second time. I don't know what happened earlier with the iowait, perhaps it was the disk? At least I didn't have the problem as with Ubuntu - i.e. the system didn't become unresponsive. I am going to try with another CentOS server/cluster.

vitobotta commented 5 years ago

Had the same problem with high iowait again, this time with CentOS. The system doesn't become unresponsive like with Ubuntu, so I don't have to reboot it, but backups, restores, etc. basically stall because of the iowait.

Is anyone here using Rook/Ceph with clusters deployed with Rancher? Could it be something related to how Rancher deploys Kubernetes? What do you use?

billimek commented 5 years ago

@vitobotta this sounds like a situation where the client workloads and rbd (OSDs) are co-located in the same kernel space?

There is apparently a 'commonly known' issue with ceph when you run clients doing heavy IO in the same kernel space as ceph (OSD) itself. Here is some additional information:

From the ceph documentation:

DO NOT mount kernel clients directly on the same node as your Ceph Storage Cluster, because kernel conflicts can arise.

Also see this post from the ceph mailing list:

The classic case is when under some memory pressure, the kernel tries to free memory by flushing the client's page cache, but doing the flush means allocating more memory on the server, making the memory pressure worse, until the whole thing just seizes up.

If you want to do a lot of reading, you can see my frustrating progression in figuring out this issue in my local rook/ceph setup.

Assuming this is the same type of situation, I suggest separating the ceph OSD runtimes from the client (applications) leveraging the storage backed by ceph to see if this helps solve your problem.
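
For reference, one way to approximate that separation in Rook is to label a subset of nodes for storage and pin the OSDs there via the placement section of cluster.yaml; the role=storage-node label below is just an example, and the client workloads would then be scheduled onto the other nodes with their own nodeSelector/affinity or taints:

# mark the nodes that should run OSDs
kubectl label node <node-name> role=storage-node

spec:
  placement:
    osd:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values:
              - storage-node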

bengland2 commented 5 years ago

@vitobotta It is not clear to me why there would be such conflicts with Bluestore, since Bluestore is not competing for memory with kernel clients (except perhaps in the TCP stack). Perhaps the documentation referenced above is now obsolete for Bluestore; I will take a look. But look out for CGroup memory and CPU limits: you can control these for the Ceph daemons in your cluster.yaml, and kubectl describe will tell you what they are.

dotnwat commented 5 years ago

Related to #2074 ?

vitobotta commented 5 years ago

Hi @billimek, I am not sure I fully understand, but if I read you correctly then yes: both the Rook storage and the workloads were running on the same servers. How would I go about separating the OSD runtimes from the applications? My goal is to use small clusters where all nodes also run the storage.

In the meantime, since nobody has reproduced my issue, I had basically given up and have been using OpenEBS for a few weeks without any problems at all, doing exactly the same things with the same data and setup as I did with Rook. It remains a mystery to me, because when testing with my data I got either a system that became unresponsive with the load going crazy, or ridiculously high IO wait, depending on the combination of operating system and kernel. In either case I was always forced to reboot. I don't know if there's something "particular" about my data; it's just mixed stuff: documents, photos, videos, code. And like I said, no problems whatsoever with OpenEBS, so yeah, a mystery to me... I was looking to use Rook because I could get block storage, filesystem and object storage from one solution, but it looks like it doesn't like me :)

ftab commented 5 years ago

I don't understand why it's this bad. I was using an older version of rook (I believe 0.7) with the applications and OSDs happily coexisting on the same boxes.

iMartyn commented 5 years ago

I have had this issue multiple times and have rebuilt my cluster with v13 and v14 Bluestore because of it; it's beyond tolerable at this point. I just get PVCs working nicely, everything running, get to work, and then any large data transfer kills the workers.

Latest test: v13 OSDs on two worker nodes but not on the master; create a pod that mounts the CephFS on the master; rsync from the host machine to the mountpoint created by the pod, with a low bandwidth cap, and still boom while rsyncing a large file. I also confirmed the IO overload just by using WebDAV to sync my photos from my phone, so it doesn't appear to be related to file size (the average photo on my phone is 2.5 MB).

It's getting tiresome now, because every time the worker IO gets that high only a hard restart resolves it, and that upsets the operator, and I get closer to a broken cluster every time.

mykaul commented 5 years ago

@iMartyn - are you using host network with Ceph?

markhpc commented 5 years ago

Hi Folks,

The way I would approach this now is to try and get some more information about the root cause for the slowdown on an OSD showing high CPU utilization. It might be a bit harder to do this on an OSD inside a container, but here are some ideas:

1) To start out with just look at all of your osd and system logs to see if there's anything obvious like timeouts, tainted kernel, driver issues, etc.

2) Given that the problem may only manifest on certain Linux distributions, you might want to try a tool I wrote that shows the differences between multiple sysctl outputs:

https://github.com/ceph/cbt/blob/master/tools/compare_sysctl.py

The idea here is that you run something like "sudo sysctl -a > ubuntu_1804.sysctl.out" on one host and "sudo sysctl -a > centos7.sysctl.out" on another host. Then you use this tool to show the differences in the configuration settings:

~/src/cbt/tools/compare_sysctl.py centos7.sysctl.out ubuntu_1804.sysctl.out

On my Ubuntu and CentOS test boxes this showed a large number of different settings, so it may be difficult to narrow down whether one setting is causing a problem, but at least it may be a place to start.

3) You might want to try running "ceph daemon osd. dump_historic_ops" or "ceph daemon osd. dump_ops_in_flight" and see which internal operations are taking a long time. I prefer using a wallclock profiler (mentioned below), but this may work better in containers.

4) If you have iostat or collectl, it might be worth looking at the service and wait times of the devices underneath filestore/bluestore. IE this might help pinpoint if there's something going on underneath the OSD like a slow or failing device.

5) On bare metal I often use my wallclock profiler (https://github.com/markhpc/gdbpmp/) or perf against the OSD to try to get an idea of what it's spending its time doing. The idea behind the wallclock profiler is that it periodically gathers stack traces with GDB and then formats them into a callgraph. You'll need access to gdb/python and the Ceph debug binaries inside the container though, which may be difficult.

billimek commented 5 years ago

@markhpc,

Do you know if it is acceptable or proper to run ceph client workloads (e.g. pods with PVCs residing in ceph/rbd storage) in the same kernel space (same k8s node) where the ceph OSDs are also running?

Is the same true if running heavy IO workloads?

markhpc commented 5 years ago

@billimek From what I recall, the problem with doing that in the past was that in low-memory scenarios you could end up using memory to create client requests and then not have enough memory to actually perform the OSD-side writes, leading to a deadlock. BlueStore may be more resilient to this than Filestore, but it's possible it could still happen if you are very memory constrained.

I doubt that's causing these issues though.

iMartyn commented 5 years ago

@mykaul I can confirm the same happens with hostnetwork and without.

iMartyn commented 5 years ago

@markhpc it's kind of impossible to do any debugging in these scenarios. The load goes into the tens and then beyond; I can't even log in with keyboard and mouse on the physical machine. I have checked the logs and the only errors are those shown in @vitobotta's screenshot above. I'm running exactly the same distro on all machines (Debian 9). I was on a mixed Debian 9/10 setup at one point and the errors were happening on the Debian 10 worker node, but downgrading it hasn't helped. So all my machines have the same sysctl config.

I'm installing collectl now to try and gather more info.

One minor point - I note you say high CPU but that's not the reported case for any of us afaict, it's high load, low cpu.

billimek commented 5 years ago

Agreed @iMartyn, my observations were extremely high load (mostly caused by blocked IO, it seemed) but relatively low CPU utilization in comparison. I actually captured screenshots of netdata and glances during the times when it was happening.

iMartyn commented 5 years ago

Further data point: I just restarted the worker node that died due to load, and as it came back up, the load on the second worker machine (fuji) skyrocketed. I'm restarting that one too and will see what collectl shows. For info, ceph status at this point shows:

  cluster:
    id:     98a13418-a11d-4728-b5d3-d3a8afc876b0
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            1 osds down
            Reduced data availability: 105 pgs inactive
            Degraded data redundancy: 46400/199668 objects degraded (23.239%), 93 pgs degraded, 93 pgs undersized
            12 slow ops, oldest one blocked for 906 sec, osd.1 has slow ops

  services:
    mon: 1 daemons, quorum a
    mgr: a(active)
    mds: myfs-1/1/1 up  {0=myfs-b=up:replay}, 1 up:standby
    osd: 4 osds: 3 up, 4 in

  data:
    pools:   3 pools, 300 pgs
    objects: 99.83 k objects, 304 GiB
    usage:   551 GiB used, 12 TiB / 13 TiB avail
    pgs:     35.000% pgs unknown
             46400/199668 objects degraded (23.239%)
             105 unknown
             102 active+clean
             93  active+undersized+degraded

and ceph osd tree

ID CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       25.46875 root default
-5       12.73438     host fuji
 2   hdd  9.09569         osd.2       up  1.00000 1.00000
 3   hdd  3.63869         osd.3     down  1.00000 1.00000
-3       12.73438     host worker
 0   hdd  3.63869         osd.0       up  1.00000 1.00000
 1   hdd  9.09569         osd.1       up  1.00000 1.00000

It seems your commands above don't work in the toolbox though :

sh-4.2# ceph daemon osd. dump_ops_in_flight
admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

markhpc commented 5 years ago

Hi iMartyn,

One minor point - I note you say high CPU but that's not the reported case for any of us afaict, it's high load, low cpu.

I was just basing that off this line from the original issue:

I am not sure where the problem is but I am seeing very high CPU usage since I started using v1.0.0.

If it's high load with associated high iowait, then I agree we are probably more interested in what's happening on the disks.

sh-4.2# ceph daemon osd. dump_ops_in_flight

I think you are missing the id after "osd.". Try something like "osd.0"?

The collectl data should be useful. Is there any chance these nodes are hitting swap? (sorry if the answer is in the graphs/system data, I haven't gotten a chance to look closely yet)

iMartyn commented 5 years ago

Collectl raw log : https://nextcloud.martyn.berlin/s/GN7gAzK7szk59Ar - I'm not very sure how to read this, and I hope you can download it before my cluster goes down due to high load again!

iMartyn commented 5 years ago

Yeah I tried with numbers and still get

sh-4.2# ceph daemon osd.3 dump_ops_in_flight
admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

iMartyn commented 5 years ago

And no, swap is disabled on these nodes as k8s and swap are mostly incompatible (kubeadm refuses to install on swap-enabled machines)

markhpc commented 5 years ago

To get disk stats:

collectl -sD -oT -p fuji-20190725-211504.raw.gz

A couple of interesting bits. sda is showing high util with lots of large read IO that tapered off toward the end of the collection period:

collectl -sD -oT -p fuji-20190725-211504.raw.gz | grep -E "Pct|Util|sda"

#                   <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
21:16:10 sda          82652      2  167  495     4      11      0    5    2    31     481     1     5      4   81
21:16:20 sda          75601      4  189  400    53       8      0    4    2    45     390    11    53      4   88
21:16:30 sda          81272     18  176  461    17       7      0    4    2    27     449     3    17      4   85
21:16:40 sda          83425      2  168  496     4       8      0    5    2    17     482     1     5      4   81
21:16:50 sda          91644      4  186  494     4       7      0    4    2    24     482     1     5      4   88
21:17:00 sda          92236      4  186  495     4       7      0    4    2    29     483     1     5      4   89
21:17:10 sda          91548      2  183  499     4       8      0    5    2    28     485     1     5      4   89
21:17:20 sda          94050      2  189  498     4       8      0    5    2    24     485     1     5      4   90
21:17:30 sda          64800      2  132  492     4      11      0    5    2    22     472     1     5      4   65
21:17:40 sda            819      0   12   67     3      12      0    5    2     5      47     1     4      4    7
21:17:50 sda              1      0    0   16    16       8      0    4    2     5       2     1     6      6    2
21:18:00 sda              0      0    0    0     0       8      0    4    2     5       2     1     5      5    2
21:18:10 sda              0      0    0    0     0       8      0    4    2     5       2     1     5      5    2

I also noticed that sdc had a high wait-time spike toward the beginning of the collection period and then improved over time. Still, those disks are pretty busy for the last 30 seconds, with the CPUs spending a lot of time waiting on IO.

collectl -sD -oT -p fuji-20190725-211504.raw.gz | grep -E "Pct|Util|sdc"

#                   <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
21:16:10 sdc           2056      0  512    4     0    2305     47   38   61    59       7     4     4      1   58
21:16:20 sdc           4238      0   77   55   710     714     31   17   43    64      52    92   594      7   67
21:16:30 sdc           5182      0  212   24   185      50      8    4   13    47      24    49   182      2   64
21:16:40 sdc             21      0    1   15     1     423     37   15   28     7      26     3     7      2    3
21:16:50 sdc              0      0    0    0     0      10      0    2    5     0       4     1     0      0    0
21:17:00 sdc              0      0    0    0     0      11      0    2    5     0       5     1     0      0    0
21:17:10 sdc            114      0   29    4     0      31      4    4    9     1       4     1     0      0    2
21:17:20 sdc            936      0  234    4     0       8      0    2    4     2       4     1     0      0   13
21:17:30 sdc            438      0  108    4     0     538     16   18   29    12       7     2     2      1   14
21:17:40 sdc          17094      0  189   90    10      37      5    4    9    25      88     3    11      3   62
21:17:50 sdc          27376      0  276   99    11      19      1    3    7    56      98     3    12      3   98
21:18:00 sdc          27771      1  282   99    11      32      5    3   10    48      97     3    11      3   97
21:18:10 sdc          27392      0  276   99    11      30      3    3   10    49      98     3    12      3   97

sdb looked relatively idle comparatively. In terms of the ceph-osd processes:

collectl -sZ -oT -p fuji-20190725-211504.raw.gz | grep -E "Command|ceph-osd"

#Time      PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT  UsrT Pct  AccuTime  RKB  WKB MajF MinF Command
21:17:00  3544  root     20  3481   53 S    1G  397M  3  0.79  1.89   4  00:07.27 1894  567    0  866 ceph-osd 
21:17:00  3545  root     20  3480   53 S  994M  241M  2  0.75  2.43   5  00:07.55  81K    8    0  356 ceph-osd 
21:18:00  3544  root     20  3481   53 S    1G  428M  3  0.60  0.95   2  00:08.82  12K  112    0  172 ceph-osd 
21:18:00  3545  root     20  3480   53 S  994M  260M  2  0.38  1.40   2  00:09.33  40K   10    0  131 ceph-osd 

Nothing really dramatic there. Based on the RKB/WKB numbers I'd guess that 3544 is the OSD on sdc and 3545 is the OSD on sda. Curious about the low memory usage for the OSDs here though. Do you have a lower osd_memory_target than the default?
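
If it helps, one way to check that is to query the admin socket from inside the OSD container itself, since it wasn't reachable from the toolbox earlier in this thread (the pod name below is a placeholder):

# ask osd.2 for its effective osd_memory_target via its own admin socket
kubectl -n rook-ceph exec <rook-ceph-osd pod for osd.2> -- ceph daemon osd.2 config get osd_memory_target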

iMartyn commented 5 years ago

Nope, all default Rook settings. That machine isn't very well endowed memory-wise: 8 GB in total. I would expect both disks to be fairly slow (sda is a WD Red 4 TB and sdc is a 10 TB WD My Book disk). sdb is an SSD boot drive and therefore not in the Ceph cluster.