rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

MDS behind on trimming every 4-5 weeks causing issues for Ceph filesystem #14220

Open akash123-eng opened 2 weeks ago

akash123-eng commented 2 weeks ago

Hi,

We are using rook-ceph with operator 1.10.8 and Ceph 17.2.5. The Ceph filesystem is served by 4 MDS daemons (2 active and 2 standby). Every 3-4 weeks the filesystem runs into trouble, and ceph status shows the following warnings:

2 MDS reports slow requests 
2 MDS Behind on Trimming
mds.myfs-a(mds.1) : behind on trimming (6378/128) max_segments:128, num_segments: 6378
mds.myfs-c(mds.1):  behind on trimming (6560/128) max_segments:128, num_segments: 6560

To fix it, we have to restart all the MDS pods one by one. This is happening every 4-5 weeks.
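
For reference, the restart is along the lines of the sketch below (assuming the default Rook MDS deployment names, rook-ceph-mds-myfs-a through -d, in the rook-ceph namespace; deleting the pods directly works the same way):

# restart one MDS deployment and wait for it to come back before moving on
kubectl -n rook-ceph rollout restart deployment rook-ceph-mds-myfs-a
kubectl -n rook-ceph rollout status deployment rook-ceph-mds-myfs-a
# then repeat for rook-ceph-mds-myfs-b, -c and -d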

We have seen many related issues on the Ceph tracker, and many people suggest increasing mds_cache_memory_limit. For our cluster, mds_cache_memory_limit is set to the default 4GB and mds_log_max_segments is set to the default 128. Should we increase mds_cache_memory_limit from the default 4GB to 8GB, or is there another solution that fixes this issue permanently?
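
If increasing the limit is the way to go, is something like the sketch below the right way to apply it? (Assuming the Rook toolbox is deployed as rook-ceph-tools; the 8 GiB value is just the proposed new limit, expressed in bytes.)

# run from the Rook toolbox, e.g.:
# kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

# check the current values (defaults: 4 GiB and 128)
ceph config get mds mds_cache_memory_limit
ceph config get mds mds_log_max_segments

# proposed change: raise the MDS cache limit from 4 GiB to 8 GiB (value in bytes)
ceph config set mds mds_cache_memory_limit 8589934592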

Environment: Kubernetes

akash123-eng commented 2 weeks ago

@Rakshith-R @Madhu-1 can you please help with the above?

Rakshith-R commented 2 weeks ago

Neither Madhu nor I am familiar with such core Ceph problems.

You should reach out on the Ceph Slack or their mailing list for core Ceph issues: https://ceph.io/en/community/connect/