moosefs / moosefs

MooseFS Distributed Storage – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System / Software-Defined Storage
https://moosefs.com
GNU General Public License v2.0

[BUG] Excessive CRC calculations thrashing mfsmount client process #594

Open anon314159 opened 3 days ago

anon314159 commented 3 days ago

Have you read through available documentation, open Github issues and Github Q&A Discussions? Yes

System information: Various (3 independent air-gapped clusters exhibiting the same behavior/performance)

Your moosefs version and its origin (moosefs.com, packaged by distro, built from source, ...). MooseFS version: 3.0.118 (RPM distribution)

Operating system (distribution) and kernel version. OS: Red Hat Enterprise Linux 8.9 (Ootpa) Kernel: 4.18.0-513.9.1.el8_9.x86_64 Samba: 4.19.3

Hardware / network configuration, and underlying filesystems on master, chunkservers, and clients. Various, but all chunk/master servers have 40GbE interconnects and 40/10GbE clients.

Benchmarking data transfer rates between all servers in the cluster using iperf3 reveals no anomalies (averaging 34 Gbit/s with minimal retransmissions).
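A run of the kind summarized above might look like the sketch below. It is a hedged illustration, not the reporter's exact invocation: the port number is arbitrary, the loopback target stands in for a real chunkserver hostname, and the script skips cleanly where iperf3 is not installed.

```shell
#!/bin/sh
# Throughput sanity check with iperf3. Against a real cluster you would
# run "iperf3 -s" on the chunkserver and point -c at its hostname; here
# loopback is used so the sketch is self-contained.
command -v iperf3 >/dev/null 2>&1 || { echo "iperf3 not installed, skipping"; exit 0; }

iperf3 -s -D -1 -p 5599                 # one-shot server, daemonized in the background
sleep 1
iperf3 -c 127.0.0.1 -p 5599 -P 4 -t 2   # 4 parallel streams for 2 seconds
```

The `-P 4` parallel streams matter on 40 GbE: a single TCP stream often cannot saturate the link on its own.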

Local storage subsystem (XFS) for each chunkserver also performs within expectations, averaging 3 GB/s sequential read and 2.8 GB/s sequential write.
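A minimal dd-based spot check in the same spirit (fio gives more rigorous numbers; the file path and 256 MiB size below are placeholders, not the reporter's benchmark):

```shell
#!/bin/sh
# Rough sequential write then read against a local filesystem. On a
# chunkserver you would point this at the XFS-backed storage path;
# /tmp is only a placeholder. Note the read-back is likely served from
# the page cache unless caches are dropped first (root only):
#   echo 3 > /proc/sys/vm/drop_caches
dd if=/dev/zero of=/tmp/seqtest.bin bs=1M count=256 conv=fsync 2>&1 | tail -n 1
dd if=/tmp/seqtest.bin of=/dev/null bs=1M 2>&1 | tail -n 1
rm -f /tmp/seqtest.bin
```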

Alternative distributed file systems such as GlusterFS, Ceph, and BeeGFS behave normally and offer near line-speed throughput when re-exported as NFS-Ganesha (VFS) shares on the exact same hardware (I have a non-production test environment).

hdparm and smartctl show all drives operating within normal performance parameters with no detected medium errors.

How much data is tracked by the MooseFS master (order of magnitude)? All fs objects: varies. Total space: varies. Free space: varies. RAM used: varies. Last metadata save duration: varies (less than 4 seconds across all clusters).

[screenshot: mfs_issue]

Describe the problem you observed. Excessively high CPU time spent recalculating CRCs for uncached chunks in the mfsmount process. The net byproduct is degraded read performance when used in conjunction with NFS-Ganesha (VFS) re-exports: benchmarks reveal a near 10-fold decrease in sequential and random reads compared to the native FUSE client. The developers reportedly support re-exporting MFS to NFS via Ganesha (VFS), but in practice the read speeds are too low for any real-world production use.
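To put a rough number on the CRC side of this: MooseFS stores data in 64 MiB chunks, each protected by CRC32 checksums. The sketch below times repeated POSIX `cksum` passes (used only as a stand-in for mfsmount's internal mycrc32) over one chunk-sized file, giving an order-of-magnitude feel for the CPU cost of a single chunk verification:

```shell
#!/bin/sh
# Measure CRC passes over a 64 MiB file (the MooseFS chunk size).
# cksum's CRC is only a stand-in for mfsmount's mycrc32; the point is
# the order of magnitude of CPU work per chunk, not the exact cost.
dd if=/dev/zero of=/tmp/chunk.bin bs=1M count=64 status=none
t0=$(date +%s)
i=0
while [ "$i" -lt 8 ]; do
    cksum /tmp/chunk.bin > /dev/null
    i=$((i + 1))
done
t1=$(date +%s)
echo "8 CRC passes over one 64 MiB chunk took $((t1 - t0)) s"
rm -f /tmp/chunk.bin
```

Multiplying the per-pass time by the number of chunks read per second gives a ballpark to compare against the CPU time mfsmount is actually burning.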

Example of sequential read tests:

Cluster-A (average of 10 runs, 8 GB test file, MooseFS FUSE):

```
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/net_shares/mfs-fuse/test-file of=/dev/null bs=1M count=8192 status=progress
8589934592 bytes (8.6 GiB, 8.0 GiB) copied, 12.2585 s, 701 MB/s
```

Cluster-A (average of 10 runs, 8 GB test file, MooseFS FUSE re-exported via NFS-Ganesha):

```
mount -t nfs nfs-server:/mfs-test /mnt/net_shares/nfs/ -o proto=tcp,sec=sys,vers=4
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/net_shares/nfs/test-file of=/dev/null bs=1M count=8192 status=progress
8589934592 bytes (8.6 GiB, 8.0 GiB) copied, 120.2585 s, 70.1 MB/s
```

Include any warning/errors/backtraces from the system logs. `strace -T -fff -p <pid of test NFS-Ganesha client>` reveals an excessive number of futex and epoll API calls. `strace -T -fff -p <pid of mfsmount>` reveals an excessive number of epoll API calls.

[screenshot: cluster_layout]

chogata commented 4 hours ago

I'm afraid I don't understand your setup. You have shown two mounts: a native MooseFS mount and an NFS mount that serves data from a MooseFS-mounted share. Your tests show the second one is much slower. Yes, that's possible (I would even say probable), but that's because it's NFS. The underlying MooseFS works exactly the same; it's NFS that slows things down. But the screenshot you pasted shows calls to functions that are not MooseFS functions, except possibly mycrc32 (we have one with that name). The rest are C++ functions, and there is not an ounce of C++ code in MooseFS ;) So what exactly is this sfsmount process that you are tracing?

In general: we support Ganesha, in that re-exporting a MooseFS share via Ganesha works and doesn't throw errors, unlike the same setup with most built-in (kernel) NFS servers. We never said it's the best solution for re-export, or that it's fast :) Why are you using NFS at all? Maybe another, faster solution would work for your needs? Also, why don't you test MooseFS 4? It has many algorithmic improvements over MooseFS 3.