Closed fredrikwidlund closed 9 years ago
Hi,
I can't reproduce this on my 4-chunkserver setup (master is a separate VM on another host).
Cheers
Alex
@fredrikwidlund did you see any errors from the client, chunkserver, or master in your syslog?
Tested latest git version now with master on separate host.
No errors on master/chunkserver. Client repeatedly reports problems connecting to chunkserver.
mfsmount[12171]: can't connect to (****:9422): EADDRNOTAVAIL (Cannot assign requested address)
(i.e. client to chunkserver, running on the same host)
Running both master and chunkserver in foreground mode doesn't show anything suspicious.
It starts off writing OK, then gradually slows down to a trickle that "never" completes. Writes are then done in bursts with long pauses in between. It now recovers, however, if you stop the writes, and doesn't hang.
The same setup works without problems in MooseFS 3.0.34, where the chunkserver logs this additional information:
mfschunkserver[1841]: workers: 10+
mfschunkserver[1841]: workers: 20+
Have you tried to analyze network traffic? EADDRNOTAVAIL happens when you cannot bind to an address (e.g. system is out of free port numbers), which is rather unusual.
mfsmount can be run in the foreground as well; maybe it would provide some more debug information.
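One rough way to check the port-exhaustion theory is to compare the size of the ephemeral port range with the number of client sockets pointing at the chunkserver port. The sketch below assumes Linux with `ss` from iproute2 installed; the helper names `port_range_size` and `count_states_for_port` are made up for illustration, and port 9422 is taken from the log line above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of LizardFS.

# Number of ephemeral ports in a "low high" range string, as stored in
# /proc/sys/net/ipv4/ip_local_port_range.
port_range_size() {
    set -- $1               # split "low high" into $1 and $2
    echo $(( $2 - $1 + 1 ))
}

# Count TCP sockets per state whose peer port equals $1.
# Expects `ss -tan` output on stdin (State Recv-Q Send-Q Local Peer).
count_states_for_port() {
    awk -v port="$1" 'NR > 1 && $5 ~ (":" port "$") { n[$1]++ }
        END { for (s in n) print s, n[s] }' | sort
}

# Only touch the live system when the needed pieces are present.
if [ -r /proc/sys/net/ipv4/ip_local_port_range ]; then
    echo "ephemeral ports: $(port_range_size "$(cat /proc/sys/net/ipv4/ip_local_port_range)")"
fi
if command -v ss >/dev/null 2>&1; then
    ss -tan | count_states_for_port 9422
fi
```

If the TIME-WAIT count for port 9422 climbs toward the size of the ephemeral range while the parallel writes run, that would be consistent with the client running out of local ports.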
Also, I assume that you run your installation on RAID, which would probably need some configuration adjustments to be optimal. Patches that optimize this use case (RAID) are under development. Can you try to test this issue on a regular disk? Maybe it is RAID + bad config that generates the bottleneck.
Client and chunkserver are running on the same server, and normally have no problems with connectivity. For example, writing files sequentially doesn't result in any issues. The warnings start appearing when doing multiple operations in parallel.
Underlying filesystem is 32 drives running XFS in a JBOD config.
The main part of this issue is not read performance, but that parallel writes break.
Can you verify if the problem still occurs when your client is not on the same machine as the chunkserver? It would help to find the right direction of debugging. You could also check if changing values of /sys/fs/fuse/connections/XXX/max_background helps, default max value of fuse threads is 12.
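To see the current values before changing anything, something like the following could be used. It is only a sketch: `show_fuse_limits` is a made-up helper name, it assumes the fusectl filesystem is mounted at /sys/fs/fuse/connections, and reading the files usually requires being root or the owner of the mount.

```shell
#!/bin/sh
# Hypothetical helper: print max_background and congestion_threshold for
# every FUSE connection found under the given directory
# (default: /sys/fs/fuse/connections).
show_fuse_limits() {
    dir="${1:-/sys/fs/fuse/connections}"
    for conn in "$dir"/*/; do
        # Skip unreadable entries and the unexpanded glob when nothing matches.
        [ -r "${conn}max_background" ] || continue
        echo "${conn%/}: max_background=$(cat "${conn}max_background") congestion_threshold=$(cat "${conn}congestion_threshold" 2>/dev/null)"
    done
}

show_fuse_limits

# To raise the limit for one connection (as root; XXX is the connection id,
# and 64 is just an example value):
#   echo 64 > /sys/fs/fuse/connections/XXX/max_background
```

The change applies to the running mount only and is lost when the filesystem is remounted.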
I'm afraid the amount of time I have to look further into this is limited right now. If you have tried and can't reproduce it feel free to close the issue on my behalf.
I'm closing the issue since I was unable to reproduce it at all. If you find more time to deal with it, please reopen.
Just curious. Did you try a clean CentOS 7 installation, default LizardFS on same host, and the "dd" write commands above?
Yes, I've tried a fresh 2.6.0 on a clean CentOS 7, and it didn't hang. I haven't used an underlying filesystem with 32 drives, though.
If you don't mind, we can return to this topic after patches that allow flexible chunkserver configuration get published and merged (it will happen in the near future). I think that tuning the parameters of network and hdd usage could solve your problems.
CentOS 7/LizardFS 2.6.0
E5-2609/32GB RAM
LSI 9271-8i/32x4TB NL SAS
mfsmaster/mfschunkserver on the same node for testing only
default configuration with the devices using xfs mapped in mfshdd.cfg
Just a couple of parallel seq writes will hang LizardFS.
dd if=/dev/zero of=/mnt/a bs=1048576 count=10000 &
dd if=/dev/zero of=/mnt/b bs=1048576 count=10000 &
dd if=/dev/zero of=/mnt/c bs=1048576 count=10000 &
dd if=/dev/zero of=/mnt/d bs=1048576 count=10000 &
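For anyone trying to reproduce this with a different number of writers or a smaller size, the same test can be wrapped in a small function. This is a sketch only: `parallel_writes` and the `w0`, `w1`, ... file names are made up for illustration and are not from the original report.

```shell
#!/bin/sh
# Hypothetical helper: start $2 parallel sequential writers in directory $1,
# each writing $3 blocks of $4 bytes (default 1048576), then wait for all
# of them to finish.
parallel_writes() {
    dir=$1; writers=$2; count=$3; bs=${4:-1048576}
    i=0
    while [ "$i" -lt "$writers" ]; do
        dd if=/dev/zero of="$dir/w$i" bs="$bs" count="$count" 2>/dev/null &
        i=$((i + 1))
    done
    wait    # returns once every background writer has exited
}

# Roughly equivalent to the four dd commands above
# (left commented out because it writes about 40 GB):
#   parallel_writes /mnt 4 10000
```

Running it under `time` gives a quick way to compare total duration between one writer and several.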