martymac / fpart

Sort files and pack them into partitions
https://www.fpart.org/
BSD 2-Clause "Simplified" License

Problem running fpsync with GNU parallel or in background from the shell, the process will go in STOP state #43

Closed liuk001 closed 1 year ago

liuk001 commented 1 year ago

Hello, I'm using fpart/fpsync for a large dataset migration, and I find it a very efficient tool. To optimize the final delta migration, I'd like to run several fpsync instances in parallel, and I'm trying to use GNU parallel for that. I'm stuck on a problem, though: fpsync seems to go into the stopped state when it creates the first rsync child process. Even when I run fpsync from the shell with "&", the process gets STOPPED after a few seconds. If I run several concurrent fpsync instances from tmux there are no problems at all, but using GNU parallel would be the best way to achieve my goal. I'm running the latest fpart version, 1.5.1, compiled from source on Ubuntu 20.04. Thanks in advance for any hint. Best regards, Luca

liuk001 commented 1 year ago

Hello, just digging further I found that with fpart 1.4.0 the issue is not present. How to reproduce on 1.5.1:

the rsync job is created but is NOT executed and the whole process will hang.

I will try to compare fpsync 1.4.0 with 1.5.1 later.

Thanks, Luca

martymac commented 1 year ago

Hello Luca,

Thanks for your interest in fpsync.

Unfortunately, I cannot reproduce the hangs on FreeBSD (even using bash) with the GNU parallel call you gave. There are no hangs either when fpsync is simply executed in the background this way:

$ fpsync /tmp/src/test/ /tmp/dst/test/ &

I'll try to reproduce the hangs on Ubuntu and get back to you. Shells differ (so does job control handling), it may be related to that.

Anyway, by parallelizing fpsync, you kind of short-circuit the tool, as it has been designed to parallelize the transfers on its own. Your ls command will probably produce unbalanced "data partitions" (it returns small and big directories alike), which is far from efficient. Fpsync tries to produce balanced partitions and replicates them in parallel (or serially with -n 1) by itself.

Also, be careful: your initial 'ls -1' command will output both directories and files, while fpsync only takes directories as arguments, so files in the root directory will be skipped. You will also skip dot directories and dot files in the root directory, as they are not returned by your command.
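As an illustration of the point about 'ls -1', here is a minimal sketch of a listing that returns only the top-level directories, dot directories included. The function name list_top_dirs is hypothetical, and -mindepth/-maxdepth are extensions available in GNU and FreeBSD find rather than POSIX options. Files sitting directly in the root directory are still not covered by this approach and would need a separate pass.

```shell
# Hypothetical helper: print every directory located directly under "$1",
# including dot directories, and excluding regular files and "$1" itself.
# Output is newline-separated, so directory names containing newlines are
# not handled by this sketch.
list_top_dirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d
}

# Possible use (paths are placeholders taken from the thread):
#   list_top_dirs /path/source
```

The resulting list could then replace the output of 'ls -1' as input to GNU parallel, keeping in mind the remark above that fpsync already balances and parallelizes transfers on its own.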

If the aim is to parallelize the final synchronization pass, please have a look at option -E. It may suit your needs and will probably be more efficient than trying to parallelize fpsync itself :)

Best regards,

Ganael.

liuk001 commented 1 year ago

Hello Ganael! Thank you for your kind and quick feedback! I forgot to specify that in my case the source folder contains ONLY folders, so I'm safe with that parallel invocation. Something like: /path/source/a /path/source/b .... /path/source/z

The reason for using parallel is that, in the final phase of the migration, only a small set of files has been added or deleted. I'm already using the "-E" option, but the rsync jobs are very quick to run and the crawler cannot keep up with them, so the elapsed time is too long for my requirements. So I decided to use GNU parallel to crawl the top-level folders (a, b, ... z) in parallel and minimize the time for the final sync.

I took a look at the differences between fpsync 1.4 and 1.5.1 and didn't find anything useful so far; this behaviour is really strange. BTW, my bash version is 5.0.17(1)-release on Ubuntu 20.04.

I'll keep you posted on my findings. Best regards, Luca

martymac commented 1 year ago

Hello Luca,

FYI, I have tried to reproduce the problem on Debian 11 (I have no access to an Ubuntu 20.04 machine) but I couldn't. Fpsync finished the copy and did not hang, both with GNU Parallel and when started in background.

Have you found anything on your side?

liuk001 commented 1 year ago

Hello Ganael! I've used version 1.4.0 for the ongoing migrations. In the meantime, I've also reproduced the issue on Ubuntu 22.04 (my laptop). It's really strange! No ideas at the moment. I'll try to investigate further in the next few days. Thanks, Luca

martymac commented 1 year ago

Hello,

I could reproduce the problem on Ubuntu 22.04. If you run fpsync in the background, like this (for example):

$ fpsync -f 10000 /usr/src/ /var/tmp/dst/ &

it is immediately stopped and you can resume the whole process by typing the fg command.

This only seems to happen with dash (when /bin/sh -> dash). I do not get the same behaviour with bash or with FreeBSD's /bin/sh.

I think the process group is getting a SIGTTIN (21) signal which immediately stops it, but this is not easy to diagnose. I'll try to dig further into that after the holidays.

Meanwhile, you can use bash instead of dash, either by changing fpsync's shebang or by reconfiguring the /bin/sh link to point to bash instead of dash:

$ sudo dpkg-reconfigure dash
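For the shebang route, a minimal sketch follows. The helper name use_bash_shebang is hypothetical, and it assumes that the script's first line is exactly "#!/bin/sh" and that bash is installed as /bin/bash; neither is confirmed in this thread. It writes a patched copy rather than editing the installed script in place.

```shell
# To see what /bin/sh currently points to, one can run: ls -l /bin/sh

# Hypothetical helper: copy script "$1" to "$2", replacing a first line that
# is exactly "#!/bin/sh" with "#!/bin/bash". Any other first line is left
# unchanged. The copy is made executable.
use_bash_shebang() {
    sed '1s|^#!/bin/sh$|#!/bin/bash|' "$1" > "$2" && chmod +x "$2"
}

# Possible use (the destination name is a placeholder):
#   use_bash_shebang "$(command -v fpsync)" ./fpsync-bash
```

Working on a copy leaves the packaged fpsync untouched, so the change is easy to revert and is not overwritten silently by a reinstall.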

Merry Christmas and happy new year!

Ganael.

liuk001 commented 1 year ago

Hello Ganael, great findings, thank you! Really strange behavior from dash. Actually, I would prefer to change fpsync's shebang, just to be sure that it uses bash in a deterministic way.

Merry Christmas and happy new year!

Luca

martymac commented 1 year ago

Hello Luca,

Well, here is what I have found so far.

I have been able to reproduce the bug with dash 0.5.12 on FreeBSD (13.1-RELEASE, amd64) too.

It is triggered when switching on monitor mode (set -m), exactly here:

https://github.com/martymac/fpart/blob/fpart-1.5.1/tools/fpsync#L1354

Dash immediately sends a SIGTTIN signal that stops the whole process group.

This is reproducible with the following test:

$ cat test.sh
set -m
$ dash test.sh &
[1]  + suspended (tty input)  dash test.sh

Other shells do not have the same behaviour:

$ bash test.sh &
[1]  + done       bash test.sh
$ sh test.sh &
[1]  + done       sh test.sh
$ ksh93 test.sh &
[1]  + done       ksh93 test.sh

A simple way to fix that is to ignore SIGTTIN globally from the main fpsync process. Not doing so is a bug in fpsync, and this should have been done already, but if I apply that simple fix, dash remains stuck using 100% CPU!

Digging into dash code, it is stuck in the following loop:

https://git.kernel.org/pub/scm/utils/dash/dash.git/tree/src/jobs.c?h=v0.5.12#n208

This is easily reproducible this way:

$ cat test2.sh
trap '' 21
set -m
$ dash test2.sh &

We may have hit a dash bug here, I'll report it and let you know.

This means I could probably fix your bug by ignoring SIGTTIN from within fpsync (I'll do it anyway, as it seems reasonable) but this will not fix your problem with dash (yet).
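Ignoring SIGTTIN from within a shell script can be sketched as below. This is only an illustration, not the actual fpsync change, and the function name ignore_ttin is hypothetical. It uses the symbolic name TTIN rather than the number 21 from test2.sh, since signal numbers are not identical on every platform.

```shell
# Hypothetical helper (not taken from fpsync): make the current shell ignore
# SIGTTIN, so that receiving it no longer stops the process.
# Per the findings in this thread, dash 0.5.12 then spins at 100% CPU if
# "set -m" is enabled afterwards while running in the background, so this
# alone does not help with dash.
ignore_ttin() {
    trap '' TTIN
}
```

Ignored signals stay ignored in child processes started by the shell, which is worth keeping in mind when applying this at the top of a long-running script.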

Meanwhile and as already explained, you can always change the script's shebang to use another shell.

Cheers,

Ganael.

liuk001 commented 1 year ago

Hi Ganael, great findings, thanks for the update! Changing the script's shebang is surely the best solution IMHO. Cheers, Luca

martymac commented 1 year ago

FYI, I've started a discussion here: https://lore.kernel.org/dash/7091680.J8PY2HnTC3@home.martymac.org/T/#u

martymac commented 1 year ago

Hello Luca,

I presume I can't do much more to fix the problem on fpsync side (let's wait for an answer on dash ML).

I'll close that PR for the moment. Feel free to re-open it if necessary.

Best regards,

Ganael.