hjmangalam / parsyncfp2

MultiHost parallel rsync wrapper

fpart >= 1.5 causes parsyncfp2 to hang #5

Closed jpardey closed 1 year ago

jpardey commented 1 year ago

First off, thanks for this great tool!

My first attempt to run parsyncfp2 hung. Checking the code, I noticed line 878 (https://github.com/hjmangalam/parsyncfp2/blob/e280c565ffdef236859ad18b510fa8e701a7d2f5/parsyncfp2#L878) had a note to change .0 to .1, and I can see in the fpart changelog they mention starting their output with .1. I made this change, and started the $CUR_FPI iteration at 1.

I'm pretty sure this also affects anything that's compared to $CUR_FPI. In my local version, I've changed the other side of most comparisons.
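To illustrate the off-by-one described above, here is a minimal Python sketch (parsyncfp2 itself is not written in Python, and this is not its code). It assumes, based on this issue's title and the changelog note mentioned above, that fpart 1.5 and later number their chunk files from .1 while older versions start at .0. The function names and the `f` file prefix are hypothetical.

```python
def first_chunk_index(fpart_version: str) -> int:
    """Return the index of the first chunk file for a given fpart version.

    Assumption (from this thread): fpart >= 1.5 starts at 1, older versions at 0.
    """
    parts = [int(p) for p in fpart_version.split(".")[:2]]
    while len(parts) < 2:
        parts.append(0)  # treat "2" as "2.0"
    return 1 if tuple(parts) >= (1, 5) else 0


def chunk_names(prefix: str, fpart_version: str, count: int) -> list:
    """List the chunk file names a consumer should expect to see."""
    start = first_chunk_index(fpart_version)
    # Every loop bound and comparison must use the same starting index,
    # otherwise the consumer waits forever for a file that is never written.
    return [f"{prefix}.{i}" for i in range(start, start + count)]


print(chunk_names("f", "1.4.0", 3))
print(chunk_names("f", "1.5.1", 3))
```

A consumer that still waits for `f.0` under the newer numbering never sees that file, which would be consistent with the hang reported here.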

After these changes, parsyncfp2 has been working incredibly well.

If you'd like a PR, I could put something together, but I haven't spent a long time with parsyncfp2.

hjmangalam commented 1 year ago

Hi there, thanks for using it, and apologies for not patching pfp2 faster. I should have updated it immediately once the new version of fpart emerged, especially since I asked for the change (!). I'm just about done with some major changes to pfp2 which add special handling for very large files and zillions of tiny files. I'll try to test and push the changes to GitHub this weekend. Thanks again for the note and appreciation. I'll ping you once I push it. Harry

gt3M commented 1 year ago

@jpardey Would you be willing to share a diff of your changes? I have encountered the same issue.

hjmangalam commented 1 year ago

My apologies. A new pfp2 (2.51) has been pushed, which addresses this and many other problems. Let me know what it breaks and what you don't like. Harry

gt3M commented 1 year ago

@hjmangalam Perhaps this is worth a new issue as I haven't used parsyncfp2 long enough to know if the behavior has changed. The new version is working well for me for single-host transfers and for multihost transfers for which I use a relative source path. Things go awry, though, when I try to do a multihost transfer while using --startdir. Here is a (sanitized) example:

The command:

parsyncfp2  --NP=24 --chunksize=1G --verbose=3 --nowait --commondir=/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests --hosts='hpc-cr-055=hpcc-dx01-pr,hpc-cr-056=hpcc-dx02-pr' --startdir=/homes/04/user tests POD::/homes/01/user

My expectation is that hosts hpc-cr-055 and hpc-cr-056 will transfer /homes/04/user/tests to /homes/01/user/tests via hosts hpcc-dx01-pr and hpcc-dx02-pr. If I omit --startdir and start the transfer from within /homes/04/user, it works as expected.
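The expectation above can be written down as a small Python sketch. This is only a model of the behavior I expected, not how parsyncfp2 resolves paths internally; `resolve_source` and `expected_destination` are hypothetical helper names.

```python
import posixpath


def resolve_source(startdir: str, source: str) -> str:
    """Resolve a source argument against --startdir, as I expected it to work."""
    if posixpath.isabs(source):
        return posixpath.normpath(source)
    return posixpath.normpath(posixpath.join(startdir, source))


def expected_destination(dest_root: str, source: str) -> str:
    """The source directory should land under the destination root by name."""
    return posixpath.join(dest_root, posixpath.basename(source.rstrip("/")))


print(resolve_source("/homes/04/user", "tests"))
print(expected_destination("/homes/01/user", "tests"))
```

Under this model the source handed to each send host would be /homes/04/user/tests, not the --startdir value itself.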

When using --startdir=/homes/04/user, however, commands like these are run:

hpc-fx-102 WARN: About to send this REMOTE COMMAND to SENDHOST [hpc-cr-055]
  [ssh hpc-cr-055 "export PATH=/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2:~/bin:/bin:/usr/sbin:/sbin:/usr/bin:$PATH; \
    /panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2/parsyncfp2  --date=18.05.45_2023-03-16 \
  --mstr_md5=0100e04749a11a272cb6ba73d59324ef \
  --nowait --verbose=3 --maxload=48 --slowdown=0.9514 \
  --startdir=/homes/04/user  --skipfpart --fpstart=1 --fpstride=2 \
      --verbose=3 --nowait --commondir=/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests  --startdir=/homes/04/user /homes/04/user  \
    hpcc-dx01-pr:/homes/01/user 2> /dev/null \
    |& tee -a /panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2/hpc-cr-055/pfp-log-18.05.45_2023-03-16 "]
  (also written to [/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2/hpc-cr-055/pfp-log-18.05.45_2023-03-16])
hpc-cr-055 INFO: Using [bond1] to send data and to monitor

hpc-fx-102 WARN: About to send this REMOTE COMMAND to SENDHOST [hpc-cr-056]
  [ssh hpc-cr-056 "export PATH=/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2:~/bin:/bin:/usr/sbin:/sbin:/usr/bin:$PATH; \
    /panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2/parsyncfp2  --date=18.05.45_2023-03-16 \
  --mstr_md5=0100e04749a11a272cb6ba73d59324ef \
  --nowait --verbose=3 --maxload=48 --slowdown=0.9514 \
  --startdir=/homes/04/user  --skipfpart --fpstart=2 --fpstride=2 \
      --verbose=3 --nowait --commondir=/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests  --startdir=/homes/04/user /homes/04/user  \
    hpcc-dx02-pr:/homes/01/user 2> /dev/null \
    |& tee -a /panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2/hpc-cr-056/pfp-log-18.05.45_2023-03-16 "]
  (also written to [/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2/hpc-cr-056/pfp-log-18.05.45_2023-03-16])
hpc-cr-056 INFO: Using [bond1] to send data and to monitor

I then get Killed by signal 1 from both sshes, but no indication why they failed. I can see from the commands shown, though, that --startdir is in the command twice and the source appears to be the same as the --startdir value.
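A quick way to spot repeated options in a generated command like the ones above is to tokenize it and count the long option names. The following Python sketch is a hypothetical debugging helper, not part of parsyncfp2; the command string is a shortened form of the sanitized remote command shown earlier.

```python
import shlex
from collections import Counter


def repeated_options(command: str) -> dict:
    """Return long options (--name or --name=value) that occur more than once."""
    tokens = shlex.split(command)
    # Keep only the option name, dropping any "=value" part.
    names = [t.split("=", 1)[0] for t in tokens if t.startswith("--")]
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


cmd = (
    "parsyncfp2 --nowait --verbose=3 --maxload=48 "
    "--startdir=/homes/04/user --skipfpart --fpstart=1 --fpstride=2 "
    "--verbose=3 --nowait --commondir=/panfs/ultra/other/x "
    "--startdir=/homes/04/user /homes/04/user hpcc-dx01-pr:/homes/01/user"
)
print(repeated_options(cmd))
```

Run against that command, the helper reports --nowait and --verbose as repeated in addition to --startdir, so the original options appear to be appended to the remote command a second time.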

Looking at the code, I wonder if something is off in this block that sets sdpath: https://github.com/hjmangalam/parsyncfp2/blob/369b2cad1cce3ad5c876a2590108258d0b578ab6/parsyncfp2#L994-L1021

hjmangalam commented 1 year ago

Hi Gabe,

Yes, you're right - I was messing around in that block trying to tighten it up and seem to have messed up something - there are some bizarre string outputs. I'm surprised, since I've been using that to do some testing on other systems and it should NOT have worked, but apparently does in some cases. I'll try to put it right by tomorrow, with some other fixes as well. Thanks very much for the note. Harry

gt3M commented 1 year ago

@hjmangalam The ssh disconnection was my fault (I was trying to wrap script -c in a shell script), but the processing of startdir and the source directory still seems to be an issue. I noticed one more oddity: if the source directory has a hyphen in its name, it is prepended to the POD::/destination argument with an =. E.g. I was trying to transfer the directory conda-sf-env, and in the remote command run on the send host it became conda-sf-env=POD::/destination.

Thanks for all your work providing this great utility!

gt3M commented 1 year ago

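Changing the match on this line (https://github.com/hjmangalam/parsyncfp2/blob/369b2cad1cce3ad5c876a2590108258d0b578ab6/parsyncfp2#L1000) to /^-/ resolved the =POD:: issue for my use case, though it is not sufficient to handle a (perhaps absurd) case in which a source directory begins with - :)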
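The difference between the two matches can be sketched in Python. The original pattern is not quoted in this thread, so the unanchored version below is an assumption about what it did (match a hyphen anywhere in the argument); only the anchored /^-/ form comes from the comment above. Both function names are hypothetical.

```python
import re


def looks_like_option_unanchored(arg: str) -> bool:
    """Assumed original behavior: any hyphen makes the argument look like an option."""
    return re.search(r"-", arg) is not None


def looks_like_option_anchored(arg: str) -> bool:
    """Behavior after the change to /^-/: only a leading hyphen counts."""
    return re.search(r"^-", arg) is not None


for arg in ["--verbose=3", "conda-sf-env", "-odd-dir"]:
    print(arg, looks_like_option_unanchored(arg), looks_like_option_anchored(arg))
```

With the anchored match, conda-sf-env is no longer treated as an option, while a directory named -odd-dir still is, which is the remaining edge case mentioned above.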

hjmangalam commented 1 year ago

Thanks. Looking at it now. Also found another edge case bug when using different users on different hosts. And another test that needed to be done to prevent user confusion about non-existent dirs.. It never ends.. I should have it committed by end of day. harry

hjmangalam commented 1 year ago

pushed. Thanks again. Let me know what fails now..

harry

gt3M commented 1 year ago

@hjmangalam Will test as soon as I can. I see you also mention a fix for --ro. I was planning to open a new issue because I have been trying to use rsync's --exclude-from option and could not get parsyncfp2 to take a space-delimited list of rsync options 😆 I will give that a try with the new version. Thanks!

hjmangalam commented 1 year ago

Yeah, when I was checking the SEND host command-line generation, I caught that problem. Double-quoting and then RE-double-quoting internally (to pass on) seems to be the only way to get it to work. Please let me know if you find any exceptions or another way to do it. Harry
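For comparison, here is how the same two-level quoting problem can be handled in Python with `shlex.quote`. This is a sketch of the general technique, not what pfp2 does internally (pfp2 uses the double-quoting described above). It assumes `--ro` takes the rsync option string as a single value; the host name, paths, and the `build_remote_command` helper are hypothetical.

```python
import shlex


def build_remote_command(host: str, rsync_opts: str, src: str, dest: str) -> str:
    """Quote once for the remote shell, then again for the local shell."""
    inner = ["parsyncfp2", f"--ro={rsync_opts}", src, dest]
    # First level: each argument survives the remote shell as one word.
    inner_str = " ".join(shlex.quote(a) for a in inner)
    # Second level: the whole remote command survives the local shell as one word.
    return f"ssh {shlex.quote(host)} {shlex.quote(inner_str)}"


cmd = build_remote_command(
    "sendhost", "-a --exclude-from=/tmp/excludes.txt", "tests", "POD::/dest"
)
print(cmd)

# Round trip: split as the local shell would, then as the remote shell would.
local_words = shlex.split(cmd)
remote_words = shlex.split(local_words[2])
print(remote_words)
```

The round trip at the end checks that the space-delimited rsync options arrive on the remote side as one argument.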
