jpardey closed this issue 1 year ago.
Hi there, Thanks for using it, and apologies for not patching pfp2 faster. I should have updated it immediately once the new version of fpart emerged, especially since I asked for the change (!). I'm just about done with some major changes to pfp2 which add special handling for very large files and for zillions of tiny files. I'll try to test and push the changes to GitHub this weekend. Thanks again for the note and appreciation. I'll ping you once I push it. Harry
On Thu, Jan 26, 2023 at 6:05 PM jpardey wrote:
First off, thanks for this great tool!
My first attempt to run parsyncfp2 hung. Checking the code, I noticed line 878 https://github.com/hjmangalam/parsyncfp2/blob/e280c565ffdef236859ad18b510fa8e701a7d2f5/parsyncfp2#L878 had a note to change .0 to .1, and I can see in the fpart changelog they mention starting their output with .1. I made this change, and started the $CUR_FPI iteration at 1.
I'm pretty sure this also affects anything that's compared to $CUR_FPI. In my local version, I've changed the other side of most comparisons.
After these changes, parsyncfp2 has been working incredibly well.
If you'd like a PR, I could put something together, but I haven't spent a long time with parsyncfp2.
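A minimal sketch of why the off-by-one can hang pfp2, assuming (as the fpart changelog mentioned above describes) that chunk files are named <prefix>.1, <prefix>.2, ... rather than starting at .0. The function count_chunks and the prefix f are hypothetical, not pfp2's code: a scan that starts at .0 finds nothing, and a loop that waits for a .0 file to appear would presumably wait forever, which matches the hang reported here.

```sh
#!/bin/sh
# Hypothetical sketch, not pfp2's code: count fpart-style chunk files named
# <prefix>.N, starting the scan at index START. Assumes newer fpart numbers
# its parts from 1 (<prefix>.1, <prefix>.2, ...), as reported in this issue.
set -eu

# count_chunks DIR PREFIX START
# Checks <prefix>.START, <prefix>.START+1, ... and stops at the first gap.
count_chunks() {
  dir=$1 prefix=$2 idx=$3 n=0
  while [ -e "$dir/$prefix.$idx" ]; do
    n=$((n + 1))
    idx=$((idx + 1))
  done
  echo "$n"
}

tmp=$(mktemp -d)
touch "$tmp/f.1" "$tmp/f.2" "$tmp/f.3"          # what a 1-based fpart run leaves behind
echo "start at 0: $(count_chunks "$tmp" f 0)"   # 0: there is no f.0 to find
echo "start at 1: $(count_chunks "$tmp" f 1)"   # 3
rm -rf "$tmp"
```

Starting the index at 1 is only half of the change; as noted above, anything compared against the counter has to shift by one as well.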
--
Harry Mangalam
@jpardey Would you be willing to share a diff of your changes? I have encountered the same issue.
My apologies. New pfp2 (2.51) pushed, which addresses this and many other problems. Let me know what it breaks and what you don't like. Harry
@hjmangalam Perhaps this is worth a new issue, as I haven't used parsyncfp2 long enough to know if the behavior has changed. The new version is working well for me for single-host transfers and for multihost transfers for which I use a relative source path. Things go awry, though, when I try to do a multihost transfer while using --startdir. Here is a (sanitized) example:
The command:
parsyncfp2 --NP=24 --chunksize=1G --verbose=3 --nowait --commondir=/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests --hosts='hpc-cr-055=hpcc-dx01-pr,hpc-cr-056=hpcc-dx02-pr' --startdir=/homes/04/user tests POD::/homes/01/user
My expectation is that hosts hpc-cr-055 and hpc-cr-056 will transfer /homes/04/user/tests to /homes/01/user/tests via hosts hpcc-dx01-pr and hpcc-dx02-pr. If I omit --startdir and start the transfer from within /homes/04/user, it works as expected.
When I use --startdir=/homes/04/user, however, commands like these are run:
hpc-fx-102 WARN: About to send this REMOTE COMMAND to SENDHOST [hpc-cr-055]
[ssh hpc-cr-055 "export PATH=/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2:~/bin:/bin:/usr/sbin:/sbin:/usr/bin:$PATH; \
/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2/parsyncfp2 --date=18.05.45_2023-03-16 \
--mstr_md5=0100e04749a11a272cb6ba73d59324ef \
--nowait --verbose=3 --maxload=48 --slowdown=0.9514 \
--startdir=/homes/04/user --skipfpart --fpstart=1 --fpstride=2 \
--verbose=3 --nowait --commondir=/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests --startdir=/homes/04/user /homes/04/user \
hpcc-dx01-pr:/homes/01/user 2> /dev/null \
|& tee -a /panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2/hpc-cr-055/pfp-log-18.05.45_2023-03-16 "]
(also written to [/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2/hpc-cr-055/pfp-log-18.05.45_2023-03-16])
hpc-cr-055 INFO: Using [bond1] to send data and to monitor
hpc-fx-102 WARN: About to send this REMOTE COMMAND to SENDHOST [hpc-cr-056]
[ssh hpc-cr-056 "export PATH=/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2:~/bin:/bin:/usr/sbin:/sbin:/usr/bin:$PATH; \
/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2/parsyncfp2 --date=18.05.45_2023-03-16 \
--mstr_md5=0100e04749a11a272cb6ba73d59324ef \
--nowait --verbose=3 --maxload=48 --slowdown=0.9514 \
--startdir=/homes/04/user --skipfpart --fpstart=2 --fpstride=2 \
--verbose=3 --nowait --commondir=/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests --startdir=/homes/04/user /homes/04/user \
hpcc-dx02-pr:/homes/01/user 2> /dev/null \
|& tee -a /panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2/hpc-cr-056/pfp-log-18.05.45_2023-03-16 "]
(also written to [/panfs/ultra/other/hpcc-data-migration/_homes_04_user_tests/.pfp2/hpc-cr-056/pfp-log-18.05.45_2023-03-16])
hpc-cr-056 INFO: Using [bond1] to send data and to monitor
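A hypothetical sketch (not pfp2's code) that reproduces the per-host values in the two remote commands above from the --hosts string: each SEND=RECV pair gets --fpstart set to its 1-based position and --fpstride set to the number of pairs, and POD:: in the target is replaced by RECV:. Reading fpstart/fpstride as "this host handles every fpstride-th fpart chunk, starting at chunk fpstart" is an assumption drawn from the option names; plan_hosts is an illustrative name.

```sh
#!/bin/sh
# Hypothetical sketch: derive per-SEND-host arguments from a --hosts list of
# SEND=RECV pairs, matching the values shown in the log above.
set -eu

plan_hosts() {
  target=$2
  old_ifs=$IFS
  IFS=,          # split the host list on commas only
  set -f         # no globbing while splitting
  set -- $1      # positional params become the SEND=RECV pairs
  set +f
  IFS=$old_ifs
  i=0
  for pair in "$@"; do
    i=$((i + 1))
    # SEND host, chunk offset, stride (= number of pairs), RECV:path
    printf '%s --fpstart=%d --fpstride=%d %s:%s\n' \
      "${pair%%=*}" "$i" "$#" "${pair#*=}" "${target#POD::}"
  done
}

plan_hosts 'hpc-cr-055=hpcc-dx01-pr,hpc-cr-056=hpcc-dx02-pr' POD::/homes/01/user
# hpc-cr-055 --fpstart=1 --fpstride=2 hpcc-dx01-pr:/homes/01/user
# hpc-cr-056 --fpstart=2 --fpstride=2 hpcc-dx02-pr:/homes/01/user
```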
I then get "Killed by signal 1" from both ssh sessions, but no indication of why they failed. I can see from the commands shown, though, that --startdir appears in the command twice and the source appears to be the same as the --startdir value.
Looking at the code, I wonder if something is off in this block that sets sdpath: https://github.com/hjmangalam/parsyncfp2/blob/369b2cad1cce3ad5c876a2590108258d0b578ab6/parsyncfp2#L994-L1021
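For comparison with the remote commands above, which pass /homes/04/user both as --startdir and as the source (the "tests" part is gone), here is a minimal sketch of the source path expected in this report: the relative source resolved against --startdir. join_source is a hypothetical helper, not pfp2's sdpath logic.

```sh
#!/bin/sh
# Hypothetical helper, not pfp2's sdpath code: the source path a SEND host
# would hand to rsync if a relative source is resolved against --startdir.
set -eu

# join_source STARTDIR SRC  -> STARTDIR/SRC, with one trailing '/' trimmed
join_source() {
  printf '%s/%s\n' "${1%/}" "$2"
}

join_source /homes/04/user tests    # /homes/04/user/tests
join_source /homes/04/user/ tests   # /homes/04/user/tests
```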
Hi Gabe,
Yes, you're right - I was messing around in that block trying to tighten it up and seem to have messed up something - there are some bizarre string outputs. I'm surprised, since I've been using that to do some testing on other systems and it should NOT have worked, but apparently does in some cases. I'll try to put it right by tomorrow, with some other fixes as well. Thanks very much for the note. Harry
@hjmangalam The ssh disconnection was my fault (I was trying to wrap script -c in a shell script), but the processing of startdir and the source directory still seems to be an issue. I noticed one more oddity: if the source directory has a hyphen in its name, it gets prepended to the POD::/destination argument with an =. E.g. I was trying to transfer the directory conda-sf-env, and in the remote command run on the send host it became conda-cf-env=POD::/destination.
Thanks for all your work providing this great utility!
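Changing the match on this line (https://github.com/hjmangalam/parsyncfp2/blob/369b2cad1cce3ad5c876a2590108258d0b578ab6/parsyncfp2#L1000) to /^-/ resolved the =POD:: issue for my use case, though it is not sufficient to handle a (perhaps absurd) case in which a source directory begins with - :)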
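To illustrate the fix, a hypothetical argument classifier (not pfp2's parser) that treats a word as an option only when it starts with '-', which is what the anchored /^-/ match expresses; a hyphen elsewhere, as in conda-sf-env, no longer counts. For a directory that itself begins with '-', the sketch borrows two common command-line conventions, a literal -- that ends option parsing and a ./ prefix. Whether pfp2 accepts either is an assumption.

```sh
#!/bin/sh
# Hypothetical sketch, not pfp2's parser: label each argument as an option or
# a path. Only words that START with '-' are options (the /^-/ idea), so a
# hyphen inside a name such as conda-sf-env does not make it an option.
# After a literal "--", every remaining word is a path.
set -eu

classify() {
  only_paths=0
  for a in "$@"; do
    if [ "$only_paths" -eq 1 ]; then
      printf 'path:%s\n' "$a"
      continue
    fi
    case $a in
      --) only_paths=1 ;;                    # end of options, print nothing
      -*) printf 'opt:%s\n' "$a" ;;
      *)  printf 'path:%s\n' "$a" ;;
    esac
  done
}

classify --NP=24 conda-sf-env POD::/destination
# opt:--NP=24
# path:conda-sf-env
# path:POD::/destination

classify --NP=24 -- -odd-dir ./-odd-dir POD::/destination
# opt:--NP=24
# path:-odd-dir
# path:./-odd-dir
# path:POD::/destination
```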
Thanks. Looking at it now. Also found another edge-case bug when using different users on different hosts, and another test that needed to be done to prevent user confusion about non-existent dirs. It never ends... I should have it committed by end of day. harry
Pushed. Thanks again. Let me know what fails now.
harry
@hjmangalam Will test as soon as I can. I see you also mention a fix for --ro. I was planning to open a new issue because I have been trying to use rsync's --exclude-from option and could not get parsyncfp2 to take a space-delimited list of rsync options 😆 I will give that a try with the new version. Thanks!
Yeah, when I was checking the SEND host command-line generation, I caught that problem. Double-quoting and then RE-double-quoting internally (to pass on) seems to be the only way to get it to work. Please let me know if you find any exceptions or another way to do it. Harry
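A minimal sketch of the double-quote-then-re-double-quote approach, with sh -c standing in for the shell that ssh starts on the SEND host. The option string and the --ro=... word are illustrative assumptions, not pfp2's generated command. The escaped inner quotes keep the option list together as one word; in this simple form it would still break if the options themselves contained double quotes, dollar signs, backticks or backslashes.

```sh
#!/bin/sh
set -eu

ro='--exclude-from=/tmp/excl.txt --no-owner'   # hypothetical rsync options

# Outer double quotes build the command string; the escaped inner \"...\"
# survive into it, so the second shell sees the option list as ONE word.
sh -c "printf '[%s]\n' \"--ro=$ro\""
# [--ro=--exclude-from=/tmp/excl.txt --no-owner]

# Quoted only once, the second shell word-splits the list.
sh -c "printf '[%s]\n' --ro=$ro"
# [--ro=--exclude-from=/tmp/excl.txt]
# [--no-owner]
```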