jordancaraballo opened this issue 9 months ago
The most common causes of stalls are: (1) the client process dies, or (2) shift cannot determine file system equivalence between hosts.
The client can die due to code exceptions (e.g. a missing Perl module or an unhandled bug), external interference such as reboots or signals from things like manual process kills, or problems with the Perl executable or its dependencies that cause a segfault or other fatal error. Normally, cron will restart the process, but I believe cron has been disabled for normal users on some NCCS systems (at least the ones I can access). I have an item on my todo list to also support systemd timers at some point.
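For reference, until that support lands, a user-level systemd timer pair could stand in for the disabled cron entry. This is only a sketch: the unit names and the `ExecStart` path are placeholders, not part of shift's actual interface.

```ini
# ~/.config/systemd/user/shift-monitor.service  (hypothetical unit name)
[Unit]
Description=Restart the shift client if it has died

[Service]
Type=oneshot
# Placeholder: whatever command the shift crontab entry would normally run
ExecStart=/usr/local/bin/shift-monitor

# ~/.config/systemd/user/shift-monitor.timer
[Unit]
Description=Run shift-monitor periodically, as cron would

[Timer]
# Every 5 minutes; Persistent catches runs missed while logged out
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
```

Enabled with `systemctl --user enable --now shift-monitor.timer`; on most systems you would also need `loginctl enable-linger` so user units keep running after logout.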
You can see what operations shift thinks it is doing for a given transfer id N using `shift-mgr --id=N --doing`. If it prints a list of operations, then the client has likely died in some way; if it prints only empty lists, then the cause is typically #2. If you run `shift-mgr --id=N --meta | grep pids`, you can see the list of client processes that should be running on each host. If you go to a host and don't see a process with the given pid (the last value if there is more than one comma-separated value), #1 is the issue. The client tries to report why it died as its last action, so if it is a code issue, you can run `shift-mgr --id=N --meta | grep exception` to see if it reports anything. The most common cause is a missing Perl module, since RHEL-based distros don't always include the full Perl core that shift assumes.
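The checks above can be run as a small script. A sketch, assuming a transfer id of 1 and that the `pids` line in the `--meta` output looks like `pids=12345,67890` (the exact format is an assumption based on the description above):

```shell
#!/bin/sh
# Diagnose a stalled shift transfer, following the steps above.
TID=1   # assumed transfer id; substitute your own

if command -v shift-mgr >/dev/null 2>&1; then
    shift-mgr --id=$TID --doing                  # non-empty lists => client likely died (#1)
    shift-mgr --id=$TID --meta | grep pids       # client pids expected on each host
    shift-mgr --id=$TID --meta | grep exception  # client's last words if it hit a code issue
fi

# Extract the last comma-separated pid from a `pids=...` line
last_pid() { printf '%s\n' "$1" | sed 's/.*[=,]//'; }

# Example line; in practice this comes from the --meta output above
pid=$(last_pid "pids=12345,67890")
if kill -0 "$pid" 2>/dev/null; then
    echo "client $pid is running on this host"
else
    echo "client $pid not found: likely cause #1 (client died)"
fi
```

`kill -0` sends no signal; it only tests whether a process with that pid exists, which is all the check on each host needs.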
#2 is normally only a possibility if you are running across multiple hosts. In that case, if something like the file system servers or mount points changed, it can make it impossible for shift to determine equivalence. I just fixed an issue related to this in the 8.1 release that was actually first found at NCCS. It shouldn't impact you on a single host, though, and that particular bug has been fixed AFAIK.
Anyway, check those first and we can go from there.
Hi Paul,
I am using the latest version of shift, taken from the master repo and compiled. I am trying to copy the contents of one directory to another directory within a local system. The only caveat, which I assume should not matter, is that the original data is located on a GPFS file system and is being transferred to an NFS-based file system. For context, this is being done at the NCCS.
The command I am using is as follows (the problem occurs whether I submit the job in the background or actively wait and monitor the submission). I can confirm there are no quota issues and the file system remains writable.
The first time, the data started transferring, reaching 2.21TB out of a total of 100TB. The process just stalled at that point. I have re-submitted the job several times and it has stalled each time. This is the status from the two jobs:
Any ideas of what could be going on? Any suggestions on how to debug this and find a solution?
Thanks, Jordan