jordancaraballo opened this issue 9 months ago
The most common causes of stalls are: (1) the client process dies, or (2) shift cannot determine file system equivalence between hosts.
The client can die due to code exceptions (e.g. a missing Perl module or an unhandled bug), external interference such as reboots or signals from things like manual process kills, or problems with the Perl executable or its dependencies that cause a segfault or other fatal error. Normally, cron will restart the process, but I believe cron has been disabled for normal users on some NCCS systems (at least the ones I can access). I have an item on my todo list to also support systemd timers at some point.
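For reference, until that support lands, a user-level systemd timer pair could stand in for the disabled cron entry. This is only a sketch: the unit names and the `ExecStart` path are placeholders, not part of shift's actual interface.

```ini
# ~/.config/systemd/user/shift-monitor.service  (hypothetical unit name)
[Unit]
Description=Restart the shift client if it has died

[Service]
Type=oneshot
# Placeholder: whatever command the shift crontab entry would normally run
ExecStart=/usr/local/bin/shift-monitor

# ~/.config/systemd/user/shift-monitor.timer
[Unit]
Description=Run shift-monitor periodically, as cron would

[Timer]
# Every 5 minutes; Persistent catches runs missed while logged out
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
```

Enabled with `systemctl --user enable --now shift-monitor.timer`; on most systems you would also need `loginctl enable-linger` so user units keep running after logout.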
You can see what operations shift thinks it is doing for a given transfer id N using `shift-mgr --id=N --doing`. If it prints a list of operations, then the client has likely died in some way; if it prints only empty lists, then the cause is typically #2. If you run `shift-mgr --id=N --meta | grep pids`, you can see the list of client processes that should be running on each host. If you go to a host and don't see a process with the given pid (the last value if there is more than one comma-separated value), #1 is the issue. The client tries to report why it died as its last action, so if it is a code issue, you can run `shift-mgr --id=N --meta | grep exception` to see if it reports anything. The most common cause is a missing Perl module, since RHEL-based distros don't always include the full Perl core that shift assumes.
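The checks above can be run as a small script. A sketch, assuming a transfer id of 1 and that the `pids` line in the `--meta` output looks like `pids=12345,67890` (the exact format is an assumption based on the description above):

```shell
#!/bin/sh
# Diagnose a stalled shift transfer, following the steps above.
TID=1   # assumed transfer id; substitute your own

if command -v shift-mgr >/dev/null 2>&1; then
    shift-mgr --id=$TID --doing                  # non-empty lists => client likely died (#1)
    shift-mgr --id=$TID --meta | grep pids       # client pids expected on each host
    shift-mgr --id=$TID --meta | grep exception  # client's last words if it hit a code issue
fi

# Extract the last comma-separated pid from a `pids=...` line
last_pid() { printf '%s\n' "$1" | sed 's/.*[=,]//'; }

# Example line; in practice this comes from the --meta output above
pid=$(last_pid "pids=12345,67890")
if kill -0 "$pid" 2>/dev/null; then
    echo "client $pid is running on this host"
else
    echo "client $pid not found: likely cause #1 (client died)"
fi
```

`kill -0` sends no signal; it only tests whether a process with that pid exists, which is all the check on each host needs.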
#2 is normally only a possibility if you are running across multiple hosts. In that case, if something like the file system servers or mount points changed, it can make it impossible for shift to determine equivalence. I just fixed an issue related to this in the 8.1 release that was actually first found at NCCS. It shouldn't impact you on a single host, though, and that particular bug has been fixed AFAIK.
Anyway, check those first and we can go from there.
Hi Paul,
I am using the latest version of shift, taken from the master repo and compiled. I am trying to copy the contents of one directory to another directory within a local system. The only caveat, which I assume should not matter, is that the original data is located on a GPFS file system and is being transferred to an NFS-based file system. For context, this is being done at the NCCS.
The command I am using is as follows (the problem occurs whether I submit the job in the background or actively wait and monitor the submission). I can confirm there are no quota issues and the file system remains writable.
The first time, the data started transferring, reaching 2.21TB out of a total of 100TB. The process just stalled at that point. I have re-submitted the job several times and it has stalled each time. This is the status from the two jobs:
Any ideas of what could be going on? Any suggestions on how to debug this and find a solution?
Thanks, Jordan