markovmodel / adaptivemd

A python framework to run adaptive Markov state model (MSM) simulation on HPC resources
GNU Lesser General Public License v2.1

MSM analysis worker #29

Closed thempel closed 7 years ago

thempel commented 7 years ago

I just ran into a case where an analysis failed because of the DB file limits mentioned in #26, and the worker then crashed (walltime). After restarting everything, the worker printed

Queued a task [PythonTask] from generator `pyemma`
task did not complete

which apparently was caused by

print task.stderr.message
ln: failed to create symbolic link 'input.json': File exists
ln: failed to create symbolic link '_run_.py': File exists
ln: failed to create symbolic link 'input.pdb': File exists
ln: failed to create symbolic link '_file__rpc_output_748c032e-3ff1-4391-845b-a01cdea3796d.json': File exists

jhprinz commented 7 years ago

Ah yes, makes sense. This happened:

The folder used when you retry after a failed worker is the same as the old one, and if files are left over in it, some commands might fail. Since I use set -e now, the script should fail as it did. Although if all four of these messages appear, the script evidently did not stop at the first one, so they are probably only warnings. Not sure.
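
The name collision itself can be reproduced outside adaptivemd. The following is a minimal sketch, assuming a POSIX system and only the Python standard library: `os.symlink` stands in for the `ln -s` calls of the task script, and `stage_links` is a hypothetical helper, not part of adaptivemd.

```python
import os
import tempfile


def stage_links(task_dir, sources):
    """Link each source file into task_dir (hypothetical stand-in for `ln -s`)."""
    for src in sources:
        os.symlink(src, os.path.join(task_dir, os.path.basename(src)))


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "input.json")
    open(src, "w").close()
    task_dir = os.path.join(root, "task")
    os.mkdir(task_dir)

    stage_links(task_dir, [src])        # first attempt: link is created
    try:
        stage_links(task_dir, [src])    # retry in the same, uncleaned directory
        retry_failed = False
    except FileExistsError:
        retry_failed = True             # same cause as "File exists" from ln
```

Whether such an error stops the real task depends on how the script reacts to a non-zero exit status, which is the open question above.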

Solution

We need to clean up after failed workers, as we should anyway. When a worker creates the task directory, it should clean it before continuing.
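
One way to do the cleaning step, as a rough sketch rather than the actual adaptivemd implementation: remove whatever is at the task path and recreate it empty. The function name `prepare_task_dir` and the directory layout are assumptions made for illustration.

```python
import os
import shutil
import tempfile


def prepare_task_dir(path):
    """Return an empty directory at `path`, discarding leftovers of earlier attempts."""
    if os.path.islink(path) or os.path.isfile(path):
        os.remove(path)                 # a stray file or link occupies the path
    elif os.path.isdir(path):
        shutil.rmtree(path)             # leftovers from a failed attempt
    os.makedirs(path)
    return path


# Usage: a directory with a stale file is empty again after preparation.
demo_root = tempfile.mkdtemp()
demo_task = os.path.join(demo_root, "task")
os.makedirs(demo_task)
open(os.path.join(demo_task, "input.json"), "w").close()
prepare_task_dir(demo_task)
leftovers = os.listdir(demo_task)
shutil.rmtree(demo_root)
```

Deleting and recreating the whole directory is simpler than removing individual files, but it also throws away any partial output that might help debug the failed attempt.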

Also, we can add an option to erase the directories of successful tasks, to clean up unnecessary folders.
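
Such an option could look roughly like this. It is a sketch under assumptions: `run_in_task_dir` and its `remove_on_success` flag are hypothetical names, and the task is modelled as a plain Python callable rather than an adaptivemd task object.

```python
import os
import shutil
import tempfile


def run_in_task_dir(task_dir, action, remove_on_success=False):
    """Run action(task_dir); delete the directory only if the action did not raise."""
    result = action(task_dir)           # an exception leaves the directory in place
    if remove_on_success:
        shutil.rmtree(task_dir, ignore_errors=True)
    return result


def write_output(task_dir):
    """Toy task: write one file and return its name."""
    with open(os.path.join(task_dir, "output.json"), "w") as fh:
        fh.write("{}")
    return "output.json"


# Usage: the directory of a successful task is removed when the option is set.
demo_dir = tempfile.mkdtemp()
produced = run_in_task_dir(demo_dir, write_output, remove_on_success=True)
removed = not os.path.exists(demo_dir)
```

Directories of failed tasks are kept in this sketch so that their contents can still be inspected.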