Closed peterkasson closed 8 years ago
Could you give me the following details please:
1) Which machine is this on?
2) How many iterations? How many tasks per iteration (numCUs in the Kconfig file)?
3) What is the assigned walltime?
I think different people in the room had that problem, in different setups. In Peter's case, the walltime was set to 20 min, targeting archer. I am not sure about the other settings, but the simulation did not finish before it got into hanging...
I think this is to be expected on archer with gromacs simulations. Increasing the iterations or tasks would increase the time required quite drastically on archer (the preconfigured values are 8 tasks, 1 iteration, which takes 13-15 mins). As Iain encountered, 24 CUs with 1 iteration could take up to an hour on archer. I think this will improve with ORTE.
Was this encountered on Stampede as well? I would expect the required time to grow in proportion to the workload on Stampede.
> but the simulation did not finish before it got into hanging...

As in no proper shutdown (and/or verbose shutdown msgs) once the walltime was hit?
The problem is indeed not so much that the pilot timed out, but that there was no reaction to it, and EnMD just kept waiting, or so it seemed...
FWIW when I boosted the timeout 10x, the job finished nicely.
(but as Andre said, it's mostly a question of how the system handles timeouts)
Check for failed + cancelled + done states.
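To illustrate what checking for all final states could look like, here is a minimal sketch of a wait loop that returns on any terminal state instead of waiting only for success. Everything in it is an assumption for illustration: `wait_for_final`, the `get_state` callback and the state names are hypothetical and are not the actual EnMD or pilot API.

```python
import time

# Hypothetical state names; the real pilot API may spell them differently.
FINAL_STATES = {"DONE", "FAILED", "CANCELED"}


def wait_for_final(get_state, poll_interval=1.0, timeout=None,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it reports a final state.

    get_state is a caller-supplied function returning the current state
    as a string. Raises TimeoutError if `timeout` seconds pass without
    reaching a final state, so the caller never waits forever.
    """
    start = clock()
    while True:
        state = get_state()
        # Stop on any terminal state, not only on success.
        if state in FINAL_STATES:
            return state
        if timeout is not None and clock() - start >= timeout:
            raise TimeoutError(f"still in state {state!r} after {timeout} s")
        sleep(poll_interval)
```

A caller would then branch on the returned state, for example reporting an error and shutting down when it is not `"DONE"`, rather than continuing to wait on a pilot that has already gone away.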
Done in enmd devel.
Running the gromacs lsdmap tutorial with multiple iterations appears to hit a pilot timeout, and nothing restarts -> the script hangs.