radical-cybertools / ExTASY

MDEnsemble

pilot timeout #213

Closed peterkasson closed 8 years ago

peterkasson commented 8 years ago

Running the gromacs lsdmap tutorial with multiple iterations appears to hit a pilot timeout; nothing restarts, and the script hangs.

vivek-bala commented 8 years ago

Could you give me the following details please:

1) Which machine is this on?
2) How many iterations? How many tasks per iteration (numCUs in the Kconfig file)?
3) What is the assigned walltime?

andre-merzky commented 8 years ago

I think different people in the room had that problem, in different setups. In Peter's case, the walltime was set to 20 min, targeting archer. I am not sure about the other settings, but the simulation did not finish before it got into hanging...

vivek-bala commented 8 years ago

I think this is to be expected on archer with gromacs simulations. Increasing the iterations or tasks increases the required time quite drastically on archer (the preconfigured values are 8 tasks and 1 iteration, which takes 13-15 mins). As Iain encountered, 24 CUs with 1 iteration could take up to an hour on archer. I think this will improve with ORTE.

Was this encountered on Stampede as well? I would expect the required time to grow in proportion to the workload on Stampede.

vivek-bala commented 8 years ago

> but the simulation did not finish before it got into hanging...

As in, no proper shutdown (and/or verbose shutdown messages) once the walltime was hit?

andre-merzky commented 8 years ago

The problem is indeed not so much that the pilot timed out, but that there was no reaction to it, and EnMD just kept waiting, or so it seemed...

peterkasson commented 8 years ago

FWIW when I boosted the timeout 10x, the job finished nicely.

peterkasson commented 8 years ago

(but as Andre said, it's mostly a question of how the system handles timeouts)

vivek-bala commented 8 years ago

https://github.com/radical-cybertools/radical.ensemblemd/blob/feature/unit_reporting/src/radical/ensemblemd/single_cluster_environment.py#L121-L127

Check for failed + cancelled + done states.
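For readers following along, here is a minimal sketch of the idea, not the actual EnMD code: a wait loop that returns once every unit is in a final state, and also returns when the pilot itself reaches a final state (for example after hitting its walltime), so the script does not hang. The state names, `wait_for_units`, and the two callback parameters are all hypothetical and are not taken from radical.pilot or radical.ensemblemd.

```python
import time

# Assumed state names for illustration only; the real library may differ.
FINAL_STATES = {"DONE", "FAILED", "CANCELLED"}


def wait_for_units(get_unit_states, get_pilot_state,
                   poll_interval=1.0, sleep=time.sleep):
    """Poll until all units are final, or until the pilot itself is final.

    get_unit_states: hypothetical callable returning a list of unit states.
    get_pilot_state: hypothetical callable returning the pilot state.
    Returns a (reason, unit_states) tuple.
    """
    while True:
        unit_states = get_unit_states()
        # Normal exit: every unit finished, failed, or was cancelled.
        if all(state in FINAL_STATES for state in unit_states):
            return "units_finished", unit_states

        # Without this check, a pilot that timed out leaves the units
        # stuck in a non-final state and the loop would never return.
        pilot_state = get_pilot_state()
        if pilot_state in FINAL_STATES:
            return "pilot_" + pilot_state.lower(), unit_states

        sleep(poll_interval)
```

A caller could inspect the returned reason and shut down cleanly (or resubmit) when it starts with `pilot_`, instead of waiting indefinitely.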

vivek-bala commented 8 years ago

done in enmd devel.