radical-cybertools / ExTASY

MDEnsemble
Other
1 stars 1 forks source link

Example runtime exceeds walltime #77

Closed oleweidner closed 9 years ago

oleweidner commented 9 years ago

On Archer, still in the first cycle, the pilot finishes before all CUs are done. We have three options:

(1) Increase runtime (2) Reduce tasks / step (3) Both

Also, we need to understand if the very slow processing of tasks is determined by GROMACS/LSDMap or if there is something wron with the script or RADICAL Pilot (i.e., no concurrent execution of tasks). I don't have a good intuition here. Do people have some data how a single GROMACS and a single LSDMap step takes with the test data that we are using?

andre-merzky commented 9 years ago

Re pilot performance: Ole, can you do a run and leave the session records in place, and send me the session ID?

Thanks, Andre.

oleweidner commented 9 years ago

Does this one work: 54341e680941de73976c1830 ?

ibethune commented 9 years ago

Hi Ole I was also looking into this yesterday independently. I left a gromacs/lsdmap workflow running overnight with a queue limit of 6 hours and it didn't complete (although it looks as though something went wrong about an hour after the jobs started). Unfortunately I was running without verbose logging and don't even have the output since I was running inside screen and don't have a scrollback buffer. I will try again and see if I can capture the full debug output to understand what happened.

Looking at the 'md.log' files in the unit-* directories it seems that each actual gromacs calculation takes less than a second:

Log file opened on Tue Oct 7 21:48:09 2014 Host: nid02291 pid: 14735 rank ID: 0 number of ranks: 1 GROMACS: gmx mdrun, VERSION 5.0

Core t (s) Wall t (s) (%) Time: 0.029 0.662 4.4 (ns/day) (hour/ns) Performance: 130.650 0.184 Finished mdrun on rank 0 Tue Oct 7 21:48:10 2014 So not sure what it is taking all the time doing!
oleweidner commented 9 years ago

This came up in our discussion: https://github.com/radical-cybertools/radical.pilot/issues/410

andre-merzky commented 9 years ago

guys,

you can set RADICAL_PILOT_BENCHMARK=1 for your runs, and leave the session around with 'session.close(delete=False)'. When you send us the session ID, we can look closer at the performance statistics and core utilization afterwards.

I am not sure if those settings need to be made on user end or on the extasy tooling and in this context though, apologies if this is off target here...

The run reference by Ole (54341e680941de73976c1830) has no core utilization (I forgot to ask Ole to set the benchmark env), but otherwise I see:

I attach a plot. This might be somewhat cryptic, apologies, but Ole should be able to interpret it.

Best, Andre.

On Wed, Oct 8, 2014 at 10:27 AM, ibethune notifications@github.com wrote:

Hi Ole I was also looking into this yesterday independently. I left a gromacs/lsdmap workflow running overnight with a queue limit of 6 hours and it didn't complete (although it looks as though something went wrong about an hour after the jobs started). Unfortunately I was running without verbose logging and don't even have the output since I was running inside screen and don't have a scrollback buffer. I will try again and see if I can capture the full debug output to understand what happened.

Looking at the 'md.log' files in the unit-* directories it seems that each actual gromacs calculation takes less than a second:

Log file opened on Tue Oct 7 21:48:09 2014 Host: nid02291 pid: 14735 rank ID: 0 number of ranks: 1 GROMACS: gmx mdrun, VERSION 5.0

Core t (s) Wall t (s) (%) Time: 0.029 0.662 4.4 (ns/day) (hour/ns) Performance: 130.650 0.184 Finished mdrun on rank 0 Tue Oct 7 21:48:10 2014

So not sure what it is taking all the time doing!

— Reply to this email directly or view it on GitHub https://github.com/radical-cybertools/ExTASY/issues/77#issuecomment-58325697 .

It was a sad and disappointing day when I discovered that my Universal Remote Control did not, in fact, control the Universe. (Not even remotely.)

andre-merzky commented 9 years ago

PS: RADICAL_PILOT_BENCHMARK=1 imposes a fairly small, but real runtime overhead -- please don't use for production.

On Wed, Oct 8, 2014 at 10:42 AM, Andre Merzky andre@merzky.net wrote:

guys,

you can set RADICAL_PILOT_BENCHMARK=1 for your runs, and leave the session around with 'session.close(delete=False)'. When you send us the session ID, we can look closer at the performance statistics and core utilization afterwards.

I am not sure if those settings need to be made on user end or on the extasy tooling and in this context though, apologies if this is off target here...

The run reference by Ole (54341e680941de73976c1830) has no core utilization (I forgot to ask Ole to set the benchmark env), but otherwise I see:

  • units are started rather quickly, no problem there
  • pilot utilization seems close enough to 100%
  • unit runtimes seems to be equally distributed in a certain range, is that correct?
  • the pilot shuts down correctly when the unit wait queue becomes empty.

I attach a plot. This might be somewhat cryptic, apologies, but Ole should be able to interpret it.

Best, Andre.

On Wed, Oct 8, 2014 at 10:27 AM, ibethune notifications@github.com wrote:

Hi Ole I was also looking into this yesterday independently. I left a gromacs/lsdmap workflow running overnight with a queue limit of 6 hours and it didn't complete (although it looks as though something went wrong about an hour after the jobs started). Unfortunately I was running without verbose logging and don't even have the output since I was running inside screen and don't have a scrollback buffer. I will try again and see if I can capture the full debug output to understand what happened.

Looking at the 'md.log' files in the unit-* directories it seems that each actual gromacs calculation takes less than a second:

Log file opened on Tue Oct 7 21:48:09 2014 Host: nid02291 pid: 14735 rank ID: 0 number of ranks: 1 GROMACS: gmx mdrun, VERSION 5.0

Core t (s) Wall t (s) (%) Time: 0.029 0.662 4.4 (ns/day) (hour/ns) Performance: 130.650 0.184 Finished mdrun on rank 0 Tue Oct 7 21:48:10 2014

So not sure what it is taking all the time doing!

— Reply to this email directly or view it on GitHub https://github.com/radical-cybertools/ExTASY/issues/77#issuecomment-58325697 .

It was a sad and disappointing day when I discovered that my Universal Remote Control did not, in fact, control the Universe. (Not even remotely.)

It was a sad and disappointing day when I discovered that my Universal Remote Control did not, in fact, control the Universe. (Not even remotely.)

oleweidner commented 9 years ago

I have added an EXTASY_DEBUG environment variable - if set, the session is retained: https://github.com/radical-cybertools/ExTASY/blob/devel/src/radical/ensemblemd/extasy/bin/runme.py#L145

ibethune commented 9 years ago

OK, lots of output, so uploading it externally

Stdout from extasy tool: http://pastebin.com/xTDk7TVW Contents of CWD : http://pastebin.com/WXDFAvPd (note out[0|1|2|3|5|7].gro are missing) Contents of pilot sandbox directory: http://www2.epcc.ed.ac.uk/~ibethune/files/pilot-5434fbdcd7bf75bb29c104a5.tar.gz (sorry 58 MB file!)

Again the indication from the md.log file in the unit* directories is that each gromacs run takes less than a second. I'll leave you to have a look and see if you can figure out what is happening!

I also note that the run fails eventually since it can't find 'out3.gro' locally. This appears to be due to an SFTP failure earlier in the job. It looks something like the error Elena reported in https://github.com/radical-cybertools/ExTASY/issues/56 - any ideas?

jp43 commented 9 years ago

Hi, I agree, the GROMACS/LSDMap example should finish rapidly (a few minutes at worst). It might not be related to this problem but whenever I try to run GROMACS/LSDMap on stampede, the MD simulations get stuck in the StagingOutput phase forever. Looking at the units folders, the MD simulations seem to finish correctly the way it does in your case, that is, the STDERR files are empty and the STDOUT look like this:

TACC: Starting up job 4247696 TACC: Setting up parallel environment for MVAPICH2+mpispawn. TACC: Starting parallel tasks...

TACC: Shutdown complete. Exiting.

Setting RADICAL_PILOT_VERBOSE to DEBUG doesn't provide any additional information about this though.

FYI, here is the messages I got on the localhost

Session UID: 543577de71fc6f20e85779c1 Pilot UID : 543577e071fc6f20e85779c3 Loading kernel configurations from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/radical.ensemblemd.mdkernels-0.1-py2.7.egg/radical/ensemblemd/mdkernels/configs/amber.json Loading kernel configurations from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/radical.ensemblemd.mdkernels-0.1-py2.7.egg/radical/ensemblemd/mdkernels/configs/coco.json Loading kernel configurations from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/radical.ensemblemd.mdkernels-0.1-py2.7.egg/radical/ensemblemd/mdkernels/configs/gromacs.json Loading kernel configurations from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/radical.ensemblemd.mdkernels-0.1-py2.7.egg/radical/ensemblemd/mdkernels/configs/lsdmap.json Loading kernel configurations from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/radical.ensemblemd.mdkernels-0.1-py2.7.egg/radical/ensemblemd/mdkernels/configs/mmpbsa.json Loading kernel configurations from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/radical.ensemblemd.mdkernels-0.1-py2.7.egg/radical/ensemblemd/mdkernels/configs/namd.json Loading kernel configurations from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/radical.ensemblemd.mdkernels-0.1-py2.7.egg/radical/ensemblemd/mdkernels/configs/sleep.json Loading kernel configurations from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/radical.ensemblemd.mdkernels-0.1-py2.7.egg/radical/ensemblemd/mdkernels/configs/test.json [Callback]: ComputePilot '543577e071fc6f20e85779c3' state changed to Launching. Prepare grofiles.. Preprocessing time : 0.97299695015 Cycle 0 Starting Simulation [Callback]: ComputeUnit '543577e371fc6f20e85779cb' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779c9' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779d0' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779c7' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779c6' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779cc' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779d1' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779d2' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779d3' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779ca' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779cf' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779c8' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779cd' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779d4' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779ce' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779c5' state changed to PendingInputStaging. [Callback]: ComputeUnit '543577e371fc6f20e85779cb' state changed to StagingInput. [Callback]: ComputeUnit '543577e371fc6f20e85779c9' state changed to StagingInput. [Callback]: ComputeUnit '543577e371fc6f20e85779d0' state changed to StagingInput. [Callback]: ComputeUnit '543577e371fc6f20e85779cb' state changed to PendingExecution. [Callback]: ComputeUnit '543577e371fc6f20e85779c7' state changed to StagingInput. [Callback]: ComputePilot '543577e071fc6f20e85779c3' state changed to PendingActive. [Callback]: ComputeUnit '543577e371fc6f20e85779c6' state changed to StagingInput. [Callback]: ComputeUnit '543577e371fc6f20e85779c9' state changed to PendingExecution. [Callback]: ComputeUnit '543577e371fc6f20e85779cc' state changed to StagingInput. [Callback]: ComputeUnit '543577e371fc6f20e85779d1' state changed to StagingInput. [Callback]: ComputeUnit '543577e371fc6f20e85779d0' state changed to PendingExecution. [Callback]: ComputeUnit '543577e371fc6f20e85779d2' state changed to StagingInput. [Callback]: ComputeUnit '543577e371fc6f20e85779d3' state changed to StagingInput. [Callback]: ComputeUnit '543577e371fc6f20e85779c7' state changed to PendingExecution. [Callback]: ComputeUnit '543577e371fc6f20e85779ca' state changed to StagingInput. [Callback]: ComputeUnit '543577e371fc6f20e85779cf' state changed to StagingInput. [Callback]: ComputeUnit '543577e371fc6f20e85779c6' state changed to PendingExecution. [Callback]: ComputeUnit '543577e371fc6f20e85779c8' state changed to StagingInput. [Callback]: ComputePilot '543577e071fc6f20e85779c3' state changed to Active. [Callback]: ComputeUnit '543577e371fc6f20e85779cb' state changed to Scheduling. [Callback]: ComputeUnit '543577e371fc6f20e85779c9' state changed to Scheduling. [Callback]: ComputeUnit '543577e371fc6f20e85779cb' state changed to Executing. [Callback]: ComputeUnit '543577e371fc6f20e85779c9' state changed to Executing. [Callback]: ComputeUnit '543577e371fc6f20e85779c7' state changed to Scheduling. [Callback]: ComputeUnit '543577e371fc6f20e85779c7' state changed to Executing. [Callback]: ComputeUnit '543577e371fc6f20e85779c6' state changed to Scheduling. [Callback]: ComputeUnit '543577e371fc6f20e85779c6' state changed to Executing. [Callback]: ComputeUnit '543577e371fc6f20e85779d0' state changed to Scheduling. [Callback]: ComputeUnit '543577e371fc6f20e85779d0' state changed to Executing. [Callback]: ComputeUnit '543577e371fc6f20e85779d0' state changed to StagingOutput. [Callback]: ComputeUnit '543577e371fc6f20e85779c6' state changed to StagingOutput. [Callback]: ComputeUnit '543577e371fc6f20e85779c7' state changed to StagingOutput. [Callback]: ComputeUnit '543577e371fc6f20e85779cb' state changed to StagingOutput. [Callback]: ComputeUnit '543577e371fc6f20e85779c9' state changed to StagingOutput.

then it gets stuck

oleweidner commented 9 years ago

All - what is the status on this ticket?

oleweidner commented 9 years ago

This was fixed