radical-cybertools / ExTASY

COCO/Amber on ARCHER - hangs in Cycle 1 MD #155

Closed ibethune closed 9 years ago

ibethune commented 9 years ago

As already discussed with Vivek via Skype, this is a re-creation of the problem I identified on Friday, now using the latest devel branch (ExTASY version: 0.1.3.1-beta-15-gf2a2457).

All files and the host logs are in /work/e290/e290/shared/iain/recreate on ARCHER.

In short, during the simulation stage of Cycle 1, the first 8 CUs (the minimisations) are submitted and run correctly. Subsequently only 4 MD CUs are submitted; they complete OK, and then the job hangs until the wallclock timer runs out, killing the pilot.

ibethune commented 9 years ago

On further examination, I see that the cause of the hang is that one of the 8 minimisation CUs is stuck in the StagingOutput state until the wallclock runs out:

2015:03:30 11:29:57 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '551925a4d7bf757f1dd7508a' state changed from 'New' to 'PendingInputStaging'.
...
2015:03:30 11:29:57 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '551925a4d7bf757f1dd7508a' state changed from 'PendingInputStaging' to 'StagingInput'.
...
2015:03:30 11:29:58 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '551925a4d7bf757f1dd7508a' state changed from 'StagingInput' to 'PendingExecution'.
...
2015:03:30 11:29:58 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '551925a4d7bf757f1dd7508a' state changed from 'PendingExecution' to 'Scheduling'.
...
2015:03:30 11:29:58 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '551925a4d7bf757f1dd7508a' state changed from 'Scheduling' to 'Executing'.
...
2015:03:30 11:30:04 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '551925a4d7bf757f1dd7508a' state changed from 'Executing' to 'StagingOutput'
...
Up to this point, only a few seconds have passed from initial creation of the CU through to completion of execution.
...
2015:03:30 11:45:07 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '551925a4d7bf757f1dd7508a' state changed from 'StagingOutput' to 'Canceled'.
...
Finally, about 15 minutes later, it is cancelled when the pilot is killed.

Note that the file that was staged out (md11.crd) actually made it to the staging_area/iter1 directory OK.
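
For anyone checking their own host logs for the same symptom, here is a minimal sketch that scans for state transitions and reports any state a CU sat in for longer than a threshold. The regular expression is written against the log lines quoted above only, so it is an assumption that other radical.pilot versions format these messages the same way; `slow_states`, `TRANSITION` and the 60-second default are names and values made up for this example.

```python
import re
from datetime import datetime

# Pattern matched to the log lines quoted in this issue; other
# radical.pilot versions may format these messages differently.
TRANSITION = re.compile(
    r"^(\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"ComputeUnit '([0-9a-f]+)' state changed from '(\w+)' to '(\w+)'"
)


def slow_states(lines, threshold_s=60):
    """Return (unit_id, state, seconds) for every state that a unit
    stayed in for longer than threshold_s before moving on."""
    entered = {}  # unit_id -> (state, time the unit entered that state)
    slow = []
    for line in lines:
        m = TRANSITION.match(line)
        if not m:
            continue  # not a state-change line
        stamp, uid, old, new = m.groups()
        when = datetime.strptime(stamp, "%Y:%m:%d %H:%M:%S")
        # Only measure a state whose entry time we have actually seen.
        if uid in entered and entered[uid][0] == old:
            seconds = (when - entered[uid][1]).total_seconds()
            if seconds > threshold_s:
                slow.append((uid, old, seconds))
        entered[uid] = (new, when)
    return slow


# Three of the lines quoted above, copied verbatim.
SAMPLE = [
    "2015:03:30 11:29:58 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '551925a4d7bf757f1dd7508a' state changed from 'Scheduling' to 'Executing'.",
    "2015:03:30 11:30:04 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '551925a4d7bf757f1dd7508a' state changed from 'Executing' to 'StagingOutput'",
    "2015:03:30 11:45:07 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '551925a4d7bf757f1dd7508a' state changed from 'StagingOutput' to 'Canceled'.",
]

print(slow_states(SAMPLE))
# → [('551925a4d7bf757f1dd7508a', 'StagingOutput', 903.0)]
```

On the excerpt above this flags only StagingOutput (903 seconds between 11:30:04 and 11:45:07); the 6 seconds spent in Executing fall under the threshold.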

andre-merzky commented 9 years ago

It could be that this has the same underlying reason as #156. Can you please try to run again with 'EXTASY_DEBUG=TRUE', and send me the session ID afterwards?

Thanks!

ibethune commented 9 years ago

I always run with EXTASY_DEBUG=True (as per the instructions, I hope it's not case-sensitive?). After setting the .saga.cfg file I ran again and the session ID was 55195ff9d7bf756af82a8c36. This time it worked OK. I will try a few more runs to see if the problem has cleared up. If I get one to fail I'll let you know.

andre-merzky commented 9 years ago

Regarding EXTASY_DEBUG: Vivek can probably clarify the casing. I could not find the session record in mongodb though, so my guess would be that it indeed failed...

Either way, let's see what your further tests show... Thanks!

vivek-bala commented 9 years ago

Setting EXTASY_DEBUG to True should work. The default mongodb used in ExTASY is mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot
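
This thread does not say how ExTASY itself compares the value, so the following is not ExTASY's code. It is only a sketch of what a case-insensitive check could look like, for example in a wrapper script that wants to accept both `True` and `TRUE`; `debug_enabled` is a hypothetical helper name.

```python
import os


def debug_enabled(environ=os.environ):
    """Hypothetical helper, not part of ExTASY: report whether
    EXTASY_DEBUG is set to 'true' in any casing."""
    return environ.get("EXTASY_DEBUG", "").strip().lower() == "true"


print(debug_enabled({"EXTASY_DEBUG": "TRUE"}))  # → True
print(debug_enabled({}))                        # → False
```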

ibethune commented 9 years ago

Closing, fixed by the .saga.cfg file (or env var).