radical-cybertools / ExTASY


extasy 0.1.4beta mac->ARCHER restart #169

Closed ebreitmo closed 9 years ago

ebreitmo commented 9 years ago

Hi,

The original coco-amber run completed successfully when launched from my Mac against ARCHER. I then set 'start_iter = 2' in cocoamber.wcfg to restart, keeping the backup directory on my Mac. The restart fails in the simulation stage:

Cycle : 2
Starting Analysis
[Callback]: ComputeUnit '55224b304c917a03411c63ef' state changed to PendingInputStaging.
[Callback]: ComputePilot '55224b134c917a03411c63ed' state changed to PendingActive.
[Callback]: ComputePilot '55224b134c917a03411c63ed' state changed to Active.
[Callback]: ComputeUnit '55224b304c917a03411c63ef' state changed to StagingInput.
[Callback]: ComputeUnit '55224b304c917a03411c63ef' state changed to PendingExecution.
[Callback]: ComputeUnit '55224b304c917a03411c63ef' state changed to Scheduling.
[Callback]: ComputeUnit '55224b304c917a03411c63ef' state changed to Executing.
[Callback]: ComputeUnit '55224b304c917a03411c63ef' state changed to StagingOutput.
[Callback]: ComputeUnit '55224b304c917a03411c63ef' state changed to Done.
Analysis Execution time : 71.952
Starting Simulation …

[Callback]: ComputeUnit '55224c284c917a03411c63f7' state changed to Failed.

ERROR

ComputeUnit 55224c284c917a03411c63f7 has FAILED. Can't recover.
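To find which units to inspect in the sandbox, the callback lines above can be scanned for the last state reported per unit. The following is a minimal sketch, assuming the log keeps the "[Callback]: ... state changed to ..." wording shown in this report; the function names are hypothetical and not part of ExTASY or RADICAL-Pilot.

```python
import re

# Matches callback lines such as:
#   [Callback]: ComputeUnit '5522...' state changed to Failed.
# The pattern is derived from the output quoted in this issue only.
CALLBACK_RE = re.compile(
    r"\[Callback\]: (ComputeUnit|ComputePilot) '([0-9a-f]+)' state changed to (\w+)\."
)


def last_states(log_text):
    """Return {(kind, uid): last reported state} for every callback entry."""
    states = {}
    for kind, uid, state in CALLBACK_RE.findall(log_text):
        # Later entries overwrite earlier ones, so the final value is the last state.
        states[(kind, uid)] = state
    return states


def failed_units(log_text):
    """Return the sorted IDs of ComputeUnits whose last reported state is Failed."""
    return sorted(
        uid
        for (kind, uid), state in last_states(log_text).items()
        if kind == "ComputeUnit" and state == "Failed"
    )
```

Each returned ID corresponds to a unit-<id> directory under the pilot sandbox, where STDERR can then be read.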

On ARCHER: ebreitmo@eslogin002:/work/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-55224b134c917a03411c63ed/unit-55224c284c917a03411c63f7> more STDERR

Unit 9 Error on OPEN: penta.crd
Rank 0 [Mon Apr 6 10:04:45 2015] [c2-1c0s2n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

Program received signal SIGABRT: Process abort signal.

ebreitmo commented 9 years ago

I also get a problem when I run from Linux:

Preprocessing stage ...
Expecting 8 ncdf files in backup/iter0 folder
Expected number of files not found ... Exiting restart

backup/iter0 does contain md_0_0.ncdf, ..., md_0_7.ncdf: eight non-empty files in total.
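The file count can be confirmed independently of ExTASY with a short script. This is a minimal sketch, assuming the md_<iteration>_<n>.ncdf naming seen in the listing; the function name is hypothetical, and ExTASY's own preprocessing check may use different criteria (for example, a different base directory).

```python
import glob
import os


def check_iteration_backup(backup_dir, iteration, expected):
    """Count non-empty md_<iteration>_*.ncdf files in backup_dir/iter<iteration>.

    Returns (ok, files), where ok is True only if exactly `expected`
    non-empty files were found.
    """
    pattern = os.path.join(
        backup_dir, "iter%d" % iteration, "md_%d_*.ncdf" % iteration
    )
    files = sorted(glob.glob(pattern))
    # Ignore zero-length files, which would indicate an incomplete transfer.
    non_empty = [f for f in files if os.path.getsize(f) > 0]
    return len(non_empty) == expected, non_empty


if __name__ == "__main__":
    ok, files = check_iteration_backup("backup", 0, 8)
    print("found %d non-empty files, ok=%s" % (len(files), ok))
```

Running it from the directory in which extasy is started shows whether the files are visible from there under the relative path backup/iter0.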

ebreitmo commented 9 years ago

(As emailed to Vivek on 24th April),

/work/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-55224fe24c917a03c130b2f1>

ls -lrt
total 316
-rwxr-xr-x 1 ebreitmo e290 14620 Apr 6 10:20 default_bootstrapper.sh
-rw-r--r-- 1 ebreitmo e290 106287 Apr 6 10:20 radical-pilot-agent.py
drwx--S--- 2 ebreitmo e290 4096 Apr 6 10:23 unit-55224fff4c917a03c130b2f3
drwxr-sr-x 5 ebreitmo e290 4096 Apr 6 10:23 staging_area
drwx--S--- 2 ebreitmo e290 4096 Apr 6 10:23 unit-552250a24c917a03c130b2fa
drwx--S--- 2 ebreitmo e290 4096 Apr 6 10:23 unit-552250a24c917a03c130b2f8
drwx--S--- 2 ebreitmo e290 4096 Apr 6 10:23 unit-552250a24c917a03c130b2f7
drwx--S--- 2 ebreitmo e290 4096 Apr 6 10:23 unit-552250a24c917a03c130b2f6
drwx--S--- 2 ebreitmo e290 4096 Apr 6 10:23 unit-552250a24c917a03c130b2f4
drwx--S--- 2 ebreitmo e290 4096 Apr 6 10:24 unit-552250a24c917a03c130b2f9
drwx--S--- 2 ebreitmo e290 4096 Apr 6 10:24 unit-552250a24c917a03c130b2fb
drwx--S--- 2 ebreitmo e290 4096 Apr 6 10:24 unit-552250a24c917a03c130b2f5
-rw------- 1 ebreitmo e290 30205 Apr 6 10:26 AGENT.STDOUT
-rw------- 1 ebreitmo e290 62398 Apr 6 10:26 AGENT.STDERR
-rw------- 1 ebreitmo e290 58125 Apr 6 10:26 AGENT.LOG

Every unit subdirectory contains a core file.

ls -lrt unit-552250a24c917a03c130b2fa/
total 9344
lrwxrwxrwx 1 ebreitmo e290 99 Apr 6 10:23 penta.top -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-55224fe24c917a03c130b2f1/staging_area/penta.top
lrwxrwxrwx 1 ebreitmo e290 105 Apr 6 10:23 penta.crd -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-55224fe24c917a03c130b2f1/staging_area/iter0/penta.crd
lrwxrwxrwx 1 ebreitmo e290 96 Apr 6 10:23 min.in -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-55224fe24c917a03c130b2f1/staging_area/min.in
lrwxrwxrwx 1 ebreitmo e290 105 Apr 6 10:23 min2.crd -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-55224fe24c917a03c130b2f1/staging_area/iter2/min26.crd
-rwx------ 1 ebreitmo e290 360 Apr 6 10:23 radical_pilot_cu_launch_script-73w1qH.sh
-rw------- 1 ebreitmo e290 137 Apr 6 10:23 STDOUT
-rw------- 1 ebreitmo e290 1016 Apr 6 10:23 STDERR
-rw------- 1 ebreitmo e290 2548 Apr 6 10:23 min2.out
-rw------- 1 ebreitmo e290 10121216 Apr 6 10:23 core

more unit-552250a24c917a03c130b2fa/STDERR

Unit 9 Error on OPEN: penta.crd
Rank 0 [Mon Apr 6 10:23:50 2015] [c2-1c0s2n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

0 0x8ABA6D in _gfortran_backtrace at backtrace.c:258
1 0x8947E0 in _gfortrani_backtrace_handler at compile_options.c:129
2 0x90C01F in raise
3 0x90BFDB in raise at pt-raise.c:41
4 0x91C5D0 in abort at abort.c:92
5 0x7DED71 in MPID_Abort
6 0x7BFBD2 in MPI_Abort
7 0x797BB4 in pmpi_abort
8 0x4A2156 in __pmemd_lib_mod_MOD_mexit
9 0x4ADE49 in __file_io_mod_MOD_amopen
10 0x41F363 in __inpcrd_dat_mod_MOD_init_inpcrd_dat
11 0x4CF27E in __master_setup_mod_MOD_master_setup
12 0x4B65AC in MAIN__ at pmemd.F90:0

_pmiu_daemon(SIGCHLD): [NID 01930] [c2-1c0s2n2] [Mon Apr 6 10:23:50 2015] PE RANK 0 exit signal Aborted
[NID 01930] 2015-04-06 10:23:50 Apid 13483701: initiated application termination
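Since penta.crd in the unit directory is a symlink into staging_area/iter0, one way to check whether the "Error on OPEN" comes from a missing link target is to list the dangling symlinks in that directory. This is a minimal sketch with a hypothetical function name; whether the link is actually dangling is not confirmed anywhere in this thread.

```python
import os


def dangling_symlinks(directory):
    """Return (link_name, target) pairs for symlinks whose target does not exist."""
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # os.path.exists() follows symlinks, so it is False for a dangling link,
        # while os.path.islink() is still True for the link itself.
        if os.path.islink(path) and not os.path.exists(path):
            broken.append((name, os.readlink(path)))
    return broken


if __name__ == "__main__":
    import sys

    for name, target in dangling_symlinks(sys.argv[1]):
        print("%s -> %s (missing)" % (name, target))
```

Passing a unit directory such as unit-552250a24c917a03c130b2fa as the argument prints every staged input whose target is absent from the staging area.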

ebreitmo commented 9 years ago

And extasy.log should be in https://gist.github.com/anonymous/aa9597580653691413f2

ebreitmo commented 9 years ago

I re-ran it with 0.1.4-beta, same as before:

EXTASY_DEBUG=True RADICAL_PILOT_VERBOSE='debug' SAGA_VERBOSE='debug' extasy --RPconfig archer.rcfg --Kconfig cocoamber.wcfg 2> extasy.log
ExTASY version : 0.1.4-beta
Session UID: 553fa3cc4c917a3d6252c8f7
Pilot UID : 553fa3cc4c917a3d6252c8f9
Loading kernel configurations from /Users/elenabreitmoser/ExTASY-tools/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/amber.json
Loading kernel configurations from /Users/elenabreitmoser/ExTASY-tools/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/coco.json
Loading kernel configurations from /Users/elenabreitmoser/ExTASY-tools/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/gromacs.json
Loading kernel configurations from /Users/elenabreitmoser/ExTASY-tools/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/lsdmap.json
Loading kernel configurations from /Users/elenabreitmoser/ExTASY-tools/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/mmpbsa.json
Loading kernel configurations from /Users/elenabreitmoser/ExTASY-tools/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/namd.json
Loading kernel configurations from /Users/elenabreitmoser/ExTASY-tools/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/sleep.json
Loading kernel configurations from /Users/elenabreitmoser/ExTASY-tools/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/test.json
Preprocessing stage ....
Expecting 8 ncdf files in backup/iter0 folder
Expected number of files not found ... Exiting restart ...
Closing session, exiting now ...

where backup/iter0 contains:

ls -lrt backup/iter0/
total 2048
-rw------- 1 elenabreitmoser staff 129768 28 Apr 10:32 md_0_0.ncdf
-rw------- 1 elenabreitmoser staff 129768 28 Apr 10:36 md_0_5.ncdf
-rw------- 1 elenabreitmoser staff 129768 28 Apr 10:38 md_0_4.ncdf
-rw------- 1 elenabreitmoser staff 129768 28 Apr 10:41 md_0_1.ncdf
-rw------- 1 elenabreitmoser staff 129768 28 Apr 10:44 md_0_7.ncdf
-rw------- 1 elenabreitmoser staff 129768 28 Apr 10:46 md_0_2.ncdf
-rw------- 1 elenabreitmoser staff 129768 28 Apr 10:49 md_0_6.ncdf
-rw------- 1 elenabreitmoser staff 129768 28 Apr 10:51 md_0_3.ncdf

vivek-bala commented 9 years ago

I think this is with an older version of Radical Pilot. RP 0.29 is still affected by https://github.com/radical-cybertools/radical.pilot/issues/587, and ExTASY has not been tested on ARCHER yet. I'll get right on it when 587 is fixed.


andre-merzky commented 9 years ago

https://github.com/radical-cybertools/radical.pilot/issues/587 is fixed by now, with the release of RP v0.30.

vivek-bala commented 9 years ago

This should be fixed in RP 0.31 and the devel branch of ExTASY.

ebreitmo commented 9 years ago

I used 0.1.4-beta-12-g1a55c39. First I had 'start_iter=0', 'num_iterations=2', then restarted with 'start_iter=2', 'num_iterations=2'.

It works fine now and the issue can be closed.