radical-cybertools / ExTASY

MDEnsemble
Other

Amber/Coco on ARCHER, using 0.1.3-beta #148

Closed: ebreitmo closed this issue 9 years ago

ebreitmo commented 9 years ago

On Friday my Coco/Amber job running from my Mac on ARCHER encountered problems:

ls -lrt
-rwx------ 1 ebreitmo e290      358 Feb 27 11:27 radical_pilot_cu_launch_script-uyPGKU.sh
-rw------- 1 ebreitmo e290      137 Feb 27 11:27 STDOUT
-rw------- 1 ebreitmo e290     1016 Feb 27 11:27 STDERR
-rw------- 1 ebreitmo e290     2325 Feb 27 11:27 md0.out
-rw------- 1 ebreitmo e290 10121216 Feb 27 11:27 core

more STDERR

Unit 9 Error on OPEN: md0.crd
Rank 0 [Fri Feb 27 11:27:42 2015] [c1-2c2s9n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

#0  0x8ABA6D in _gfortran_backtrace at backtrace.c:258
#1  0x8947E0 in _gfortrani_backtrace_handler at compile_options.c:129
#2  0x90C01F in raise
#3  0x90BFDB in raise at pt-raise.c:41
#4  0x91C5D0 in abort at abort.c:92
#5  0x7DED71 in MPID_Abort
#6  0x7BFBD2 in MPI_Abort
#7  0x797BB4 in pmpi_abort
#8  0x4A2156 in __pmemd_lib_mod_MOD_mexit
#9  0x4ADE49 in __file_io_mod_MOD_amopen
#10  0x41F363 in __inpcrd_dat_mod_MOD_init_inpcrd_dat
#11  0x4CF27E in __master_setup_mod_MOD_master_setup
#12  0x4B65AC in MAIN__ at pmemd.F90:0

_pmiu_daemon(SIGCHLD): [NID 03430] [c1-2c2s9n2] [Fri Feb 27 11:27:42 2015] PE RANK 0 exit signal Aborted
[NID 03430] 2015-02-27 11:27:42 Apid 13064419: initiated application termination

I can’t re-run it right now, as there are some issues on ARCHER.

Elena

vivek-bala commented 9 years ago

Seems like the required files aren't being staged to the CU. Could you post the entire log please? Also, could you check whether the input files are in pilot-*/staging_area?

I'm not able to log in to ARCHER (ssh_exchange_identification: Connection closed by remote host). Is this one of the issues you're referring to? On the ARCHER page, I see that there are issues with the login nodes.
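For example, something along these lines on an ARCHER login node would show what actually reached the staging area (the pilot ID below is a placeholder for the one from your run, and the iter*/ subdirectories may or may not exist yet):

# list the inputs staged by the client, plus any per-iteration files produced by earlier CUs
cd /work/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-<id>
ls -l staging_area/
ls -l staging_area/iter*/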

ebreitmo commented 9 years ago

Please find the log-file attached.

Yes, I'm encountering the same issue with ARCHER today, which is why I can't do any further tests until it is fixed.

Cheers, Elena



vivek-bala commented 9 years ago

Hi Elena,

I didn't receive any attached files in your email. Could you send it in a separate email or as a gist, please?

Vivek


vivek-bala commented 9 years ago

I think this may be something transient. One of the simulations succeeds while another from the same stage fails. I will test again when access is back.

2015:02:27 11:27:41 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54f0548e4c917a060183503e' state changed from 'StagingOutput' to 'Done'.
2015:02:27 11:27:42 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54f0548e4c917a060183503d' state changed from 'Executing' to 'Failed'.
ibethune commented 9 years ago

Hi Vivek, it would seem that this could be caused by the login failures on ARCHER. Do you think it would be possible, when a file transfer fails, for the workflow to fail at that point, rather than executing subsequent CUs which then fail because a file was not available?

vivek-bala commented 9 years ago

The CU should in fact fail at input staging (log: https://gist.githubusercontent.com/vivek-bala/a2de0d0442091f538348/raw/issue_148).

2015:02:27 11:27:11 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54f0548e4c917a060183503d' state changed from 'PendingInputStaging' to 'StagingInput'.
2015:02:27 11:27:11 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54f0548e4c917a060183503e' state changed from 'StagingInput' to 'PendingExecution'.
2015:02:27 11:27:11 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54f0548e4c917a060183503d' state changed from 'StagingInput' to 'PendingExecution'.
2015:02:27 11:27:12 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54f0548e4c917a060183503e' state changed from 'PendingExecution' to 'Scheduling'.
2015:02:27 11:27:12 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54f0548e4c917a060183503d' state changed from 'PendingExecution' to 'Scheduling'.
2015:02:27 11:27:12 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54f0548e4c917a060183503e' state changed from 'Scheduling' to 'Executing'.
2015:02:27 11:27:12 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54f0548e4c917a060183503d' state changed from 'Scheduling' to 'Executing'

But it seems that the transfer for CU '54f0548e4c917a060183503d' is being registered as successful: it goes through StagingInput before Executing.

If the file had not been available in the source directory, this would have failed. I believe the file is present, since the other CUs execute successfully. I am sure staging fails if the file is missing at the source; I am not sure whether there is a check on the target/destination, or whether one is needed at all.
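(As a crude target-side check, something like the following could be run in the CU sandbox after input staging; this is just a sketch of the kind of check meant here, not something that exists today:)

# fail loudly if the staged input is missing or has zero length in the unit directory
test -s md0.crd || echo "md0.crd missing or empty in $(pwd)" >&2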

But once the nodes are stable again, I think we should retry and see whether we encounter this problem again.

ebreitmo commented 9 years ago

It’s queuing…

Cheers, Elena



ebreitmo commented 9 years ago

Hi Vivek,

I am also trying to run the same stuff from a Linux machine.

/work/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-54f6da7723d96e0ee51f7b3c/unit-54f6daf423d96e0ee51f7b49> more STDERR

Unit 9 Error on OPEN: md0.crd

Rank 0 [Wed Mar 4 10:14:44 2015] [c2-0c0s4n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

#0  0x8ABA6D in _gfortran_backtrace at backtrace.c:258
#1  0x8947E0 in _gfortrani_backtrace_handler at compile_options.c:129
#2  0x90C01F in raise
#3  0x90BFDB in raise at pt-raise.c:41
#4  0x91C5D0 in abort at abort.c:92
#5  0x7DED71 in MPID_Abort
#6  0x7BFBD2 in MPI_Abort
#7  0x797BB4 in pmpi_abort
#8  0x4A2156 in __pmemd_lib_mod_MOD_mexit
#9  0x4ADE49 in __file_io_mod_MOD_amopen
#10  0x41F363 in __inpcrd_dat_mod_MOD_init_inpcrd_dat
#11  0x4CF27E in __master_setup_mod_MOD_master_setup
#12  0x4B65AC in MAIN__ at pmemd.F90:0

_pmiu_daemon(SIGCHLD): [NID 00402] [c2-0c0s4n2] [Wed Mar 4 10:14:44 2015] PE RANK 0 exit signal Aborted
[NID 00402] 2015-03-04 10:14:44 Apid 13089484: initiated application termination

I attach the extasy.log-file again.

Cheers, Elena



ebreitmo commented 9 years ago

Hi Vivek,

I installed extasy again from scratch.

ebreitmo@eslogin002:/work/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-54feecb74c917a9428e6fa08/unit-54ff4b964c917a9428e6fa13> more STDERR

Unit 9 Error on OPEN: md0.crd
Rank 0 [Tue Mar 10 19:52:58 2015] [c1-3c0s8n3] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

#0  0x8ABA6D in _gfortran_backtrace at backtrace.c:258
#1  0x8947E0 in _gfortrani_backtrace_handler at compile_options.c:129
#2  0x90C01F in raise
#3  0x90BFDB in raise at pt-raise.c:41
#4  0x91C5D0 in abort at abort.c:92
#5  0x7DED71 in MPID_Abort
#6  0x7BFBD2 in MPI_Abort
#7  0x797BB4 in pmpi_abort
#8  0x4A2156 in __pmemd_lib_mod_MOD_mexit
#9  0x4ADE49 in __file_io_mod_MOD_amopen
#10  0x41F363 in __inpcrd_dat_mod_MOD_init_inpcrd_dat
#11  0x4CF27E in __master_setup_mod_MOD_master_setup
#12  0x4B65AC in MAIN__ at pmemd.F90:0

_pmiu_daemon(SIGCHLD): [NID 04835] [c1-3c0s8n3] [Tue Mar 10 19:52:58 2015] PE RANK 0 exit signal Aborted
[NID 04835] 2015-03-10 19:52:58 Apid 13176795: initiated application termination

ebreitmo@eslogin002:/work/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-54feecb74c917a9428e6fa08/unit-54ff4b964c917a9428e6fa13> more STDOUT
Application 13176795 exit codes: 134

Cheers, Elena



ebreitmo commented 9 years ago

Tried again, still a problem.

... [Callback]: ComputeUnit '550b78ad4c917ad482255161' state changed to Failed.
#######################
ERROR
#######################

Then, checking on ARCHER:

ebreitmo@eslogin004:/work/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-550ae74f4c917ad482255156/unit-550b78ad4c917ad482255161> more STDERR

Unit 9 Error on OPEN: md0.crd
Rank 0 [Fri Mar 20 01:32:53 2015] [c4-1c2s7n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

#0  0x8ABA6D in _gfortran_backtrace at backtrace.c:258
#1  0x8947E0 in _gfortrani_backtrace_handler at compile_options.c:129
#2  0x90C01F in raise
#3  0x90BFDB in raise at pt-raise.c:41
#4  0x91C5D0 in abort at abort.c:92
#5  0x7DED71 in MPID_Abort
#6  0x7BFBD2 in MPI_Abort
#7  0x797BB4 in pmpi_abort
#8  0x4A2156 in __pmemd_lib_mod_MOD_mexit
#9  0x4ADE49 in __file_io_mod_MOD_amopen
#10  0x41F363 in __inpcrd_dat_mod_MOD_init_inpcrd_dat
#11  0x4CF27E in __master_setup_mod_MOD_master_setup
#12  0x4B65AC in MAIN__ at pmemd.F90:0

_pmiu_daemon(SIGCHLD): [NID 02461] [c4-1c2s7n1] [Fri Mar 20 01:32:53 2015] PE RANK 0 exit signal Aborted

I copied the directory to /work/e290/e290shared/elena; there is also a core file.

vivek-bala commented 9 years ago

I tried it too; I didn't get this particular error. Could you post the entire log please? Is this the first or the second iteration? I got an error in the CoCo stage; it seems the module prerequisites for CoCo (scipy) have changed. I have made the change and run it again; it's in the queue. Will update.

ebreitmo commented 9 years ago

Cheers, Elena



ebreitmo commented 9 years ago

Vivek, I emailed you extasy.log. Since the run hangs after it fails and doesn't return to the command line, I have to use Ctrl-X-S to produce the extasy.log file; that's what the last few lines in the file are about.

vivek-bala commented 9 years ago

I didn't get any attachments in the email. But I did get this same error again.

1) There were some changes in the default modules loaded (numpy, scipy), I believe. I have added the specific module versions needed to load the required modules, and the CoCo phase now seems to run fine.

2) The Amber-phase error seems to be intermittent. For the same workload and the same ExTASY version (after the above change), the Amber stage fails with the error you got about 50% of the time. I have tried it 4 times, with 2 failures and 2 successes.

Not sure how to proceed on this.

ibethune commented 9 years ago

I tried running CoCo/Amber on ARCHER this afternoon and hit an error with the same signature as Elena's. I have put the entire pilot-* directory and the logs from the extasy script in /work/e290/e290/shared/iain.

From what I have seen, CU 5512bb34d7bf75b9ae146101 fails. This is a simulation CU from cycle 1. All the CUs in cycle 0 completed successfully, and several other simulation CUs in cycle 1 also completed OK.

The immediate cause of the failure is that the AMBER executable failed to open md1.crd (see the md1.out and STDERR file for details).

This file does appear to exist, and is a symlink to the staging_area:

lrwxrwxrwx 1 e290ib e290 102 Mar 25 13:42 md1.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/iter1/md11.crd

This file is a symlink to one of the previous CUs (the analysis step in Cycle 1):

lrwxrwxrwx 1 e290ib e290 114 Mar 25 13:41 min11.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/unit-5512ba65d7bf75b9ae1460f7/min11.crd

And that file appears to contain valid data:

e290ib@eslogin006:/work/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/unit-5512ba65d7bf75b9ae1460f7> head min11.crd
default_name
    59
  15.4120000   8.3360000   3.0810000  14.8320000   8.1970000   2.8460000
  15.1830000   8.5420000   3.5390000  15.4040000   8.2920000   3.4110000
  12.7410000   7.3830000   0.8000000  13.3660000   7.4450000  -1.3240000
  10.2760000   6.6220000   1.4260000   9.9820000   6.5560000   3.1950000
   7.9580000   5.7680000  -0.2820000   8.7040000   4.8660000  -1.7090000
   5.9510000   4.5700000   0.6520000   5.7780000   5.3060000   1.7430000
   4.5340000   3.5640000   0.2380000   6.1730000   3.9610000   0.7160000
   6.6120000   7.1400000  -1.1190000   6.2220000   7.4560000  -1.8240000
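For what it's worth, a quick way to sanity-check a chain like this, following every link down to its final target, is something along these lines:

# print the ultimate target of the symlink chain, then the size of that target (not of the link)
readlink -f md1.crd
ls -lL md1.crd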

There is nothing in the output from coco or the postexec.py to indicate that anything went wrong here…

Any ideas?

I am trying to run again to see if this is repeatable.

ibethune commented 9 years ago

The failure is indeed repeatable. I ran again this morning and it failed in exactly the same way, with pmemd failing to open a file md1.crd, which appears to be valid...

md1.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5513ccf1d7bf7531524aee42/staging_area/iter1/md11.crd

ashkurti commented 9 years ago

I have copied everything to a local machine, unzipped it, and explored the files there.

My first impression is that the .crd file required by the failing unit exists but is empty:

ardita@poirot 120% pwd
/users/ardita/extasy_archer_investig/pilot-5512b723d7bf75b9ae1460e5/unit-5512bb34d7bf75b9ae146101
ardita@poirot 121% ls -l
total 9912
-rw------- 1 ardita pa 10121216 Mar 25 13:42 core
lrwxrwxrwx 1 ardita pa      102 Mar 26 10:08 md1.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/iter1/md11.crd
-rw------- 1 ardita pa     2325 Mar 25 13:42 md1.out
lrwxrwxrwx 1 ardita pa       98 Mar 26 10:08 mdshort.in -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/mdshort.in
lrwxrwxrwx 1 ardita pa       97 Mar 26 10:08 penta.top -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/penta.top
-rwx------ 1 ardita pa      356 Mar 25 13:42 radical_pilot_cu_launch_script-_VvLL0.sh
-rw------- 1 ardita pa     1016 Mar 25 13:42 STDERR
-rw------- 1 ardita pa      137 Mar 25 13:42 STDOUT
ardita@poirot 122% ls -l ../staging_area/iter1/md11.crd
lrwxrwxrwx 1 ardita pa 112 Mar 26 10:08 ../staging_area/iter1/md11.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/unit-5512bb28d7bf75b9ae1460f9/md1.crd
ardita@poirot 123% ls -l ../unit-5512bb28d7bf75b9ae1460f9/md1.crd
-rw------- 1 ardita pa 0 Mar 25 13:53 ../unit-5512bb28d7bf75b9ae1460f9/md1.crd
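A quick way to spot every zero-length file under the pilot sandbox (assuming GNU find; -L follows the symlinks) is something like:

# list all zero-length regular files under this pilot's sandbox, dereferencing symlinks
find -L pilot-5512b723d7bf75b9ae1460e5/ -type f -size 0 -ls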
ashkurti commented 9 years ago

Just to add that I investigated the unit folders of CU 5512bb34d7bf75b9ae146101, since it was the first to fail according to the log Iain provided:

2015:03:25 13:42:26 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '5512bb34d7bf75b9ae146101' state changed from 'Executing' to 'Failed'.
ashkurti commented 9 years ago

In the CU that should generate the .crd file (which is currently empty!?), we did not notice any clue as to what might have gone wrong. The STDERR file is empty too, and the STDOUT file does not help:

more STDOUT
Application 13349432 resources: utime ~0s, stime ~0s, Rss ~4104, inblocks ~11452, outblocks ~30287

Does anyone know whether AMBER has been recompiled on ARCHER?

Recently, on local machines here, Charlie noticed that a recompiled version of AMBER gave problems with restart files such as the .crd file needed here: the produced restart files had a different name from the one the user requested.

ashkurti commented 9 years ago

In the directory that should have produced the correct (rather than empty) .crd file (CU unit-5512bb28d7bf75b9ae1460f9), I just noticed that the referenced penta.top file is also empty:

/users/ardita/extasy_archer_investig/pilot-5512b723d7bf75b9ae1460e5/unit-5512bb28d7bf75b9ae1460f9
ardita@poirot 179% ls
logfile  min1.crd  min1.out  penta.crd  radical_pilot_cu_launch_script-onhz2h.sh  STDOUT
md1.crd  min1.inf  min.in    penta.top  STDERR
ardita@poirot 180% ls -l
total 48
-rw------- 1 ardita pa  2347 Mar 25 13:42 logfile
-rw------- 1 ardita pa     0 Mar 25 13:53 md1.crd
lrwxrwxrwx 1 ardita pa   103 Mar 26 10:08 min1.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/iter1/min11.crd
-rw------- 1 ardita pa   409 Mar 25 13:42 min1.inf
-rw------- 1 ardita pa 12839 Mar 25 13:42 min1.out
lrwxrwxrwx 1 ardita pa    94 Mar 26 10:08 min.in -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/min.in
lrwxrwxrwx 1 ardita pa    97 Mar 26 10:08 penta.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/penta.crd
lrwxrwxrwx 1 ardita pa    97 Mar 26 10:08 penta.top -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/penta.top
-rwx------ 1 ardita pa   358 Mar 25 13:42 radical_pilot_cu_launch_script-onhz2h.sh
-rw------- 1 ardita pa     0 Mar 25 13:42 STDERR
-rw------- 1 ardita pa    99 Mar 25 13:42 STDOUT
ardita@poirot 181% ls -l ../staging_area/pen
penta.crd  penta.top
ardita@poirot 181% ls -l ../staging_area/penta.top
-rw-r--r-- 1 ardita pa 0 Mar 25 13:53 ../staging_area/penta.top
ibethune commented 9 years ago

OK I see the problem now!

CU unit-_101 (the failing CU) starts before the preceding CU unit-_0f9 has completed.

In the extasy.callbacks file, there is never a callback saying CU 0f9 is Done before CU 101 stages its input files and starts executing.

In this case, the overlap must be quite close, since the first CU does indeed start before the second.

However, I ran another example (files in /work/e290/e290/shared/iain/pilot-5513e4a5d7bf759c63c121b8), where the CU which fails (unit-5513e70fd7bf759c63c121d4) has in its output:

| Run on 03/26/2015 at 11:01:38

This CU requires the file md1.crd

md1.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5513e4a5d7bf759c63c121b8/staging_area/iter1/md11.crd

/fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5513e4a5d7bf759c63c121b8/staging_area/iter1/md11.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5513e4a5d7bf759c63c121b8/unit-5513e6f2d7bf759c63c121cc/md1.crd

In CU *1cc, the output says:

| Run on 03/26/2015 at 11:01:50

i.e. the dependency between these two CUs is not correct. The CU which produces the file must run to completion before the CU which consumes the file starts.
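One quick way to see this kind of inversion across a whole run is to compare the start timestamps pmemd records in each unit's output, e.g. (run from the pilot sandbox):

# print the pmemd start time from every unit's output file; a consumer whose timestamp
# precedes its producer's indicates the dependency was violated
grep -H '| Run on' unit-*/*.out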

ashkurti commented 9 years ago

I would also see a problem with the empty penta.top file in the staging area; maybe there was a transfer failure for penta.top.

ibethune commented 9 years ago

Yes, you're right. The mdshort.in in the staging area is empty as well. However, those files were used by previous, successful CUs, so I don't know whether this is really a problem or not. The incorrect ordering of CUs I reported above certainly is a bug, though (I hope).

vivek-bala commented 9 years ago

The ordering of CUs has been fixed in devel.