Closed ebreitmo closed 9 years ago
Seems like the required files aren't being staged to the CU. Could you post the entire log please? Also, could you check if the input files are in pilot-*/staging_area ?
I'm not able to log in to ARCHER (ssh_exchange_identification: Connection closed by remote host). Is this the cause of the issues you're referring to? On the ARCHER status page, I see that there are issues with the login node.
Please find the log-file attached.
Yes, I am encountering the same issue on ARCHER today, and that's why I can't do any further tests until this is fixed.
Cheers, Elena
Dr Elena Breitmoser
EPCC, University of Edinburgh JCMB, Room 3401 Peter Guthrie Tait Road UK-Edinburgh EH9 3FD
Tel: +44 131 650 6494
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Hi Elena,
I didn't receive any attached files in your email. Could you send it in a separate email or as a gist, please?
Vivek
I think this may be something momentary: one of the simulations succeeds while another from the same stage fails. I will test again when access is back.
2015:02:27 11:27:41 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54f0548e4c917a060183503e' state changed from 'StagingOutput' to 'Done'.
2015:02:27 11:27:42 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54f0548e4c917a060183503d' state changed from 'Executing' to 'Failed'.
Hi Vivek, it would seem that this could be caused by the login failures on ARCHER. Do you think it would be possible, when a file transfer fails, for the workflow to fail at that point, rather than executing subsequent CUs which then fail because a file was not available?
The CU should in fact fail at input staging. (Log: https://gist.githubusercontent.com/vivek-bala/a2de0d0442091f538348/raw/issue_148)
2015:02:27 11:27:11 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54f0548e4c917a060183503d' state changed from 'PendingInputStaging' to 'StagingInput'.
2015:02:27 11:27:11 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54f0548e4c917a060183503e' state changed from 'StagingInput' to 'PendingExecution'.
2015:02:27 11:27:11 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54f0548e4c917a060183503d' state changed from 'StagingInput' to 'PendingExecution'.
2015:02:27 11:27:12 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54f0548e4c917a060183503e' state changed from 'PendingExecution' to 'Scheduling'.
2015:02:27 11:27:12 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54f0548e4c917a060183503d' state changed from 'PendingExecution' to 'Scheduling'.
2015:02:27 11:27:12 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54f0548e4c917a060183503e' state changed from 'Scheduling' to 'Executing'.
2015:02:27 11:27:12 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54f0548e4c917a060183503d' state changed from 'Scheduling' to 'Executing'
But it seems that CU '54f0548e4c917a060183503d' is registered as a successful transfer: it goes through StagingInput before Executing.
If the file had not been available in the original directory, this would have failed. I believe the file should be present, though, since the other CUs execute successfully. I am sure the staging will fail if the file at the source is not present; I am not sure whether there is a check on the target/destination, or whether one is required at all.
But I think that once the nodes are stable again, we should retry and see if we encounter this problem again.
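The target-side check being discussed could look something like the following. This is a minimal, hypothetical sketch (verify_staged_inputs and its arguments are illustrative, not part of any existing RADICAL-Pilot API): confirm that each staged input exists and is non-empty after resolving any symlink chain, before the CU is allowed to execute.

```python
import os

def verify_staged_inputs(unit_dir, expected_files):
    """Hypothetical pre-execution check: confirm each staged input
    exists and is non-empty, following symlinks to the real file."""
    problems = []
    for name in expected_files:
        path = os.path.join(unit_dir, name)
        real = os.path.realpath(path)          # resolve symlink chains
        if not os.path.exists(real):
            problems.append((name, "missing"))
        elif os.path.getsize(real) == 0:
            problems.append((name, "empty"))
    return problems
```

Run against a unit sandbox, this would have flagged the failing CU's input as "missing" or "empty" before pmemd ever started.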
It’s queuing…
Cheers, Elena
Hi Vivek,
I am also trying to run the same stuff from a Linux machine.
/work/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-54f6da7723d96e0ee51f7b3c/unit-54f6daf423d96e0ee51f7b49> more STDERR
Unit 9 Error on OPEN: md0.crd
Rank 0 [Wed Mar 4 10:14:44 2015] [c2-0c0s4n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
_pmiu_daemon(SIGCHLD): [NID 00402] [c2-0c0s4n2] [Wed Mar 4 10:14:44 2015] PE RANK 0 exit signal Aborted [NID 00402] 2015-03-04 10:14:44 Apid 13089484: initiated application termination
I attach the extasy.log-file again.
Cheers, Elena
Hi Vivek,
I installed extasy again from scratch.
ebreitmo@eslogin002:/work/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-54feecb74c917a9428e6fa08/unit-54ff4b964c917a9428e6fa13> more STDERR
Unit 9 Error on OPEN: md0.crd
Rank 0 [Tue Mar 10 19:52:58 2015] [c1-3c0s8n3] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
_pmiu_daemon(SIGCHLD): [NID 04835] [c1-3c0s8n3] [Tue Mar 10 19:52:58 2015] PE RANK 0 exit signal Aborted [NID 04835] 2015-03-10 19:52:58 Apid 13176795: initiated application termination
ebreitmo@eslogin002:/work/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-54feecb74c917a9428e6fa08/unit-54ff4b964c917a9428e6fa13> more STDOUT
Application 13176795 exit codes: 134
Cheers, Elena
Tried again, still a problem.
... [Callback]: ComputeUnit '550b78ad4c917ad482255161' state changed to Failed.
Then checking on ARCHER
ebreitmo@eslogin004:/work/e290/e290/ebreitmo/radical.pilot.sandbox/pilot-550ae74f4c917ad482255156/unit-550b78ad4c917ad482255161> more STDERR
Unit 9 Error on OPEN: md0.crd
Rank 0 [Fri Mar 20 01:32:53 2015] [c4-1c2s7n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
_pmiu_daemon(SIGCHLD): [NID 02461] [c4-1c2s7n1] [Fri Mar 20 01:32:53 2015] PE RANK 0 exit signal Aborted
I copied the directory to /work/e290/e290shared/elena, there is also a core file.
I tried it too. I didn't get this error in particular. Could you post the entire log, please? Is this the first or the second iteration? I got an error in the CoCo stage; it seems the module prerequisites for CoCo (scipy) have changed. I have made the change and run it again; it's in the queue. Will update.
Vivek, I emailed you extasy.log. Since it hangs after it fails and doesn't return to the command line, I have to use Ctrl-X-S to produce the extasy.log file. That's what the last few lines in the file are about.
I didn't get any attachments in the email. But I did get this same error again.
1) There were some changes in the default modules loaded (numpy, scipy), I believe. I have added the specific module versions to load the required modules, and the CoCo phase now seems to run fine.
2) The Amber-phase error seems to be intermittent. For the same workload and the same ExTASY version (after the above change), the Amber stage fails with the error you got about 50% of the time: I have tried it 4 times, with 2 failures and 2 successes.
Not sure how to proceed on this.
I tried running CoCo/Amber on ARCHER this afternoon and hit an error with the same signature as Elena's. I have put the entire pilot-* directory and the logs from the extasy script in /work/e290/e290/shared/iain
From what I have seen, CU 5512bb34d7bf75b9ae146101 fails. This is a simulation CU from Cycle 1. All the CUs in cycle 0 completed successfully, and several other simulation CUs in cycle 1 also completed ok.
The immediate cause of the failure is that the AMBER executable failed to open md1.crd (see the md1.out and STDERR file for details).
This file does appear to exist, and is a symlink to the staging_area:
lrwxrwxrwx 1 e290ib e290 102 Mar 25 13:42 md1.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/iter1/md11.crd
That file is itself a symlink into one of the previous CUs (the analysis step in Cycle 1):
lrwxrwxrwx 1 e290ib e290 114 Mar 25 13:41 min11.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/unit-5512ba65d7bf75b9ae1460f7/min11.crd
And that file appears to contain valid data:
e290ib@eslogin006:/work/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/unit-5512ba65d7bf75b9ae1460f7> head min11.crd
default_name
59
15.4120000 8.3360000 3.0810000 14.8320000 8.1970000 2.8460000
15.1830000 8.5420000 3.5390000 15.4040000 8.2920000 3.4110000
12.7410000 7.3830000 0.8000000 13.3660000 7.4450000 -1.3240000
10.2760000 6.6220000 1.4260000 9.9820000 6.5560000 3.1950000
7.9580000 5.7680000 -0.2820000 8.7040000 4.8660000 -1.7090000
5.9510000 4.5700000 0.6520000 5.7780000 5.3060000 1.7430000
4.5340000 3.5640000 0.2380000 6.1730000 3.9610000 0.7160000
6.6120000 7.1400000 -1.1190000 6.2220000 7.4560000 -1.8240000
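A quicker sanity check than eyeballing head output would be to count the values in the file against the atom count. A sketch, assuming the plain whitespace-separated Amber restart layout (title line, atom count, then 3 floats per atom); check_inpcrd is a hypothetical helper, not an existing tool:

```python
def check_inpcrd(path):
    """Rough sanity check for an Amber-style restart (.crd) file:
    line 1 is a title, line 2 starts with the atom count, and the
    body should hold at least 3 floats per atom. Assumes whitespace-
    separated values; fixed-width files whose columns run together
    would need a stricter parser."""
    with open(path) as f:
        lines = f.read().splitlines()
    if len(lines) < 2:
        return False, "file truncated or empty"
    try:
        natom = int(lines[1].split()[0])
    except (IndexError, ValueError):
        return False, "no atom count on line 2"
    values = [float(x) for line in lines[2:] for x in line.split()]
    if len(values) < 3 * natom:
        return False, "expected at least %d values, found %d" % (3 * natom, len(values))
    return True, "ok"
```

An empty file like the failing md1.crd would be reported as truncated immediately, instead of surfacing later as an MPI_Abort inside pmemd.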
There is nothing in the output from coco or the postexec.py to indicate that anything went wrong here…
Any ideas?
I am trying to run again to see if this is repeatable.
The failure is indeed repeatable. I ran again this morning and it failed in exactly the same way, with pmemd failing to open a file md1.crd, which appears to be valid...
md1.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5513ccf1d7bf7531524aee42/staging_area/iter1/md11.crd
I have downloaded everything, unzipped it, and explored the files locally.
My first impression is that the .crd file required by the failing unit exists but is empty:
ardita@poirot 120% pwd
/users/ardita/extasy_archer_investig/pilot-5512b723d7bf75b9ae1460e5/unit-5512bb34d7bf75b9ae146101
ardita@poirot 121% ls -l
total 9912
-rw------- 1 ardita pa 10121216 Mar 25 13:42 core
lrwxrwxrwx 1 ardita pa 102 Mar 26 10:08 md1.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/iter1/md11.crd
-rw------- 1 ardita pa 2325 Mar 25 13:42 md1.out
lrwxrwxrwx 1 ardita pa 98 Mar 26 10:08 mdshort.in -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/mdshort.in
lrwxrwxrwx 1 ardita pa 97 Mar 26 10:08 penta.top -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/penta.top
-rwx------ 1 ardita pa 356 Mar 25 13:42 radical_pilot_cu_launch_script-_VvLL0.sh
-rw------- 1 ardita pa 1016 Mar 25 13:42 STDERR
-rw------- 1 ardita pa 137 Mar 25 13:42 STDOUT
ardita@poirot 122% ls -l ../staging_area/iter1/md11.crd
lrwxrwxrwx 1 ardita pa 112 Mar 26 10:08 ../staging_area/iter1/md11.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/unit-5512bb28d7bf75b9ae1460f9/md1.crd
ardita@poirot 123% ls -l ../unit-5512bb28d7bf75b9ae1460f9/md1.crd
-rw------- 1 ardita pa 0 Mar 25 13:53 ../unit-5512bb28d7bf75b9ae1460f9/md1.crd
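The manual ls -l chase above could be automated. A small sketch (symlink_chain is a hypothetical helper) that follows a symlink chain hop by hop and reports the size of the final target, so an empty or dangling file at the end of the chain shows up immediately:

```python
import os

def symlink_chain(path):
    """Follow a symlink chain, returning the list of paths visited
    and the size of the final target (None if it is dangling)."""
    hops = [path]
    seen = {path}
    while os.path.islink(path):
        target = os.readlink(path)
        # relative link targets are resolved against the link's directory
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        if path in seen:          # guard against symlink loops
            break
        seen.add(path)
        hops.append(path)
    size = os.path.getsize(path) if os.path.exists(path) else None
    return hops, size
```

Applied to unit-5512bb34d7bf75b9ae146101/md1.crd, this would have reported the two-hop chain ending in the zero-byte md1.crd in one call.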
Just to add that I have investigated the unit folders of CU 5512bb34d7bf75b9ae146101, since it was the first to fail, as I detected from the log Iain provided:
2015:03:25 13:42:26 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '5512bb34d7bf75b9ae146101' state changed from 'Executing' to 'Failed'.
In the CU that should generate the .crd file (which is currently empty!), we did not notice any clue about what might have gone wrong. The STDERR file is empty too, and the STDOUT file does not help:
more STDOUT
Application 13349432 resources: utime ~0s, stime ~0s, Rss ~4104, inblocks ~11452, outblocks ~30287
Does anyone know whether AMBER has been recompiled on ARCHER?
Recently, on local machines here, Charlie noticed that a recompiled version of AMBER gave problems when dealing with restart files such as the .crd one we need here: the produced restart files would have a different name from the one the user requested.
In the directory that should have produced the correct (and not the empty) .crd file (CU unit-5512bb28d7bf75b9ae1460f9), I just noticed that the referenced penta.top file is an empty file ...
/users/ardita/extasy_archer_investig/pilot-5512b723d7bf75b9ae1460e5/unit-5512bb28d7bf75b9ae1460f9
ardita@poirot 179% ls
logfile min1.crd min1.out penta.crd radical_pilot_cu_launch_script-onhz2h.sh STDOUT
md1.crd min1.inf min.in penta.top STDERR
ardita@poirot 180% ls -l
total 48
-rw------- 1 ardita pa 2347 Mar 25 13:42 logfile
-rw------- 1 ardita pa 0 Mar 25 13:53 md1.crd
lrwxrwxrwx 1 ardita pa 103 Mar 26 10:08 min1.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/iter1/min11.crd
-rw------- 1 ardita pa 409 Mar 25 13:42 min1.inf
-rw------- 1 ardita pa 12839 Mar 25 13:42 min1.out
lrwxrwxrwx 1 ardita pa 94 Mar 26 10:08 min.in -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/min.in
lrwxrwxrwx 1 ardita pa 97 Mar 26 10:08 penta.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/penta.crd
lrwxrwxrwx 1 ardita pa 97 Mar 26 10:08 penta.top -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5512b723d7bf75b9ae1460e5/staging_area/penta.top
-rwx------ 1 ardita pa 358 Mar 25 13:42 radical_pilot_cu_launch_script-onhz2h.sh
-rw------- 1 ardita pa 0 Mar 25 13:42 STDERR
-rw------- 1 ardita pa 99 Mar 25 13:42 STDOUT
ardita@poirot 181% ls -l ../staging_area/pen
penta.crd penta.top
ardita@poirot 181% ls -l ../staging_area/penta.top
-rw-r--r-- 1 ardita pa 0 Mar 25 13:53 ../staging_area/penta.top
OK I see the problem now!
CU unit-*101 (the failing CU) starts before the preceding CU unit-*0f9 has completed.
In the extasy.callbacks file, there is never a callback saying that CU *0f9 is Done before CU *101 stages its input files and starts executing.
In this case the overlap must be quite close, since the first CU does at least start before the second.
However, I ran another example (files in /work/e290/e290/shared/iain/pilot-5513e4a5d7bf759c63c121b8), where the CU which fails (unit-5513e70fd7bf759c63c121d4) has in its output:
| Run on 03/26/2015 at 11:01:38
This CU requires the file md1.crd
md1.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5513e4a5d7bf759c63c121b8/staging_area/iter1/md11.crd
/fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5513e4a5d7bf759c63c121b8/staging_area/iter1/md11.crd -> /fs4/e290/e290/e290ib/radical.pilot.sandbox/pilot-5513e4a5d7bf759c63c121b8/unit-5513e6f2d7bf759c63c121cc/md1.crd
In CU *1cc, the output says:
| Run on 03/26/2015 at 11:01:50
i.e. the dependency between these two CUs is not being enforced correctly. The CU which produces the file must run to completion before the CU which consumes it starts.
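The constraint identified here, producer must complete before consumer starts, can be sketched as a generic submission loop. This is illustrative only: submit, wait_done, and the deps map are stand-ins, not the actual ExTASY/RADICAL-Pilot API.

```python
def run_in_dependency_order(units, deps, submit, wait_done):
    """Submit units so that every producer of a unit's input files
    has been waited to completion before the consumer is submitted.
    `units` is assumed to be topologically sorted; `deps[u]` lists
    the units whose output files u consumes."""
    completed = set()
    for unit in units:
        for producer in deps.get(unit, []):
            if producer not in completed:
                wait_done(producer)   # block until its outputs exist
                completed.add(producer)
        submit(unit)
```

With this ordering, the CU producing md11.crd would be waited on before the simulation CU consuming it is ever submitted, instead of the two racing as seen above.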
I also see a problem with the empty penta.top file in the staging area ... maybe there was a transfer failure for the penta.top file ...
Yes, you're right. The mdshort.in in the staging area is empty as well. However, those files were used in previous successful CUs, so I don't know whether this is really a problem or not. The incorrect ordering of CUs I reported above, however, certainly is a bug (I hope).
The ordering of CU has been fixed in devel.
On Friday my Coco/Amber job running from my Mac on ARCHER encountered problems:
ls -lrt
-rwx------ 1 ebreitmo e290      358 Feb 27 11:27 radical_pilot_cu_launch_script-uyPGKU.sh
-rw------- 1 ebreitmo e290      137 Feb 27 11:27 STDOUT
-rw------- 1 ebreitmo e290     1016 Feb 27 11:27 STDERR
-rw------- 1 ebreitmo e290     2325 Feb 27 11:27 md0.out
-rw------- 1 ebreitmo e290 10121216 Feb 27 11:27 core
Unit 9 Error on OPEN: md0.crd
Rank 0 [Fri Feb 27 11:27:42 2015] [c1-2c2s9n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
0 0x8ABA6D in _gfortran_backtrace at backtrace.c:258
1 0x8947E0 in _gfortrani_backtrace_handler at compile_options.c:129
2 0x90C01F in raise
3 0x90BFDB in raise at pt-raise.c:41
4 0x91C5D0 in abort at abort.c:92
5 0x7DED71 in MPID_Abort
6 0x7BFBD2 in MPI_Abort
7 0x797BB4 in pmpi_abort
8 0x4A2156 in __pmemd_lib_mod_MOD_mexit
9 0x4ADE49 in __file_io_mod_MOD_amopen
10 0x41F363 in __inpcrd_dat_mod_MOD_init_inpcrd_dat
11 0x4CF27E in __master_setup_mod_MOD_master_setup
12 0x4B65AC in MAIN__ at pmemd.F90:0
_pmiu_daemon(SIGCHLD): [NID 03430] [c1-2c2s9n2] [Fri Feb 27 11:27:42 2015] PE RANK 0 exit signal Aborted [NID 03430] 2015-02-27 11:27:42 Apid 13064419: initiated application termination
I can’t re-run it right now, as there are some issues on ARCHER.
Elena