Open walleludvig opened 2 years ago
Hi @mchamberland,
If I remember correctly, you recently purchased an M1 Mac. Any chance you've run any parallel processing jobs on it yet?
Cheers, LW
@walleludvig I have not yet tried it on my new Mac, but in the past, I’ve had to modify the script to run on macOS. Let me see if I’ve kept a copy of my modifications…
@walleludvig My version works.
I didn’t bother trying the official version distributed with EGSnrc. I’ll try that one tonight and see if I also get an error.
The official version does not seem to work out of the box. Only one job runs. Not sure if an error gets output somewhere.
Thanks for your response @mchamberland,
All jobs launch successfully for me as well, but each one immediately fails with Abort trap: 6.
Any ideas on what I might attempt to resolve this?
@walleludvig Sadly, no. I know next to nothing about bash scripting. Off the top of my head, I would try submitting just one job with the egs-parallel script. Does that fail? If so, then I’d remove the > /dev/null 2>&1 part of the line in the script and see if it works.
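A minimal sketch of that idea, assuming a hypothetical helper named run_logged (it is not part of EGSnrc): rather than discarding output with > /dev/null 2>&1, send both stdout and stderr of a single job to a log file so that any error message survives.

```shell
# Hypothetical helper, not part of EGSnrc: run a command, keep its output
# in a log file instead of /dev/null, and return the command's exit status.
run_logged() {
    local logfile=$1
    shift
    local status=0
    # Capture both stdout and stderr; remember the exit status on failure.
    "$@" > "$logfile" 2>&1 || status=$?
    echo "exit status: $status" >> "$logfile"
    return "$status"
}

# Example with the command from this thread (the log file name is arbitrary):
# run_logged job1.log BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1
```

Reading the log file afterwards should show whatever the job printed before it stopped.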
@ftessier any ideas?
Thanks for the suggestions anyway, @mchamberland. However, still no luck with any number of jobs submitted using the egs-parallel.sh script.
For reference, I do know that the job I am trying to submit can run (i.e., I have run it the normal single-thread way: <accelerator> -i <inputfile> -p <pegs4data>).
@walleludvig How about you try the following:
egs-parallel -n 4 -d 2 -f -v -c 'the command you are trying to run'
Use the egs-parallel that's distributed with EGSnrc. The verbose option might tell us something.
> The official version does not seem to work out of the box. Only one job runs. Not sure if an error gets output somewhere.
For the record, this was caused by using a delay of 0 for job submissions, which meant the lock file did not have time to be created before the other jobs tried to access it. Setting the delay to a non-zero value resolved this issue.
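To illustrate the delay described above, here is a rough sketch of staggered job submission. It is not the egs-parallel-cpu implementation; launch_staggered is a hypothetical helper, and it assumes the job command accepts the job index as its last argument.

```shell
# Hypothetical sketch: launch N background jobs with a pause between
# submissions, so the first job has time to create any shared file
# (such as a lock file) before the next job starts.
launch_staggered() {
    local njobs=$1
    local delay=$2
    shift 2
    local j=1
    while [ "$j" -le "$njobs" ]; do
        # The job index is appended as the last argument of the command.
        "$@" "$j" &
        if [ "$j" -lt "$njobs" ]; then
            sleep "$delay"
        fi
        j=$((j + 1))
    done
    # Block until every background job has finished.
    wait
}

# Example: launch_staggered 4 2 my_job_wrapper
# (my_job_wrapper is a placeholder for a script that runs job number $1)
```

With a delay of 0 all jobs start at essentially the same time, which is the situation in which the lock file problem appeared.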
@walleludvig it seems you are using an old script. Use instead the egs-parallel script that is included with EGSnrc; this is the one @mchamberland is talking about. I will comment further when I am back at the office.
Thanks @mchamberland & @ftessier,
Using the included EGSnrc egs-parallel (at HEN_HOUSE/scripts/bin) I get the same error. Here's the verbose output:
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: BEGIN egs-parallel
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: EGSnrc environment:
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: HEN_HOUSE = /Users/ludvigwalle/EGSnrc/HEN_HOUSE/
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: EGS_HOME = /Users/ludvigwalle/EGSnrc/egs_home/
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: EGS_CONFIG = /Users/ludvigwalle/EGSnrc/HEN_HOUSE/specs/osx.conf
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: parallel options:
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: batch = cpu
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: queue = long
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: nthread = 4
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: delay = 2
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: command = BEAM_XSample_mod -i XSample_mod -p XSample
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: basename = XSample_mod
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: first job = 1
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: options =
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: log file: /Users/ludvigwalle/EGSnrc/egs_home/BEAM_XSample_mod/XSample_mod.egsparallel
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: cd /Users/ludvigwalle/EGSnrc/egs_home/BEAM_XSample_mod
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: EXEC egs-parallel-cpu long 4 2 1 XSample_mod 'BEAM_XSample_mod -i XSample_mod -p XSample' '' verbose
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: BEGIN /Users/ludvigwalle/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: BEGIN host=192-168-1-109.tpgi.com.au
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0001: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0001: host=192-168-1-109.tpgi.com.au pid=10285
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0002: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 2 -f 1
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0002: host=192-168-1-109.tpgi.com.au pid=10302
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0003: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 3 -f 1
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0003: host=192-168-1-109.tpgi.com.au pid=10312
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0004: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 4 -f 1
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0004: host=192-168-1-109.tpgi.com.au pid=10322
/Users/ludvigwalle/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu: line 154: 10285 Abort trap: 6 $runcommand > /dev/null 2>&1
/Users/ludvigwalle/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu: line 154: 10302 Abort trap: 6 $runcommand > /dev/null 2>&1
/Users/ludvigwalle/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu: line 154: 10312 Abort trap: 6 $runcommand > /dev/null 2>&1
/Users/ludvigwalle/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu: line 154: 10322 Abort trap: 6 $runcommand > /dev/null 2>&1
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: DONE.
Thanks for your support.
@walleludvig Hmm... It looks like it's a problem with your simulation running in parallel, in my opinion. You say it runs fine when you launch it interactively?
How about if you just launch it straight from the command line, but with the parallel options, so you can see what error it produces, i.e.:
BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1
Just run that in the terminal.
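When running the command by hand, the exit status can also help. In bash and most POSIX shells, a process killed by signal N reports status 128 + N, so Abort trap: 6 (SIGABRT) corresponds to status 134. The helper below is a hypothetical sketch for decoding that value, not something shipped with EGSnrc.

```shell
# Hypothetical helper: translate a shell exit status into a short description.
describe_status() {
    local status=$1
    if [ "$status" -eq 0 ]; then
        echo "success"
    elif [ "$status" -gt 128 ]; then
        # Statuses above 128 conventionally mean "terminated by a signal".
        echo "killed by signal $((status - 128))"
    else
        echo "exited with error code $status"
    fi
}

# Example, right after running the job in the terminal:
# BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1
# describe_status $?
```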
@mchamberland Yes, when I launch it the conventional way,
BEAM_XSample_mod -i XSample_mod -p XSample
the job runs completely fine.
If I launch the job as
BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1
it also runs completely fine, so I can launch each of the parallel jobs manually:
BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1 &
BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 2 -f 1 &
etc.
I therefore tried removing
$runcommand >/dev/null 2>&1 &
from line 142 of the egs-parallel-cpu script that egs-parallel calls. As a result, the Abort trap: 6 error no longer appears, but all the output files from my jobs seem to be missing (e.g., the phase space file that I am scoring). The jobs only output a .egsjob file and a .egsparallel (log) file.
The output to the terminal (and the log file):
(base) ludvigwalle@192-168-1-109 ~ % egs-parallel -n 4 -d 2 -f -v -c 'BEAM_XSample_mod -i XSample_mod -p XSample'
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: BEGIN egs-parallel
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: EGSnrc environment:
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: HEN_HOUSE = /Users/ludvigwalle/EGSnrc/HEN_HOUSE/
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: EGS_HOME = /Users/ludvigwalle/EGSnrc/egs_home/
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: EGS_CONFIG = /Users/ludvigwalle/EGSnrc/HEN_HOUSE/specs/osx.conf
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: parallel options:
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: batch = cpu
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: queue = long
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: nthread = 4
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: delay = 2
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: command = BEAM_XSample_mod -i XSample_mod -p XSample
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: basename = XSample_mod
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: first job = 1
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: options =
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: log file: /Users/ludvigwalle/EGSnrc/egs_home/BEAM_XSample_mod/XSample_mod.egsparallel
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: cd /Users/ludvigwalle/EGSnrc/egs_home/BEAM_XSample_mod
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: EXEC egs-parallel-cpu long 4 2 1 XSample_mod 'BEAM_XSample_mod -i XSample_mod -p XSample' '' verbose
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: BEGIN /Users/ludvigwalle/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu
printf: usage: printf [-v var] format [arguments]
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: BEGIN host=192-168-1-109.tpgi.com.au
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0001: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1 &
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0001: host=192-168-1-109.tpgi.com.au pid=
printf: usage: printf [-v var] format [arguments]
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0002: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 2 -f 1 &
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0002: host=192-168-1-109.tpgi.com.au pid=
printf: usage: printf [-v var] format [arguments]
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0003: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 3 -f 1 &
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0003: host=192-168-1-109.tpgi.com.au pid=
printf: usage: printf [-v var] format [arguments]
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0004: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 4 -f 1 &
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0004: host=192-168-1-109.tpgi.com.au pid=
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: DONE.
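One possible reading of the empty pid= fields in this log, offered as an assumption rather than a confirmed diagnosis: if the script takes the process ID from the shell's $! parameter, that value stays empty when no background job was actually launched. The sketch below only demonstrates the behaviour of $! and does not reproduce the egs-parallel-cpu code.

```shell
# In a fresh shell no background job exists yet, so $! expands to nothing.
pid_before=$(sh -c 'echo "pid=$!"')

# After launching a background job, $! holds that job's process ID.
pid_after=$(sh -c 'sleep 0 & echo "pid=$!"; wait')

echo "without a background job: $pid_before"
echo "with a background job:    $pid_after"
```

If that assumption holds, the missing output files would be consistent with the simulations never having been started.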
Hi,
I am writing here to discuss an issue I am having when attempting to run parallel jobs using the script supplied here (egs-parallel.sh): https://egsnrcarchive.home.blog/2014/04/23/running-egsnrc-codes-on-multiple-cores-without-a-queuing-system/comment-page-1/#comment-5246.
I am trying to use it to run 10 parallel jobs, but each job fails with an Abort trap: 6 error: "line 154: <some pid #> Abort trap: 6 $runcommand > /dev/null 2>&1"
Any tips on how I might resolve this issue?
(I am running on a Mac with an M1 chip, macOS Monterey 12.5.)