nrc-cnrc / EGSnrc

Toolkit for Monte Carlo simulation of ionizing radiation — Trousse d'outils logiciels pour la simulation Monte Carlo du rayonnement ionisant
http://nrc-cnrc.github.io/EGSnrc
GNU Affero General Public License v3.0

egs-parallel.sh - Abort trap: 6 #932

Open walleludvig opened 2 years ago

walleludvig commented 2 years ago

Hi,

I am writing here to discuss an issue I am having when attempting to run parallel jobs using the script supplied here (egs-parallel.sh): https://egsnrcarchive.home.blog/2014/04/23/running-egsnrc-codes-on-multiple-cores-without-a-queuing-system/comment-page-1/#comment-5246.

I am trying to use it to run 10 parallel jobs, but I get an Abort trap: 6 from each job: "line 154: <some pid #> Abort trap: 6 $runcommand > /dev/null 2>&1"

Any tips on how I might resolve this issue?

(I am running on a Mac with an M1 chip, macOS Monterey 12.5.)

walleludvig commented 2 years ago

Hi @mchamberland,

If I remember correctly, you recently purchased an M1. Any chance you've run any parallel processing jobs on it yet?

Cheers, LW

mchamberland commented 2 years ago

@walleludvig I have not yet tried it on my new Mac, but in the past, I’ve had to modify the script to run on macOS. Let me see if I’ve kept a copy of my modifications…

mchamberland commented 2 years ago

@walleludvig My version works.

egs-parallel.sh.txt

mchamberland commented 2 years ago

I didn’t bother trying the official version distributed with EGSnrc. I’ll try that one tonight and see if I also get an error.

mchamberland commented 2 years ago

The official version does not seem to work out of the box. Only one job runs. Not sure if an error gets output somewhere.

walleludvig commented 2 years ago

Thanks for your response @mchamberland,

All jobs launch successfully for me as well, but each one immediately returns the Abort trap: 6 error.

Any ideas on what I might attempt to resolve this?

mchamberland commented 2 years ago

@walleludvig Sadly, no. I know next to nothing about bash scripting. Off the top of my head, I would try submitting just one job with the egs-parallel script. Does that fail? If so, then I’d remove the > /dev/null 2>&1 part of the line in the script and see if it works.

@ftessier any ideas?
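As a rough sketch of the idea above (sending the job output somewhere visible instead of /dev/null), the helper below runs a command in the background with stdout and stderr captured in a log file, then reports its pid and exit status. The name run_logged and the log file name are hypothetical, and this is not the code in the egs-parallel scripts, only an illustration of the redirect.

```shell
# Hypothetical helper: run a command in the background, keeping its
# stdout and stderr in a log file instead of discarding them.
run_logged() {
    log=$1
    shift
    # Same shape as "$runcommand > /dev/null 2>&1 &", but with a real file.
    "$@" > "$log" 2>&1 &
    pid=$!
    status=0
    # wait returns the exit status of the background job.
    wait "$pid" || status=$?
    echo "pid=$pid status=$status log=$log"
}

# Example (assumes the EGSnrc application is on the PATH):
# run_logged job1.log BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1
```

Any message the application prints before aborting should then end up in the log file.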

walleludvig commented 2 years ago

Thanks for the suggestions anyway, @mchamberland. However, still no luck with any number of jobs submitted using the egs-parallel.sh script.

For reference, I do know that the job I am trying to submit can run (i.e., I have run it the normal single-thread way: <accelerator> -i <inputfile> -p <pegs4data>)

mchamberland commented 2 years ago

@walleludvig How about you try the following:

egs-parallel -n 4 -d 2 -f -v -c '<the command you are trying to run>'

Use the egs-parallel that's distributed with EGSnrc. The verbose option might tell us something.

mchamberland commented 2 years ago

> The official version does not seem to work out of the box. Only one job runs. Not sure if an error gets output somewhere.

For the record, this was caused by using a delay of 0 for job submissions, which meant the lock file did not have time to be created before the other jobs tried to access it. Setting the delay to a non-zero value resolved this issue.
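To illustrate why the delay matters, here is a minimal sketch of staggered job submission: each job is started in the background, and the launcher sleeps between launches so the first job has time to create any shared file (such as the lock file) before the next one starts. The function name launch_staggered and the way the job index is passed are assumptions for illustration; the actual egs-parallel-cpu script may do this differently.

```shell
# Hypothetical helper: start njobs copies of a command, pausing `delay`
# seconds between launches. The job index is appended as the last argument.
launch_staggered() {
    njobs=$1
    delay=$2
    shift 2
    pids=""
    job=1
    while [ "$job" -le "$njobs" ]; do
        "$@" "$job" &
        pids="$pids $!"
        # No need to sleep after the last job has been started.
        if [ "$job" -lt "$njobs" ]; then
            sleep "$delay"
        fi
        job=$((job + 1))
    done
    # Block until every background job has finished.
    wait $pids
}

# Example with a stand-in worker: launch_staggered 4 2 echo "starting job"
```

With a delay of 0 the jobs start essentially at the same time, which matches the situation described above.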

ftessier commented 2 years ago

@walleludvig it seems you are using an old script. Use instead the egs-parallel script that is included with EGSnrc; this is the one @mchamberland is talking about. I will comment further when I am back at the office.

walleludvig commented 2 years ago

Thanks @mchamberland & @ftessier,

Using the included EGSnrc egs-parallel (at HEN_HOUSE/scripts/bin) I get the same error. Here's the verbose output:

EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: BEGIN egs-parallel
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: EGSnrc environment:
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N:     HEN_HOUSE  = /Users/ludvigwalle/EGSnrc/HEN_HOUSE/
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N:     EGS_HOME   = /Users/ludvigwalle/EGSnrc/egs_home/
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N:     EGS_CONFIG = /Users/ludvigwalle/EGSnrc/HEN_HOUSE/specs/osx.conf
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: parallel options:
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N:     batch      = cpu
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N:     queue      = long
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N:     nthread    = 4
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N:     delay      = 2
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N:     command    = BEAM_XSample_mod -i XSample_mod -p XSample
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N:     basename   = XSample_mod
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N:     first job  = 1
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N:     options    = 
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: log file: /Users/ludvigwalle/EGSnrc/egs_home/BEAM_XSample_mod/XSample_mod.egsparallel
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: cd /Users/ludvigwalle/EGSnrc/egs_home/BEAM_XSample_mod
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: EXEC egs-parallel-cpu long 4 2 1 XSample_mod 'BEAM_XSample_mod -i XSample_mod -p XSample' '' verbose
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: BEGIN /Users/ludvigwalle/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: BEGIN host=192-168-1-109.tpgi.com.au
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0001: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0001: host=192-168-1-109.tpgi.com.au pid=10285
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0002: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 2 -f 1
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0002: host=192-168-1-109.tpgi.com.au pid=10302
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0003: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 3 -f 1
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0003: host=192-168-1-109.tpgi.com.au pid=10312
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0004: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 4 -f 1
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: job 0004: host=192-168-1-109.tpgi.com.au pid=10322
/Users/ludvigwalle/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu: line 154: 10285 Abort trap: 6           $runcommand > /dev/null 2>&1
/Users/ludvigwalle/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu: line 154: 10302 Abort trap: 6           $runcommand > /dev/null 2>&1
/Users/ludvigwalle/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu: line 154: 10312 Abort trap: 6           $runcommand > /dev/null 2>&1
/Users/ludvigwalle/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu: line 154: 10322 Abort trap: 6           $runcommand > /dev/null 2>&1
EGSnrc egs-parallel 2022-10-25 (UTC) 22:45:32.N: DONE.

Thanks for your support.

mchamberland commented 2 years ago

@walleludvig Hmm... It looks like it's a problem with your simulation running in parallel, in my opinion. You say it runs fine when you launch it interactively?

How about if you just launch it straight from the command line, but with the parallel options, so you can see what error it produces, i.e.:

BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1

Just run that in the terminal.
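When running the command by hand, the exit status can also help. A process killed by a signal is reported by the shell with status 128 plus the signal number, so "Abort trap: 6" (SIGABRT, signal 6) shows up as 134. A small sketch, with the hypothetical helper name run_and_report:

```shell
# Hypothetical helper: run a command in the foreground and report how it ended.
run_and_report() {
    status=0
    "$@" || status=$?
    if [ "$status" -gt 128 ]; then
        # Statuses above 128 mean "terminated by signal (status - 128)";
        # signal 6 is SIGABRT, which the shell prints as "Abort trap: 6" on macOS.
        echo "killed by signal $((status - 128))"
    else
        echo "exited with status $status"
    fi
}

# Example (assumes the EGSnrc application is on the PATH):
# run_and_report BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1
```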

walleludvig commented 2 years ago

@mchamberland Yes, when I launch it the conventional way (BEAM_XSample_mod -i XSample_mod -p XSample), the job runs completely fine.

If I launch the job as BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1, it also runs completely fine, so I can launch each of the parallel jobs 'manually' with:

BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1 &
BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 2 -f 1 &
etc.

I therefore tried removing $runcommand >/dev/null 2>&1 & from line 142 of the egs-parallel-cpu script that egs-parallel calls. As a result, the Abort trap: 6 error no longer appears, but all the output files from my jobs seem to be missing (e.g., the phase space file that I am scoring). The jobs only output a .egsjob file and a .egsparallel (log) file. The output to the terminal (and the log file):

(base) ludvigwalle@192-168-1-109 ~ % egs-parallel -n 4 -d 2 -f -v -c 'BEAM_XSample_mod -i XSample_mod -p XSample'
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: BEGIN egs-parallel
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: EGSnrc environment:
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N:     HEN_HOUSE  = /Users/ludvigwalle/EGSnrc/HEN_HOUSE/
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N:     EGS_HOME   = /Users/ludvigwalle/EGSnrc/egs_home/
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N:     EGS_CONFIG = /Users/ludvigwalle/EGSnrc/HEN_HOUSE/specs/osx.conf
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: parallel options:
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N:     batch      = cpu
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N:     queue      = long
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N:     nthread    = 4
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N:     delay      = 2
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N:     command    = BEAM_XSample_mod -i XSample_mod -p XSample
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N:     basename   = XSample_mod
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N:     first job  = 1
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N:     options    = 
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: log file: /Users/ludvigwalle/EGSnrc/egs_home/BEAM_XSample_mod/XSample_mod.egsparallel
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: cd /Users/ludvigwalle/EGSnrc/egs_home/BEAM_XSample_mod
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: EXEC egs-parallel-cpu long 4 2 1 XSample_mod 'BEAM_XSample_mod -i XSample_mod -p XSample' '' verbose
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: BEGIN /Users/ludvigwalle/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu
printf: usage: printf [-v var] format [arguments]
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: BEGIN host=192-168-1-109.tpgi.com.au
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0001: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 1 -f 1 &
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0001: host=192-168-1-109.tpgi.com.au pid=
printf: usage: printf [-v var] format [arguments]
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0002: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 2 -f 1 &
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0002: host=192-168-1-109.tpgi.com.au pid=
printf: usage: printf [-v var] format [arguments]
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0003: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 3 -f 1 &
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0003: host=192-168-1-109.tpgi.com.au pid=
printf: usage: printf [-v var] format [arguments]
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0004: RUN BEAM_XSample_mod -i XSample_mod -p XSample -b -P 4 -j 4 -f 1 &
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: job 0004: host=192-168-1-109.tpgi.com.au pid=
EGSnrc egs-parallel 2022-10-27 (UTC) 01:15:02.N: DONE.