nrc-cnrc / EGSnrc

Toolkit for Monte Carlo simulation of ionizing radiation — Trousse d'outils logiciels pour la simulation Monte Carlo du rayonnement ionisant
http://nrc-cnrc.github.io/EGSnrc
GNU Affero General Public License v3.0

Parallel jobs fail to lock or rewind the control file at end of run #438

Closed mchamberland closed 3 years ago

mchamberland commented 6 years ago

I'm seeing the same behaviour that @ojalaj reported in the comments of PR #368: parallel jobs end with an error ("failed to lock or rewind the control file").

For what it's worth, I see it when running a BEAM accelerator with an IAEA phase space source and also when running DOSXYZnrc with source 20 and a shared library (either BEAM or external).

Not sure if this is general lock file troubles or if there's anything else going on.

ojalaj commented 5 years ago

Great - thank you @rtownson !

edit: And yes, it is well-documented (https://nrc-cnrc.github.io/EGSnrc/doc/pirs898/common.html), so I should have looked there first!

Abdullah-Abuhaimed commented 4 years ago

Dear all,

I use EGSnrc 2020. So far everything works properly except for parallel jobs, which fail at startup with an error similar to the one reported here. I have tried different user codes (beamnrc, dosxyznrc, and cavity) and got the same error each time. Our cluster uses Slurm; when I submit parallel jobs, they do not run at all and I get the error below:

***** Error:

Failed to create a lock file named /home/aabuhaimed/EGSnrc/......lock

***** Quiting now.

and I get the lines below in each error file:

egsLockControlFile: failed to lock file for 12 seconds...

egsLockControlFile: failed to lock file for 12 seconds...

egsLockControlFile: failed to lock file for 12 seconds...

egsLockControlFile: failed to lock file for 12 seconds...

egsLockControlFile: failed to lock file for 12 seconds...

egsLockControlFile: failed to lock file after 1 minute wait!

The lock file is created by the first job in the same directory, but it is empty and the jobs do not run. I have tried several approaches, but none of them worked. Any idea how to fix this issue?
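The log lines above (five "12 seconds" messages followed by a "1 minute" failure) suggest a bounded retry loop around a non-blocking file lock. The following is a minimal Python sketch of that pattern, not the EGSnrc implementation: the round and try counts are inferred from the quoted output, the use of a POSIX advisory lock via `fcntl` is an assumption, and the names `lock_with_retry`, `fcntl_try_lock`, and `check_directory` are hypothetical. The `check_directory` helper can serve as a quick check of whether a given directory accepts advisory locks at all, which is one thing worth ruling out on a shared cluster file system.

```python
import fcntl
import tempfile
import time


def fcntl_try_lock(fd):
    """Attempt a non-blocking exclusive advisory lock; return True on success."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        # Raised when another process holds the lock, or when the
        # file system does not support locking.
        return False


def lock_with_retry(try_lock, rounds=5, tries_per_round=12, pause=1.0,
                    sleep=time.sleep):
    """Poll try_lock() until it succeeds or rounds * tries_per_round attempts fail.

    Returns (locked, messages). The default counts are assumptions inferred
    from the log output quoted in this thread.
    """
    messages = []
    for _ in range(rounds):
        for _ in range(tries_per_round):
            if try_lock():
                return True, messages
            sleep(pause)
        waited = int(tries_per_round * pause)
        messages.append(f"failed to lock file for {waited} seconds...")
    total = int(rounds * tries_per_round * pause)
    messages.append(f"failed to lock file after {total} seconds")
    return False, messages


def check_directory(directory):
    """Return True if a scratch file in `directory` can be locked immediately."""
    with tempfile.TemporaryFile(dir=directory) as handle:
        return fcntl_try_lock(handle.fileno())
```

Calling `check_directory` with the directory that holds the job's lock file, on a compute node, shows whether a lock can be taken there when nothing else is competing for it. If it returns `False`, the problem is more likely the file system than the timing between jobs.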

rtownson commented 4 years ago

Hi @Abdullah-Abuhaimed, I see that @blakewalters is addressing your question on reddit and via email, so we will not follow up here.

TTianCui commented 4 years ago

Dear all, I am running a BEAM accelerator with an IAEA phase space source and DOSXYZnrc with source 20 and a shared library. When the DOSXYZnrc photon splitting number is set to ≤1, the parallel jobs complete successfully. However, when the photon splitting number is set to >1, a single (non-parallel) run completes but the parallel jobs do not: when I submit them, they all run for a short time and then fail. The egslog file ends at 'will perform charged-particle range rejection against voxel boundaries'. I have tried increasing and decreasing the number of parallel jobs and the number of histories, but it still does not work. Any idea about this issue?

jedarko commented 4 years ago

Was there a resolution to this issue? I seem to have the exact same problem.

crcrewso commented 4 years ago

There are two solutions. First, batch_options should have a control for the wait time between jobs; for Slurm, try setting it to something large, like 5 seconds (make sure it is not a multiple of 12; primes are probably better here).

Or you could try https://github.com/crcrewso/EGSnrc/commit/247567404d0e5a7aeddc69b0418bbfda359e475a

Edit: I forgot that I had submitted this as PR #499.
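The first suggestion above (a wait time between jobs that is not a multiple of 12) can be illustrated with a little arithmetic. The sketch below assumes that what matters is how job start times line up against the 12-second retry window seen in the error logs; that is the commenter's conjecture rather than something confirmed in this thread, and the function names are hypothetical. Under that assumption, the useful property is that the delay shares no common factor with 12, which is a slightly stricter condition than being prime (2 and 3 are prime but divide 12).

```python
from math import gcd


def start_offsets(n_jobs, delay):
    """Start time in seconds of each job when submissions are `delay` seconds apart."""
    return [i * delay for i in range(n_jobs)]


def distinct_phases(delay, period=12):
    """Number of different values of (start time mod period) over many jobs.

    `delay` and `period` are whole seconds; period=12 is taken from the
    "failed to lock file for 12 seconds" messages and is an assumption.
    """
    return period // gcd(delay, period)


for delay in (5, 6, 12):
    phases = [t % 12 for t in start_offsets(6, delay)]
    print(delay, phases, distinct_phases(delay))
# 5 [0, 5, 10, 3, 8, 1] 12
# 6 [0, 6, 0, 6, 0, 6] 2
# 12 [0, 0, 0, 0, 0, 0] 1
```

With a 5-second delay every job lands on a different point of the 12-second window, whereas a 12-second delay puts all jobs on the same point.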

jedarko commented 4 years ago

I tried crcrewso@2475674 and it worked like a charm. Thanks!

ftessier commented 3 years ago

I see that #499 was merged and there were additional improvements regarding lock file issues in Release 2021, notably the uniform run control object (#588) and the new egs-parallel scripts (#628). Hence I will close this Issue for now. Don't hesitate to reopen it if the infamous lock file rears its ugly head again 😄 .