Closed mchamberland closed 3 years ago
Great - thank you @rtownson !
edit: And yes, it is well-documented (https://nrc-cnrc.github.io/EGSnrc/doc/pirs898/common.html), so I should have looked there first!
Dear all,
I use EGSnrc 2020. So far everything works properly except for parallel jobs, which fail at startup with an error similar to the one reported here. I have tried different user codes (beamnrc, dosxyznrc, and cavity) and got the same error each time. Our scheduler is Slurm; when I submit parallel jobs, none of them run and I get the error below:
***** Error:
Failed to create a lock file named /home/aabuhaimed/EGSnrc/......lock
***** Quiting now.
and I get the lines below in each error file:
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file after 1 minute wait!
The lock file is created by the first job in the same directory, but it is empty and the jobs do not run. I have tried several approaches, but it still does not work. Any idea how to fix this issue?
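For readers unfamiliar with this failure pattern, here is a minimal sketch of a lock-with-retries loop that gives up after a fixed number of attempts, which is the behaviour the messages above suggest (five 12-second waits, then an error after one minute). It is not EGSnrc's actual implementation: the function names are hypothetical, and the use of `flock` is an assumption made only for illustration.

```python
import fcntl
import os
import time


def lock_with_retries(path, attempts=5, wait_seconds=12.0):
    """Try to take an exclusive lock on `path`, retrying a fixed number of times.

    Returns the open file descriptor on success, or None if every attempt failed.
    Hypothetical helper; the defaults mirror the 5 x 12 s seen in the log lines.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    for _ in range(attempts):
        try:
            # Non-blocking exclusive lock: raises at once if someone else holds it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            print(f"lock busy, waiting {wait_seconds:g} seconds before retrying...")
            time.sleep(wait_seconds)
    # Every attempt failed: give up, as the jobs in the report do after one minute.
    os.close(fd)
    return None


def unlock(fd):
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

If the first job to take the lock never releases it, every other job exhausts its retries and quits, which would match a lock file that exists but stays empty.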
Hi @Abdullah-Abuhaimed, I see that @blakewalters is addressing your question on reddit and via email, so we will not follow up here.
Dear all, I am running a BEAM accelerator with an IAEA phase space source and DOSXYZnrc with source 20 and a shared library. When the DOSXYZnrc photon splitting number is set to ≤1, the parallel jobs complete successfully. However, when the photon splitting number is set to >1, a single-threaded run still completes but the parallel jobs do not: when I submit them, they all run for a short time and then fail. The egslog file ends at 'will perform charged-particle range rejection against voxel boundaries'. I have tried increasing and decreasing both the number of parallel jobs and the number of histories, but it still does not work. Any idea what causes this?
Was there a resolution to this issue? I seem to have the exact same problem.
There are two solutions. First, in batch_options there should be a control for the wait time between jobs. For Slurm, try setting this to something large like 5 seconds (make sure it's not a multiple of 12; primes are probably better here).
Or you could try https://github.com/crcrewso/EGSnrc/commit/247567404d0e5a7aeddc69b0418bbfda359e475a
Edit: I forgot that I submitted this as PR #499.
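One way to read the advice about avoiding multiples of 12 is sketched below. It assumes the 12-second retry interval seen in the log messages, whole-second waits, and jobs launched at fixed intervals; the function names are hypothetical and this is an interpretation of the suggestion, not something taken from the EGSnrc scripts.

```python
from math import gcd

RETRY_PERIOD = 12  # seconds between lock attempts, as seen in the log messages


def start_offsets(n_jobs, wait):
    """Start time of each job modulo the retry period, for jobs launched `wait` s apart."""
    return [(i * wait) % RETRY_PERIOD for i in range(n_jobs)]


def distinct_phases(wait):
    """How many different phases (relative to the retry period) a given wait can produce."""
    return RETRY_PERIOD // gcd(wait, RETRY_PERIOD)


for wait in (5, 6, 12):
    print(wait, distinct_phases(wait), start_offsets(6, wait))
```

Under these assumptions, a wait of 12 seconds puts every job at the same phase of the retry cycle and a wait of 6 seconds gives only two phases, while a wait of 5 seconds (coprime with 12) spreads jobs over all twelve, so fewer of them retry the lock at the same moment.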
I tried crcrewso@2475674 and it worked like a charm. Thanks!
I see that #499 was merged and there were additional improvements regarding lock file issues in Release 2021, notably the uniform run control object (#588) and the new egs-parallel scripts (#628). Hence I will close this Issue for now. Don't hesitate to reopen it if the infamous lock file rears its ugly head again 😄.
I'm seeing the same behaviour that @ojalaj reported in the comments of PR #368: parallel jobs end with an error ("failed to lock or rewind the control file")
For what it's worth, I see it when running a BEAM accelerator with an IAEA phase space source and also when running DOSXYZnrc with source 20 and a shared library (either BEAM or external).
Not sure if this is general lock file troubles or if there's anything else going on.