nrc-cnrc / EGSnrc

Toolkit for Monte Carlo simulation of ionizing radiation
http://nrc-cnrc.github.io/EGSnrc
GNU Affero General Public License v3.0

Parallel jobs fail to lock or rewind the control file at end of run #438

Closed — mchamberland closed this issue 3 years ago

mchamberland commented 6 years ago

I'm seeing the same behaviour that @ojalaj reported in the comments of PR #368: parallel jobs end with an error ("failed to lock or rewind the control file")

For what it's worth, I see it when running a BEAM accelerator with an IAEA phase space source and also when running DOSXYZnrc with source 20 and a shared library (either BEAM or external).

Not sure if this is general lock file troubles or if there's anything else going on.

mchamberland commented 6 years ago

To clarify: the "failed to rewind the control file" does not necessarily happen at the end of the run. It can happen anytime throughout the run, which means if I submit 250 jobs, I may be left with only ~40 after a few minutes.

The message I see is:

... egsLockControlFile: failed to lock file for 12 seconds... egsLockControlFile: failed to lock file after 1 minute wait!

which is what seems to be killing the majority of my jobs.

Completely anecdotal, but I don't see this as often with a ~2013 version of EGSnrc. I haven't ruled out this being a user-account issue, though (I run the two versions of EGSnrc on different user accounts for now).

ojalaj commented 6 years ago

I have also experienced this race condition on the lock file. There seem to be at least two separate issues: 1) jobs actually killed during the simulation, and 2) jobs running properly in terms of calculation results, but with something going wrong in the last parts of the finishing scripts. My knowledge here is very limited, but as a cure for 1) I have tried reducing N_CHUNKS in src/egsnrc.macros, so that the jobs don't access the lock file as often, and increasing batch_sleep_time in scripts/batch_options.xxx, so that jobs don't all start (and access the lock file to pick up their next chunk of particles) within such a short time frame.

Do you know where the value (time) for how long a job keeps trying to access the lock file before it is killed is defined? Increasing that value might also help.

mchamberland commented 6 years ago

Thanks, those are good suggestions!

This is where the code tries to lock the file for 1 minute. Maybe I’ll try slightly increasing those numbers.

marenaud commented 6 years ago

I just want to echo @ojalaj and say that I also encountered the two different flavours of this issue, and also fixed 1) by increasing batch_sleep_time, which is annoying when splitting a < 2 minute job into 100+ parts because the batch_sleep_time ends up being a significant overhead. This has been happening for quite a few releases, so it's probably not due to a recent bug.

The best solution would probably be to finalize #341 and do away with lock files entirely... but when I looked into it, there seemed to be some issues with shared libraries.

mchamberland commented 6 years ago

I think I may have found why v2018 experiences more lock file errors than our ~2013 system.

In egs_c_utils.c, the 2013 version tries locking the file for 10 minutes before giving up, whereas in v2018 this is reduced to 1 minute. After increasing the time back to 10 minutes in v2018, I see far fewer failures (2 out of 250 vs >150 before).

I would personally suggest increasing the time back to 10 minutes, because the drawback of many failed runs is worse than having a job wait a few minutes to properly lock the lock file.

ftessier commented 6 years ago

Thanks @mchamberland; but that must have been a local change: from our end it seems it was always 1 minute... At any rate, it seems there would be no harm in increasing the wait time.

mchamberland commented 6 years ago

@ftessier Correct, I've confirmed that the change was done locally.

Since we're here, there seems to be duplicated code in cutils/egs_c_utils.c and pieces/egs_c_utils_unix.c. Any reason why? I was also confused about which file gets picked up during configuration.

crcrewso commented 6 years ago

This explains our currently much higher failure rate! Thank you!

When looking at the file, should we increase the number of loops or increase the sleep time?

mchamberland commented 6 years ago

I don't think it matters which one you increase, but we changed it to loop 60 times with a 10-second sleep each time, so it tries for 10 minutes in total.
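For reference, here is a minimal sketch of what such a fixed retry loop with those constants looks like; the function names and the fcntl-based non-blocking lock are illustrative assumptions, not the exact egs_c_utils.c code:

```c
#include <fcntl.h>
#include <unistd.h>

/* Non-blocking attempt to take an exclusive lock on the whole file. */
static int try_lock(int fd) {
    struct flock fl = {0};
    fl.l_type   = F_WRLCK;
    fl.l_whence = SEEK_SET;   /* offset 0, length 0 => whole file */
    return fcntl(fd, F_SETLK, &fl);
}

/* Retry ntry times, sleeping nsleep seconds between attempts:
 * ntry = 60 and nsleep = 10 give the 10-minute total mentioned above. */
int lock_control_file(int fd, int ntry, int nsleep) {
    for (int i = 0; i < ntry; ++i) {
        if (try_lock(fd) == 0) return 0;   /* lock acquired */
        sleep(nsleep);
    }
    return -1;                             /* timed out */
}
```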

But even with this change, I was still getting quite a few jobs failing. What seems to have helped is reducing the rate of lock attempts. In other words, given a typical simulation, figure out how many times per second your jobs will access the lock file (roughly Njobs * Nchunk / typical_simulation_time). In this case, Nchunk is a good candidate to decrease.
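As a rough illustration using numbers from this thread (the run time is an assumption): with 250 jobs, 10 chunks per job, and a ~10-minute (600 s) run, that is about 250 * 10 / 600 ≈ 4 lock attempts per second on average, and the attempts tend to bunch up around chunk boundaries. Dropping to 1 chunk per job would cut the average rate to well under 1 per second.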

crcrewso commented 6 years ago

I've been thinking about this for a bit, and there might be a better rule of thumb than the one @mchamberland mentioned. If we consider an ideal scheduler and a large number of histories per job, then at the end of the first batch each job will try to lock at the same time. There are some logical leaps here, which I could explain, but I think the summary should be sufficient:

Assuming a linear wait time with perfect collisions

This has the advantage that, after the first of the 10 checkpoints, all threads should be staggered so differently that they should not collide again.

Now, if we create a more advanced timing system where we add a randomized broadening term, we could cut down the number of loops of any given thread, but doing so would mean that we'd retain collision risks through the whole run.

I can do a bit of legwork on this if the group thinks that there would be advantages to keeping this algorithm safe, efficient, and unnecessary for users to configure.

Or we could document this bit of code in an appendix.

mchamberland commented 6 years ago

@crcrewso You've put way more thought into this than I have, but I definitely support such an initiative! I can help with testing until roughly mid-August. After that, I'm not sure when I will have access to a cluster with EGSnrc again.

crcrewso commented 6 years ago

Before I write something up, one question. Does anyone know if the lock file stays locked while the results are being written?

mchamberland commented 6 years ago

EDIT: Oops, misread your question. I have no idea and that's a good question. My hunch is no, but @rtownson @blakewalters can chime in.

blakewalters commented 6 years ago

Your hunch is correct, @mchamberland. The .lock file is unlocked while writing results.

crcrewso commented 6 years ago

If that's the case then here is my proposed new algorithm (based mostly on experience and an old locking conversation from years back).

Considering that most cluster storage is higher-latency RAID HDD on a storage server:

  1. Create a new time elapsed variable (te)
  2. try
  3. If locked, wait a random number of seconds between 3 and 15
  4. increment te by that amount of time
  5. try again
  6. repeat 3-5 until either te reaches 120 seconds or success
  7. if te > 120, increment by 30 seconds
  8. Repeat 7 until either success or te > 1200

This should introduce enough time and variance to protect us from issues. Additionally, this would mean that we might not need to keep the exb default wait per job dispatch time so high (right now it's 1 second) on clusters that are lucky enough to have fast dispatching and SSDs.

Thoughts, arguments, holes, worries?

Edit: this algorithm would work better as a while loop than as a nested for loop
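A rough sketch of that while-loop version (hypothetical names, and the same assumed fcntl-style non-blocking lock as the earlier sketch in this thread; not the actual commit):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Non-blocking attempt to take an exclusive lock on the whole file. */
static int try_lock(int fd) {
    struct flock fl = {0};
    fl.l_type   = F_WRLCK;
    fl.l_whence = SEEK_SET;
    return fcntl(fd, F_SETLK, &fl);
}

/* Randomized backoff: wait a random 3-15 s per attempt while the elapsed
 * waiting time te is under 120 s, then 30 s per attempt, giving up once
 * te exceeds 1200 s. */
int lock_with_backoff(int fd) {
    int te = 0;                                   /* time elapsed waiting, seconds */
    while (try_lock(fd) != 0) {
        int wait_s = (te < 120) ? 3 + rand() % 13 /* random 3..15 s */
                                : 30;             /* fixed 30 s steps */
        sleep(wait_s);
        te += wait_s;
        if (te > 1200) return -1;                 /* give up (step 8) */
    }
    return 0;                                     /* lock acquired */
}
```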

mchamberland commented 6 years ago

Sounds like a good starting point to me!

ojalaj commented 6 years ago

Thank you, experts, for the great work. I do not have the skills to contribute to the development part, but I can also do testing on our small cluster (186 cores, normal RAID HDDs, SLURM) if needed.

crcrewso commented 6 years ago

I just created a barely tested change to the locking algorithm. I can't test it on Windows or Slurm right now. If someone could test each of those and comment on my commit page, that would be hugely helpful.

Thank you. --- Edited to remove the bad link and replace it with https://github.com/crcrewso/EGSnrc/commit/247567404d0e5a7aeddc69b0418bbfda359e475a

ojalaj commented 6 years ago

Hi @crcrewso.

I tried to open the link provided, but page not found.

crcrewso commented 6 years ago

Sorry try this

https://github.com/crcrewso/EGSnrc/commit/247567404d0e5a7aeddc69b0418bbfda359e475a

mchamberland commented 6 years ago

@crcrewso Great work! I'll give it a shot sometime today.

ojalaj commented 6 years ago

Hi @crcrewso, we tried your script with Linux/Slurm. After applying the changes to egs_c_utils.c, we recompiled in basically every folder under HEN_HOUSE and EGS_HOME, just to make sure the changes would take effect. Unfortunately, the failure rate did not change.

crcrewso commented 6 years ago

@ojalaj

I am not doubting you; I just have a couple of questions to track down why it didn't change anything (I would actually expect the failure rate to get worse in a certain scenario).

Before recompiling everything, did you rerun the EGSnrc configure script, either the GUI or HEN_HOUSE/scripts/configure?

When submitting with exb, can you confirm that the jobs are actually starting and getting to their first control point? I.e., does the lock file's first number ever drop? Does the second number ever increment above zero?

ojalaj commented 6 years ago

@crcrewso

1) I did not run the configure script (I thought that recompilation would be enough). 2) I need to confirm with my PhD student, but my understanding is that the jobs actually started, i.e. 30 jobs initially, but only something like 2 of them ultimately continued to simulate histories from the lock file. This is similar to behaviour I have experienced in the past, but not so much these days (I'm using a different cluster than my PhD student, who is really suffering from this issue). Tomorrow we can run another test to check/confirm how it goes with the lock file.

crcrewso commented 6 years ago

@ojalaj I only tested the script after a full reconfiguration; I doubt running make in every folder a user thought of would be thorough enough.

On a side note, I have had little experience with jobs successfully restarting. Could you please test from a clean submission, with all lock files and temp files from previous runs removed?

ojalaj commented 6 years ago

@crcrewso Answering to the question above:

Yes, the jobs actually start. The jobs that fail output "lockControlFile: failed to lock file for 12 seconds..." to the .egslog file several times, and then "EGS_JCFControl: failed to rewind the job control file finishSimulation(egs_chamber) -2".

The jobs that don't fail run as they are supposed to, i.e., the first number in the .lock file gradually drops, the second number increases, and finally the simulation ends when all the particles have been simulated.

Also, the tests have been clean submissions (all lock files and temp files from previous runs removed).

We have now run the configure script and tested again with a clean submission (30 jobs, an egs_chamber run to simulate a profile). Now about 60% of the jobs seem to fail, whereas before only a couple of jobs survived. So there seems to be some improvement, but this still needs further testing.

crcrewso commented 6 years ago

@ojalaj I'm going to propose 3 easy changes to your code for testing. These should all be made together.

I've included the full lines for convenience, but it will probably be easier just to type the numerical changes.

1. Replace `else {cycleTime = 30;}` with `else {cycleTime = 15;}`

2. Replace `if (elapsedTime < 120) { cycleTime = 2 + (rand() % 20); }` with `if (elapsedTime < 120) { cycleTime = 1 + (rand() % 16); }`

3. In your batch Slurm file there should be a line like `batch_sleep_time=1`. I know it's painful, but could you set it to 2?

Let me know how that goes after you rerun the configure script

ojalaj commented 6 years ago

@crcrewso

We applied the changes and reran the configuration script. Actually, we had batch_sleep_time=5, so setting it to 2 wasn't painful. Anyhow, we still can't see much improvement...

crcrewso commented 6 years ago

Could I get your input files? Could you try either quadrupling your number of histories or dividing the number of jobs dispatched by 4?

mchamberland commented 6 years ago

@crcrewso Just to mention that I've also been following and testing your instructions to @ojalaj. I'm running another test with 4 times fewer jobs to see if there is any improvement.

Interestingly, I'm running a case where I'm using an egsphant. With 140 x 140 x 83 voxels, most jobs complete properly. When the egsphant is resampled to 280 x 280 x 166, many jobs fail to unlock/lock the lock file.

So, large write operations are making this problem worse? Maybe?

mchamberland commented 6 years ago

@crcrewso Reducing from 250 to 62 jobs, they've all been running for more than 40 minutes and none of them have failed yet.

mchamberland commented 6 years ago

@crcrewso No failure with the reduced job number for this particular case.

crcrewso commented 6 years ago

I think you were running too many jobs. When determining the number of jobs to run, one needs to keep in mind the time it takes to save results.

Let's consider a typical cluster RAID array. It might save at 100 MB/s. If we're saving a 100 MB pardose file, then with latency the lock file will be locked for about 3 seconds.

250 of these would take at least 750 seconds. If there are other latencies, that could easily reach the 1200-second maximum, at which point things start timing out with my code.

Does this make sense?


mchamberland commented 6 years ago

@crcrewso Aaaah! Yes, that makes a lot of sense, actually. Thanks!

ojalaj commented 6 years ago

Hi @crcrewso.

We tried both quadrupling the number of histories and dividing the number of jobs dispatched by 4, but the result is still the same - 30 dispatched jobs and 2 of them end up running till the end. We'll figure out if we can share the input file - if not, we will make a simplified example input file.

I don't know, but could this be cluster specific, i.e., the input file will work on a cluster with not much other traffic and/or faster hardware, but not on others? Our problems are on a Uni cluster with ~1500 cores and a number of other users. However, I'm also using another small cluster with ~200 cores (where I'm the only user at the moment), and there I haven't had these issues (which I did have in the past on the older Uni clusters).

crcrewso commented 6 years ago

Okay, let's try something ridiculous. I have one constant in the while loop with a value of 1200. Try setting it to 7200.


ojalaj commented 6 years ago

I tried changing 1200 --> 7200. I'll let you know tomorrow whether it helped. We have prepared a test input file for you. To which address can we send it?

crcrewso commented 6 years ago

Since it's a test input file, would you have any objection to posting it to https://github.com/crcrewso/EGSnrc/commit/247567404d0e5a7aeddc69b0418bbfda359e475a? Otherwise you can try -removed-.

ojalaj commented 6 years ago

We have been running test simulations on two different clusters - a University cluster with hundreds of cores and a number of other users with various calculation needs, and a dedicated cluster (186 cores) used only for these simulations, with no other users. On the Uni cluster we experience the issue, with a large number of failing jobs, using both the default EGSnrc .lock file control and the changes proposed by @crcrewso.

However, on the dedicated cluster we are able to run all the simulations with no problems using the default EGSnrc .lock file control, so it seems that the issue is somehow related to the HW/SW configuration and/or other traffic on the cluster.

marenaud commented 5 years ago

@ojalaj just for fun, do you know the file systems being used on each cluster? I'm wondering if the lock file issues are due to the use of NFS (or specific NFS settings) to share a common partition across nodes. I'd be really curious to know whether the dedicated cluster where you have no issues is using something other than NFS.

ojalaj commented 5 years ago

This is something I really don't know... I need to check with the admins, or do you have some commands I could use to find out? The local file system (according to df -Th) on the dedicated cluster seems to be xfs.

marenaud commented 5 years ago

One way would be to log onto a compute node and run df to see if you have any NFS-mounted folders (and whether you can identify that you run jobs from there). For example, on our cluster we share /home across all nodes via NFS, so the compute node will have a line in df that says:

controller-node:/home 8719676416 6100080640 2180126720  74% /home

where controller-node is either the hostname or the IP address of the machine that hosts the disk. Another way is to look at /etc/exports to see whether the node that has all the hard drives exports any directories over NFS.

edit: doing df -Th will also reveal whether a folder is mounted as nfs, but you'd have to do it on a compute node, not the host node.

ojalaj commented 5 years ago

I'll ask the details directly from the admins (same admins for both clusters) and then let you know more.

ojalaj commented 5 years ago

This is what I got: both clusters are using NFS for the compute nodes. The only differences that come to mind are that the dedicated cluster with no issues is running Scientific Linux 6 (using plain Ethernet for NFS), whereas the other one runs CentOS 7 (using InfiniBand IPoIB for NFS). Of course, the number of clients and the amount of I/O usage on the dedicated cluster are much smaller.

marenaud commented 5 years ago

Wow interesting. And the dedicated cluster is the one where you have no issues eh. Well, there goes that theory.

ftessier commented 5 years ago

Just a random thought: could the higher-bandwidth IB fabric actually be bottlenecking the NFS server?

There is an NFS setting that is apparently often overlooked, the number of servers (see https://access.redhat.com/solutions/2216 for example). As far as I understand, this sets the maximum number of concurrent connections (implemented as daemon threads).

Keep in mind that a short simulation on a large number of nodes might be requesting many NFS connections all the time. Our experience is that NFS grinds to a halt when the number of requests is beyond what is in fact available.

You can get a sense of the server load with the uptime command.

On our 400-node cluster, we increased the number of NFS "servers" to 64, and that alone solved a lot of problems with NFS lock-ups. We still bring down the NFS daemon from time to time, but not nearly as often as before...

You may want to ping your system administrator about this setting.

marenaud commented 5 years ago

Thanks Fred, that's a nice thing to try! We had the default 8 threads on our cluster, so I'll test 64 and report.

edit: didn't help :(

ojalaj commented 5 years ago

According to the admins, at our end the setting has "always" been 64 and the server load has never been anywhere near full. But thank you anyway, @ftessier! Other suggestions are also welcome!

ojalaj commented 5 years ago

I started an egs_chamber simulation on the cluster (with issues). I'm using the 'develop' branch from Sep 25th and I've changed N_CHUNKS in src/egsnrc.macros to 1 and increased batch_sleep_time in scripts/batch_options.xxx to 10 seconds.

What caught my attention is that even though I've changed N_CHUNKS ("how many chunks do we want to split the parallel run into") and re-compiled under HEN_HOUSE/egs++ and EGS_HOME/egs_chamber, I still get the following in each .egslog file under each egsrun directory:

Fresh simulation of 2000000000 histories

Parallel run with 50 jobs and 10 chunks per job

which I understand to mean that changing N_CHUNKS has not had any effect. Is there something I've done wrong here? Or do I need to re-compile somewhere else?

rtownson commented 5 years ago

@ojalaj in egs++ codes N_CHUNKS is set in the run control input block:

:start run control:
nbatch = 1
nchunk = 1
ncase = etc...
:stop run control: