
Implement multi-threaded submission #29

Closed: GandalfTheWhite2 closed this issue 5 years ago

GandalfTheWhite2 commented 5 years ago

Before using pyHepGrid, I used simple python/perl scripts to run my submissions in parallel - when submitting to ARC sites this speeds up the submission by a factor of ~5 at least. Essentially, I did this by first writing all the necessary input files and xrsl job descriptions to a temporary local directory, and then running the submission in ~5 parallel threads per CE used. It just requires unique names for each input and xrsl job description file. I suggest implementing this in pyHepGrid, at least for ARC. Python has good support for threads. Please let me know if this would be considered useful (it may need some significant code reorganisation, and I don't want to start if it has no chance of being merged).

marianheil commented 5 years ago

This could actually be useful. The current submission is rather slow: submitting 1k jobs takes ~25 minutes. I'm not sure how much speedup is actually possible with parallel submission.

The implementation itself shouldn't be too hard; it is all in run_wrap_production and run_wrap_warmup of runArcjob.py. There xrslfile is already a unique filename, so maybe one could just make xrslfile a list and use a starmap for _run_XRSL (this would replace the loop over the seeds/sockets).
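A minimal sketch of that starmap idea, assuming a _run_XRSL(xrslfile, ce) helper roughly like the one in runArcjob.py; the stand-in below just shells out to arcsub, and all names are illustrative rather than the actual pyHepGrid code:

import subprocess
from multiprocessing.pool import ThreadPool

def _run_XRSL(xrslfile, ce):
    # Illustrative stand-in for pyHepGrid's _run_XRSL: submit one xrsl
    # job description to the given CE and return arcsub's result.
    result = subprocess.run(["arcsub", "-c", ce, xrslfile],
                            capture_output=True, text=True)
    return xrslfile, result.returncode, result.stdout.strip()

def submit_all(xrslfiles, ce, n_threads=5):
    # starmap unpacks each (xrslfile, ce) tuple into _run_XRSL's
    # arguments, running n_threads arcsub calls concurrently.
    with ThreadPool(n_threads) as pool:
        return pool.starmap(_run_XRSL, [(f, ce) for f in xrslfiles])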

scarlehoff commented 5 years ago

I think it should be an option that can be turned on and off. I seem to remember @DWalker487 got in trouble once for spamming the CEs (where trouble means the IT guy sends you an email), but we have gotten in trouble for so many different reasons that I might be misremembering...

That said, even if it is something you can only use on the weekends when nobody's looking, it is still a nice feature to have.

GandalfTheWhite2 commented 5 years ago

This wouldn't spam the CEs; they are built to withstand submission from multiple sources or users simultaneously. They have the capacity to handle job submissions far faster than a single thread allows - I don't know why ARC is built like that, but it is. This will not create problems for the CEs, as long as the overall job count is kept below ~10-20k, including finished jobs (which are kept on the CEs for a week, or until they are 'arc cleaned').

GandalfTheWhite2 commented 5 years ago

I don't know about the list and starmaps, but I presume the parallel submission is most important for the production runs. The submission takes roughly 1 sec per job, and this stays constant up to ~5-10 threads per CE, so to be smart about it one would submit to both ce1 and ce2, in parallel. We plan to upgrade ce2 in the next few months; currently, ce2 is much slower than ce1 (it's slower hardware, and it also has to support ce3 and ce4, which run ATLAS jobs).

scarlehoff commented 5 years ago

This wouldn't spam the CEs; they are built to withstand submission from multiple sources or users simultaneously.

We have managed to test the limits of the various grid infrastructures in the past. That's why I was mentioning @DWalker487, in case he had tried; he lived closer to the edge of the law than I did :P

In any case, as @marianheil mentioned, parallelizing run_wrap_production should be a safe change to do and a very nice feature to have.

I'm surprised that Arc still doesn't allow sending jobs in batches like Dirac does, though; it should be a basic feature to have.

GandalfTheWhite2 commented 5 years ago

That would be too useful. Can't have that.

DWalker487 commented 5 years ago

I've monitored arcsub before and it is multithreaded internally, which may have been the cause of issues. I think our arc submission might have been multithreaded at one point, but with a default of 15 threads, which could have been too much on top of the internal multithreading and led to a slowdown. I've multithreaded it now, defaulting to 10, if people are happy to test it. It seems to be a bit quicker.

DWalker487 commented 5 years ago

See PR https://github.com/scarlehoff/pyHepGrid/pull/31

DWalker487 commented 5 years ago

Looking at comments in the code, I think this actually wasn't implemented due to issues with locking the .arc/jobs.dat database file. I think it was one of those issues that just disappeared one day (possibly with an arc upgrade), or that Juan had and I didn't for some inexplicable reason...

marianheil commented 5 years ago

The database locking is still a problem: I just tested 4ca82e9eece75976dbee9c61be7ab91a4034458e with 5k dummy jobs on 30 cores and I get (once):

ERROR: One or multiple job descriptions was not submitted.
Warning: Unable to open job list file (/mt/home/mheil/projects/Wjets/.arc/jobs.dat), unknown format
         To recover missing jobs, run arcsync

So occasionally we would lose a job from the arc database. This might be worse if the disk is running slow.

We could instead just call arcsub with multiple files at once and let arcsub itself use multiple cores. Alternatively, we could use a temporary jobs.dat for each core and "merge" them afterwards (somehow?).
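For the first option, a rough sketch of what the batching could look like; arcsub accepts several job description files in one invocation, so only one process ever writes to jobs.dat (the batch size here is arbitrary):

import subprocess

def submit_batched(xrslfiles, ce, batch_size=50):
    # One arcsub process per batch of xrsl files: arcsub can then
    # parallelise internally while jobs.dat only has a single writer.
    for i in range(0, len(xrslfiles), batch_size):
        batch = xrslfiles[i:i + batch_size]
        subprocess.run(["arcsub", "-c", ce, *batch], check=True)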

scarlehoff commented 5 years ago

If arcsub is doing the multithreading by itself, it should be the preferred method.

Otherwise, having different .dat files is not an issue for pyHepGrid (you just have to keep track of them and merge them afterwards, as @marianheil suggests).

Then there's the fact that it only happened once in 5k jobs, so we might brute-force it and just resubmit the ones that failed?

DWalker487 commented 5 years ago

A couple of thoughts that I've relayed to Marian but should probably note here as well:

We could even monitor for the issue and immediately flag the seed as failed (null value in jobid).

I'm also sceptical that arc actually multithreads properly and doesn't just submit sequentially... This can be tested though, which I'll try and do now.

scarlehoff commented 5 years ago

However, I would say that the fact there is a lock means that one of these statements (or both) are not true:

1 - Multithreading works
2 - The lock works

Unless the lock only happens for a small portion of the submission time, in which case my question is: is the arc code public? Should we consider it a "them" problem?

DWalker487 commented 5 years ago

Could just be a filesystem lock rather than an arc lock, which arc doesn't know how to handle?

Unless the lock only happens for a small portion of the submission time, in which case my question is: is the arc code public? Should we consider it a "them" problem?

The lock only applies for a very small period of time (from Marian's tests, only 1 lock issue in 5k jobs on 30 cores implies this to me). It's hard to tell from my tests whether arc really multithreads, but looking at htop, I only have one thread with any CPU usage, with the others having a minimal footprint. I also never get two job ids printed at the same time; they always come at a steady rate, which implies only one is submitting at once?

DWalker487 commented 5 years ago

Is the arc code public? Should we consider it a "them" problem?

Arc source is here: https://source.coderefinery.org/nordugrid/arc

arcsub is here: https://source.coderefinery.org/nordugrid/arc/blob/master/src/clients/compute/arcsub.cpp

I'm not a C++ master, but it doesn't look internally multithreaded to me, given the for loop...

scarlehoff commented 5 years ago

They have a lock system in place for thread management (look for Thread.h). The thread management is file-system dependent (it always is, of course), so the locking applies even if the two instances of arc are separate processes.

Disclaimer: I don't know which "arc submission plugin" we normally use, so I cannot (and don't want to) follow exactly how it happens, and I cannot test it to play around. The submission happens several levels below arcsub.cpp anyway, and it is not pretty.

It seems they do a lot of initialization before they ever have to deal with the fact that two instances might be running at the same time. This would explain why @GandalfTheWhite2 sees an improvement: even if no two submissions happen at once, there would be a bunch of other things happening in parallel.

So the threads themselves, be they arc-generated or pyHepGrid-generated, will abide by the lock when they can, which explains why @marianheil sees only one error in many thousands of jobs - it looks like a collision caused by the filesystem. So I now believe the following that @DWalker487 wrote is true:

Putting jobs.dat on the scratch will probably help this due to r/w speed
Keeping the core count lower (10-15) will also help a lot

and that

We could even monitor for the issue and immediately flag the seed as failed (null value in jobid).

I'm also sceptical that arc actually multithreads properly and doesn't just submit sequentially... This can be tested though, which I'll try and do now.

is a very good compromise solution.

PS: C++ code written by people who think that making it convoluted is a good thing, where one of the first comments I read is about ensuring compatibility with Java, sounds like a nightmare.

PS2: maybe everything I said is submission-plugin dependent and therefore wrong. My guess is that it is correct for gridftp, but who knows.

DWalker487 commented 5 years ago

Thanks for doing the dive, Juan; it doesn't sound like much fun... At least the way we're doing things currently isn't so bad, and as I understand it it's really a filesystem problem more than anything.

We could even monitor for the issue and immediately flag the seed as failed (null value in jobid).

I'm also sceptical that arc actually multithreads properly and doesn't just submit sequentially... This can be tested though, which I'll try and do now.

is a very good compromise solution.

I've implemented this using the arcsub return code (see https://github.com/scarlehoff/pyHepGrid/commit/9acb258fdadcb012dcf4683b6c2106b0a742a0b8). However, I've not been able to test that it works in case of failures, as I've not replicated the issue (Marian is probably the one to do this).

Putting jobs.dat on the scratch will probably help this due to r/w speed
Keeping the core count lower (10-15) will also help a lot

I'm not sure how this can be enforced other than as good practice (the core count could be hard-coded and not exposed in the header, but I'm instinctively against that). I wouldn't want to force the location of jobs.dat; maybe a warning/note in the README?

scarlehoff commented 5 years ago

I'm not sure how this can be enforced other than as good practice (the core count could be hard-coded and not exposed in the header, but I'm instinctively against that). I wouldn't want to force the location of jobs.dat; maybe a warning/note in the README?

I would say good practice should do it. I mean, using too many cores on a system with many users is impolite anyway...

DWalker487 commented 5 years ago

And it does give us someone else to blame if things go wrong...

I've just exposed the arc submission threads in the user header (https://github.com/scarlehoff/pyHepGrid/pull/31/commits/cec30ca166430f8d5cb102e4b9d1a07b6fbc7f39). Now people can control the behaviour themselves, even setting it to 1 to avoid the problem altogether.

marianheil commented 5 years ago

The "failed" jobs from arc are correctly submitted but they are not correctly stored in the database. So just setting jobs with non zero return code from arcsub to failed, as in 9acb258fdadcb012dcf4683b6c2106b0a742a0b8, isn't correct. The important part here is: To recover missing jobs, run arcsync. So the jobs are out there, running, living a happy life; we just don't know about them.

This could be problematic if one tries to resubmit all failed jobs, because some seeds would then run twice. We should rather store them as "unknown status" or similar.

scarlehoff commented 5 years ago

Oh, I misunderstood the message then. If you run arcsync, do you recover them? Because if so, just submitting them blindly and running arcsync at the end of any multithreaded submission should do...

marianheil commented 5 years ago

Yes, if you want to get any information from the job, arcsync works. However, as far as I understand the sql database, pyHepGrid stores which job id belongs to which run, which is a little bit harder to recover. I wouldn't bother too much; just save the failed submissions under a different status.

GandalfTheWhite2 commented 5 years ago

Hi again. Apologies, I missed the start of the discussion on failed job submissions. In my original submission scripts, I would check the return code of arcsub, and if it was different from 0 I would immediately submit again (up to 10 times for each job). Only a successful submission should then return a job number, which can be saved in the DB? Or what is the problem being discussed?
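For reference, a sketch of that retry loop (the function and parameter names are illustrative, not from the original scripts):

import subprocess

def submit_with_retry(xrslfile, ce, max_tries=10):
    # Re-run arcsub on a non-zero return code, up to max_tries per job,
    # and hand back whatever arcsub printed on a successful submission
    # (which includes the gsiftp:// job string).
    for _ in range(max_tries):
        result = subprocess.run(["arcsub", "-c", ce, xrslfile],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()
    raise RuntimeError(f"{xrslfile}: submission failed {max_tries} times")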

DWalker487 commented 5 years ago

As I understand it, I think the question boils down to: does a non-zero arcsub return code mean the job was never submitted,

or

was it submitted but simply not recorded in jobs.dat?

In which case we have a second question: can we reliably recover such jobs (e.g. with arcsync), or is it safe to just resubmit the seed?

I think this requires testing to determine for sure (or we decide that this 1-in-5k-jobs edge case isn't worth the hassle and we ignore the issue...) @marianheil. Maybe one could force it to occur more frequently by having a bash script keep touching the jobs.dat file, or somehow better force a filesystem lock.

I'd also note that arcsync takes a long, long time for me (even with different jobs files for each set of runs), and would likely negate the benefits of multithreading substantially. One could set up an automated clean beforehand to mitigate this, but that breaks the stats reporting in pyHepGrid if statuses aren't checked regularly and cached for completed jobs.

GandalfTheWhite2 commented 5 years ago

The run-time of arcsync is related to the number of your jobs in the ARC DB and on the CE. The job info is removed from the CE after ~1 week in order to restrict the DB size. You can reduce the number of jobs assigned to you by ensuring any jobs you download (arcget) are removed from the CE (arcclean gsiftp:// or 'arcclean --status=FINISHED -c ce1.dur.scotgrid.ac.uk' after they have been downloaded; the latter might be dangerous if jobs finish between checking and executing the command, but before they are downloaded). If your local arc db isn't cleaned regularly, then arcsync might also keep rechecking long-finished job IDs. 'arcsync -f' overwrites the local file with the job list available on the CE. arcsync on a reasonable setup shouldn't take more than a few seconds.

One shouldn't run 'arcsync -f' shortly after submitting jobs, since the newly submitted jobs will not appear in the presented results until after ~5 minutes. (The arc DB presented to the interface is only updated every ~5 minutes to limit the load on the real DB.)
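A small wrapper around the arcclean invocation above, as one might schedule it (e.g. from cron) once finished jobs have been downloaded; the CE name is just the example from this thread:

import subprocess

def clean_finished(ce="ce1.dur.scotgrid.ac.uk"):
    # Remove finished jobs from the CE (and the local job list) so that
    # arcsync has fewer entries to walk through. Only safe once their
    # output has already been retrieved with arcget.
    subprocess.run(["arcclean", "--status=FINISHED", "-c", ce], check=True)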

jcwhitehead commented 5 years ago

I think I've caught up with this thread. I've occasionally had jobs.dat database files grow over time and become corrupted. It might just be that ongoing database sanitisation would fix both that and the long arcsync times.

Do we ever clean jobs from our local database that have been purged from the CE (i.e. that are a week old)? If not, that would be an easy way of keeping our local database (relatively) tidy, with no risk of removing useful data.

The 'ARC' solution is to use arcget to retrieve job output, which automatically removes the job from both the local and CE databases once its output has been copied successfully. I don't think we would want to replicate that in our finalise routines, but the ARC infrastructure presumably assumes that completed jobs are removed from the databases fairly promptly.

GandalfTheWhite2 commented 5 years ago

Ah, yes, arcget removes the jobID, unless --keep is used.

scarlehoff commented 5 years ago

My solution to oversized databases was to create a routine of monthly cleaning (the --clean option will deal with that on a job-by-job basis).

marianheil commented 5 years ago

Let's get back to this: I think we can accept @DWalker487's merge request. The only thing I would change is statuses = [self.cUNK if i != "None" else self.cFAIL for i in joblist] in line 267 of src/pyHepGrid/src/runArcjob.py.

It shouldn't be cFAIL but either cUNK or its own status, cUNKSUB. cFAIL means that the job itself failed, which is not the case here; only the saving to the arc database failed. The job is running correctly. We can even recover the job with arcsync (something for another issue). The problem with cFAIL is that one could resubmit the same seed again, even if the job succeeds.
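A self-contained sketch of the proposed mapping; the real status constants live on pyHepGrid's Backend class, so the stand-ins here are purely illustrative:

# Stand-ins for Backend.cUNK and the proposed Backend.cUNKSUB constants.
cUNK, cUNKSUB = "unk", "unksub"

def map_statuses(joblist):
    # A "None" job id means arcsub returned non-zero: the job may well be
    # running, it just never made it into jobs.dat, so it gets cUNKSUB
    # ("unknown submission") rather than a failure status.
    return [cUNK if jobid != "None" else cUNKSUB for jobid in joblist]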

DWalker487 commented 5 years ago

Just a follow-up: the above change from Marian would also require a change to the top of the Backend class to add cUNKSUB (so filtering doesn't break). The only other thing might be to update the status printing to include this new option.

GandalfTheWhite2 commented 5 years ago

I just wanted to add that, now that I'm using the parallel submission, it has become clear that it is indeed too fast when submitting to just one CE. It was submitting so fast, in fact, that sometimes the entropy on the CE wasn't enough to update the job string (gsiftp://....) between two submissions, so that two jobs got the same string, resulting in some of the jobs failing. The solution seems to be (?) to use both ce1 and ce2 - maybe because the mechanism of choosing between them requires a random number, which introduces a delay and adds entropy...

marianheil commented 5 years ago

Does this need more work, or is it OK when setting split_dur_ce=True? Also, how many threads did you submit with at once? Maybe we should set arc_submit_threads lower.

GandalfTheWhite2 commented 5 years ago

I think under normal circumstances it should be OK with split_dur_ce=True and arc_submit_threads set to 5 (not 10). That's what I just used, and it submits 1200 jobs in under 8 minutes.
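For anyone reading along, an example of what that looks like in the user header; the option names come from this thread, and the values are this comment's recommendation rather than project defaults:

# Submission settings in the pyHepGrid user header (per the advice above).
split_dur_ce = True      # alternate submissions between ce1 and ce2
arc_submit_threads = 5   # parallel arcsub calls; 5 rather than 10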