grisu / gricli

Grisu commandline client

Gricli stalling in job submission due to too many job refreshes #114

Closed vladimir-mencl-eresearch closed 13 years ago

vladimir-mencl-eresearch commented 13 years ago

Hi,

Gricli got stuck when resubmitting the last 100 jobs (out of 2000) that failed to submit (#113), after only about 20 of them.

And watching gricli.debug, I could see LOTS of lines like

3267540 [Thread-139] DEBUG grisu.frontend.control.jobMonitoring.RunningJobManager  - Refreshing job: R-bubb-pi075-1634
3267954 [Thread-141] DEBUG grisu.frontend.control.jobMonitoring.RunningJobManager  - Refreshing job: R-bubb-pi075-1306
3267981 [Thread-140] DEBUG grisu.frontend.control.jobMonitoring.RunningJobManager  - Refreshing job: R-bubb-pi075-1204
3268040 [Thread-139] DEBUG grisu.frontend.control.jobMonitoring.RunningJobManager  - Refreshing job: R-bubb-pi075-1633

It appears Gricli is so busy refreshing job status (on the 2000 jobs it has queued up) that it can't do any useful work on submitting jobs....
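
To put a number on how much refreshing goes on, lines like the ones above can be tallied with a short script. This is only a sketch: the regular expression is derived from the four sample lines quoted here, `refresh_stats` is a made-up helper name, and nothing in it uses Grisu or Gricli code.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches lines such as:
# 3267540 [Thread-139] DEBUG ...RunningJobManager  - Refreshing job: R-bubb-pi075-1634
REFRESH = re.compile(r"^(\d+) \[([^\]]+)\] DEBUG \S+\s+- Refreshing job: (\S+)")


def refresh_stats(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count 'Refreshing job' messages per thread and per job name."""
    per_thread: Counter = Counter()
    per_job: Counter = Counter()
    for line in lines:
        match = REFRESH.match(line.strip())
        if match:
            # The leading number is ignored; only thread and job are tallied.
            _leading_number, thread, job = match.groups()
            per_thread[thread] += 1
            per_job[job] += 1
    return per_thread, per_job


sample = [
    "3267540 [Thread-139] DEBUG grisu.frontend.control.jobMonitoring.RunningJobManager  - Refreshing job: R-bubb-pi075-1634",
    "3267954 [Thread-141] DEBUG grisu.frontend.control.jobMonitoring.RunningJobManager  - Refreshing job: R-bubb-pi075-1306",
    "3267981 [Thread-140] DEBUG grisu.frontend.control.jobMonitoring.RunningJobManager  - Refreshing job: R-bubb-pi075-1204",
    "3268040 [Thread-139] DEBUG grisu.frontend.control.jobMonitoring.RunningJobManager  - Refreshing job: R-bubb-pi075-1633",
]
per_thread, per_job = refresh_stats(sample)
print(per_thread.most_common())
```

For a real log, the same function can be fed an open file, e.g. `refresh_stats(open("gricli.debug"))`, assuming the debug log sits in the current directory under that name.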

yuriyh commented 13 years ago

hmm. but gricli does not check job status automatically - only when you request it to with "print jobs". Do you have multiple sessions running?

vladimir-mencl-eresearch commented 13 years ago

No, I don't .... and I'm getting those messages scrolling in my local gricli.debug. So somehow the client decides to refresh the status of all jobs ... and it really slows down the job submission as time progresses.

It's actually quite bad now - I'm submitting the last 100 jobs in batches of 10-15, as Gricli slows down to a halt after that.

makkus commented 13 years ago

Not sure how much we can do with regard to batch jobs in the near future, let's discuss here:

https://github.com/grisu/grisu/issues/21#issuecomment-1686316

yuriyh commented 13 years ago

Markus, can you please confirm that grisu may query job status without user request?

makkus commented 13 years ago

Yep, it did. It was using the same class/methods that are used in the template client, since those existed already and allowed for background checking of current jobs, and thus getting a list of jobs and their properties quicker.

But that is obviously not working for 1000s of jobs. Will re-evaluate. But we really need to have some kind of line regarding how many jobs should be submitted with Grisu/Gricli. We just can't support 1000s....

vladimir-mencl-eresearch commented 13 years ago

Hi Markus, what exactly do you mean by "we can't support 1000s"?

If I have a use case that requires submitting 1000s of jobs, I thought GriCli would be exactly the tool to use.

Or should I be using batch jobs for that? Are they in a ready to use state now?

yuriyh commented 13 years ago

Let's just say that out of the existing grid tools, gricli is best suited for submitting 1000s of jobs to BeSTGRID. It doesn't mean that this use case is supported :) Batch jobs are meant to be the abstraction to use when there are many jobs, but they are very much a work in progress. I hope we may have prototype gricli support in the September milestone.

So, right now you are stuck with gricli and plain jobs.

makkus commented 13 years ago

Hi Vlad, did you read the link I pointed out above: https://github.com/grisu/grisu/issues/21#issuecomment-1686316 ?

Don't want to repeat everything here again, but in short: batch-jobs are a lot different to single jobs in terms of handling them with a submission tool. And a lot more fragile. So they need to be handled differently, and that's what I tried with the special BatchJob support in the grisu client library. If Grisu has to check the status for, say, 2000 jobs, then you'll see the behavior you describe in this ticket, and there's no fix. Even using the Grisu batch-job support probably won't go smoothly. For one, because it's only in an alpha stage at the moment, and also because the problem might just be too hard and we might need to look for other tools for this class of jobs. That's what I mean when I say "can't support".

Also, batchJob support is not even in our priority-list yet, there are other (equally hard) issues in the queue before we can spend some time on it. :-(

One comment about "best suited": I kept saying that quite a bit and I'll probably stop soon now because nobody ever seems to even comment on it, but I think the Jython client is better suited for this kind of problem. At least for somebody with a bit of programming/scripting experience. Since the Jython client is only a very thin layer on top of the client library, it will always be more stable than Gricli (which also sits on top of the client library -- but it's thicker) just because there's less code (Jython gets maintained externally and we can just rely on it being stable). You have more direct access to the client library's methods, so you are not as restricted as you are when using gricli, and you get stuff like for-loops, if-then-else constructs, all the Python goodness. Fair enough if one points out that you can't use Python modules that are written in C with Jython, but that doesn't take away the stuff you actually get (which is quite a lot and should be enough to handle quite complex workflows).

Because of all those reasons I think it should be a tool in the BeSTGRID Grid-Toolbox, just like Gricli. Maybe it shouldn't be that prominent and only be recommended for somewhat more advanced users, but I really think it's worth supporting. Especially since we basically get it for free: all the improvements we make to the client library because of our work on Gricli will be in Jython automatically... Aaron used Jython for some batch jobs. I think he found it's not perfect but at least somewhat usable. And if we want to improve batch support, the Jython client would be a good way to test improvements right away, imho.

makkus commented 13 years ago

BTW, another option to get around this is to package up groups of, say, 50 jobs into a shell-script job wrapper. That way Grisu only has to deal with 40 jobs and not 2000. Obviously, this might not be suited for your exact use case, but it's worth keeping in mind. I've had good experience with that. As an additional plus, it makes the overhead of submitting a job smaller in comparison to the job walltime. If your job is only 5 minutes, this is a good thing. Not, of course, if your job is 12 hours... As I said, not for everything, but worth keeping in mind.
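
A minimal sketch of that wrapper idea in Python, assuming each job boils down to a single shell command line. The helper names `chunk` and `wrapper_script` and the `./run_sim` command are hypothetical and not part of Grisu or Gricli; actually submitting the generated scripts is left out.

```python
import shlex
from typing import Iterator, List


def chunk(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` entries each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def wrapper_script(commands: List[str]) -> str:
    """Build one shell script that runs the given commands one after another."""
    lines = ["#!/bin/sh"]
    for command in commands:
        # Report a failing command on stderr, then carry on with the rest.
        message = shlex.quote("FAILED: " + command)
        lines.append(f"{command} || echo {message} >&2")
    return "\n".join(lines) + "\n"


# Hypothetical workload: 2000 independent runs of a made-up program.
commands = [f"./run_sim --input case_{i}.dat" for i in range(2000)]
scripts = [wrapper_script(group) for group in chunk(commands, 50)]
print(len(scripts))  # 40 wrapper jobs instead of 2000 single jobs
```

The commands inside one wrapper run sequentially, so the walltime requested for a wrapper job has to cover all 50 of them.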

makkus commented 13 years ago

Since this is not the only batch-job related problem: what do we do in regards to BatchJobs? I'd like to have someone tell me what our strategy is. I can continue to work on and improve the current Grisu BatchJob support. Or we just have to tell users we unfortunately can't support those kinds of jobs at the moment. Or?

makkus commented 13 years ago

Closing this since we need to have a discussion about batch jobs and what we support before we know what category/size of batch jobs we want to/can officially support. Performance and scale of job submissions should be improved in the next Grisu release (due tomorrow), though...