This is the affected line:
To me the best solution is to use a Groovy template; otherwise we can use a bash script to create the template (example: https://github.com/ssadedin/bpipe/blob/master/bin/bpipe-torque.sh#L170-L171).
Please try this version of the SGE command executor:
https://github.com/tucano/bpipe
(master branch)
Example of cmd.sh:
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N "hello"
#$ -terse
#$ -V
#$ -o .bpipe/commandtmp/1/cmd.out
#$ -e .bpipe/commandtmp/1/cmd.err
#$ -l slots=1
echo "Hello World";
Please note that in this driver you can specify additional options using bpipe.config:
See:
To add options to the qsub command "inline" (not in cmd.sh), use the bpipe.config key sge_request_options
Example:
executor="sge"
sge_request_options="-q queue_name"
This appends the queue option to the qsub request:
qsub -q queue_name
see also: https://github.com/tucano/bpipe/blob/master/src/main/groovy/bpipe/executor/SgeCommandExecutor.groovy#L109-L111
Cheers
I think the fundamental problem is that there is no separation of concerns if you simply append to an SGE script. Bpipe has no way of understanding that an actual script might exit early (for good reasons); whether that exit is successful or not, Bpipe just won't know about it.
Please don't force #$ -V on users. It will break things in certain environments...
-V is essential for job scripts to inherit a functional environment, particularly when you use something like the "environment-modules" system on most clusters.
I think we are digressing :) as this is not the actual topic. Rather, the point is that exiting early in scripts will cause Bpipe to wait forever on the cmd.exit file, because it never gets created (and it is my fault, as I started the -V thing).
Regarding -V: I beg to differ.
At our site we advise users to have their scripts "self contained", meaning to actually use module [purge|load|...] and not to rely on implicit environment variables. And I know at least 2 other sites (outside the organization I'm currently working for) that try to do the same. It also helps to have more reproducible results, as there is less questioning in terms of:
OK so you used tool X but:
- which version,
- which compiler and
- which MPI libraries?
It is much easier to reason about the behaviour/results if it is done explicitly. The thing is: I don't know how to deactivate -V if it is in the default template; as far as I know one can only activate it, not deactivate it...
Also, -V with SGE + bash + defined functions will trigger an old bug (ca. 2009), because the environment file of SGE is in a "KEY=VALUE\n" format (one entry per line) while bash functions are exported in this form:
BASH_FUNC_module()=() { eval `/usr/bin/modulecmd bash $*`
}
The problem is that on the execution host bash tries to set a variable to this (yes the closing brace is missing):
BASH_FUNC_module()=() { eval `/usr/bin/modulecmd bash $*`
It doesn't help to declare the function in the bash shorthand form (mind the semicolon before the closing curly brace):
module() { eval `/usr/bin/modulecmd bash $*`; }
The way it is exported in the environment still contains the newline. This leads to messages like:
-bash: module: line 1: syntax error: unexpected end of file
-bash: error importing function definition for `BASH_FUNC_module'
Depending on the configuration of SGE, the module function might still be available, because after bash shows the error it may load the initialization files if bash has been added to the login_shells setting of SGE. Then it will at least have somewhat deterministic behaviour: it shows the error every time and later "overloads" the module function with whatever comes from the initialization files. Without bash in login_shells of SGE, it depends on how different users submit different scripts. Some might have customized their shells enough to always get a login shell and some might not have done that. This makes it hard to exchange scripts among users.
These 2 seem to be related:
Agree, #142 is related to your report
I just saw this:
And I think, since Bpipe is waiting for the job to finish anyway, why not use the "-sync y" option? This way qsub will stay in the foreground and actually report back the return value from the original script.
There might be some logic required to do that with stuff like Bpipe.run { deadLock * 50 }. But it seems that all that is needed is to watch for the qsubs to come back?
(Looked at the code for the first time today so forgive me if that idea is way off).
Wow -sync y seems like an ideal option for bpipe to handle:
-sync y[es]|n[o]
Available for qsub.
-sync y causes qsub to wait for the job to complete before exiting. If the job completes successfully, qsub's exit code will be that of the completed job. If the job fails to complete successfully, qsub will print out an error message indicating why the job failed and will have an exit code of 1. If qsub is interrupted, e.g. with CTRL-C, before the job completes, the job will be canceled.
With the -sync n option, qsub will exit with an exit code of 0 as soon as the job is submitted successfully. -sync n is default for qsub.
If -sync y is used in conjunction with -now y, qsub will behave as though only -now y were given until the job has been successfully scheduled, after which time qsub will behave as though only -sync y were given.
If -sync y is used in conjunction with -t n[-m[:i]], qsub will wait for all the job's tasks to complete before exiting. If all the job's tasks complete successfully, qsub's exit code will be that of the first completed job task with a non-zero exit code, or 0 if all job tasks exited with an exit code of 0. If any of the job's tasks fail to complete successfully, qsub will print out an error message indicating why the job task(s) failed and will have an exit code of 1. If qsub is interrupted, e.g. with CTRL-C, before the job completes, all of the job's tasks will be canceled.
This would also resolve the issue I've run into that jobs submitted by bpipe don't get cancelled when I stop bpipe.
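Concretely, the submission step could then be as simple as this sketch (paths follow the cmd.sh example above; writing cmd.exit ourselves is an assumption about where Bpipe looks for the exit code):
# Submit and block: with -sync y, qsub only returns once the job has
# finished, and qsub's exit status is the job's exit status.
qsub -sync y .bpipe/commandtmp/1/cmd.sh
# Record the exit code in the file Bpipe polls for.
echo $? > .bpipe/commandtmp/1/cmd.exit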
So if we get "job template scripts" as suggested in #151 this would basically just solve all 4 issues... (Trusting that http://superuser.com/a/78470 actually works for LSF -- lsf_request_options seems to be enough for LSF. Mind you, I have never touched or seen LSF so this is just an assumption!)
- -V would just be gone as a discussion point
- queue in bpipe.config could be handled by sge_request_options
- no need for an additional parameter that might even be conflicting if sge_request_options="-q this" and queue="that"
Am I missing something?
I have a recommendation for how to handle the "job template scripts": if we make it a Groovy variable, you could write your own custom one in bpipe.config. This would eliminate the need for adding a new command-line option or bundling a new file.
Thanks for all the discussion on this. I have been following along but unfortunately since I don't work with an SGE cluster currently I haven't been able to contribute much.
It sounds like the current consensus is that implementing the script for SGE to run as a template (and perhaps more generally, support templates for the command executors) will address the issue?
My initial reaction to this bug was that Bpipe should place the command to be executed in a subshell, so that a premature exit by the user script will only terminate the subshell and not prematurely end the SGE script. Obviously there are some wider issues that are not addressed by that, so the template comes into the picture too. However currently I'm not clear on what should be in the default template: would the default template simply implement the subshell idea (or something equivalent)? Or is it preferable that the user script be completely externalised (say, written to a temporary file, to be executed from within the template script)?
Happy to implement this if I can understand it well enough, or to accept a patch from either of you if you're up for it.
So, I don't have any idea what you mean by subshell. From my perspective, a batch system takes a command exactly as though it would've been run locally and submits it to a cluster, with the possible addition of the "job options" to hint at/supply resource usage.
As I see it, there are only two valid ways to manage batch jobs:
- qsub -sync y
- the "wait for return code" method (hopefully most batch schedulers have such a feature)
The current method bpipe uses allows for jobs to disappear from the batch system for a number of reasons, and bpipe will wait forever for those jobs to finish.
From my perspective, I think that the introduction of job templates could significantly simplify the support for a number of batch systems; in most cases only a few things are needed to support a batch system.
I would envision a job script template to consist of a "sensible" set of defaults for the given batch system; it could use variable expansion to stick in things like the memory requirements, number of processors, job name etc.:
#!/bin/sh
#\$ -cwd
#\$ -l virtual_free=${config.memory}
#\$ -N \"$name\"
#\$ -terse
#\$ -o $jobDir/$CMD_OUT_FILENAME
#\$ -e $jobDir/$CMD_ERR_FILENAME
#\$ -l h_rt=${config.walltime}
${additional_options}
${cmd}
This would allow users to also do a wholesale replacement of the template with their own, allowing things like replacing virtual_free with h_vmem (as my cluster does) while still using the "memory" specification. Users of course would have to supply certain "minimal" configurations for bpipe to function.
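For example, a bpipe.config feeding such a template might look like this (executor and sge_request_options are the keys shown earlier in this thread; the memory, walltime and queue values are purely illustrative):
executor="sge"
memory="4g"
walltime="08:00:00"
sge_request_options="-q main.q"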
I don't have experience writing java/groovy, but if you can point me to some useful introduction I can attempt a proof of concept. I'm also happy to test any code anyone else supplies on my Son of Grid Engine 8.1.8 system.
I agree with @gdevenyi.
Here's my proposal for SGE defaults:
# because this is required with the proposed method and shouldn't be accidentally removed
sge_request_options="-sync y -o $jobDir/$CMD_OUT_FILENAME -e $jobDir/$CMD_ERR_FILENAME"
#\$ -cwd
#\$ -N \"$name\"
${cmd}
2 things keep me from creating a pull request:
What really confuses me are the differences between e.g. an LSF/SGE executor and a Torque/PBS executor. It seems to me that this adds unnecessarily complex code to bpipe itself, as it handles 2 different codepaths for basically the same use case. But to make some progress, I think the current issue can be fixed with the template proposal, and a discussion about the different ways executors are run should take place later (e.g. I don't really see a reason why there's a difference between a local command executor, an ssh command executor and a batch system command executor).
Further research followup:
For the LSF bsub command, the equivalent to -sync y is:
-K Submit a batch job and wait for the job to complete. In case the job needs to be rerun due to transient failures, the command will return after the job finishes. The bsub command returns the same value as the job upon completion. The bsub command exits with value 126 if the job was terminated while pending.
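So the LSF equivalent of a blocking SGE submission would be roughly this sketch (bsub reads the job script on stdin; per the excerpt above, -K makes it return the job's exit value):
# Block until the job completes; bsub -K exits with the job's value.
bsub -K < .bpipe/commandtmp/1/cmd.sh
echo $? > .bpipe/commandtmp/1/cmd.exit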
According to the PBS Pro user guide, you can use -W block=true to make qsub wait until the job completes:
http://resources.altair.com/pbs/documentation/support/PBSProUserGuide12.1.pdf
7.7 Making qsub Wait Until Job Ends
Normally, when you submit a job, the qsub command exits after returning the ID of the new job. You can use the "-W block=true" option to qsub to specify that you want qsub to "block", meaning wait for the job to complete and report the exit value of the job.
If your job is successfully submitted, qsub blocks until the job terminates or an error occurs. If job submission fails, no special processing takes place.
If the job runs to completion, qsub exits with the exit status of the job. For job arrays, blocking qsub waits until the entire job array is complete, then returns the exit status of the job array.
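So a blocking PBS Pro submission would look roughly like the SGE and LSF sketches above:
# Block until the job ends; qsub exits with the job's exit status.
qsub -W block=true .bpipe/commandtmp/1/cmd.sh
echo $? > .bpipe/commandtmp/1/cmd.exit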
In the case of Torque, it's not clear if it supports block=true; does anyone have a Torque cluster they can check that with?
For SLURM, there isn't a direct option for sbatch that I can find; however, there is a kind of workaround for it here: https://groups.google.com/forum/#!topic/slurm-devel/B2KdLOyN0sU
Unfortunately this breaks the nice template system we were going for; we would probably need a wrapper script which parses the template for salloc commands.
Since this sounds like it is evolving towards a more general revamp of how Bpipe is interacting with resource managers I have created a branch 'sync_rm' to track work on it.
I have a concern, however, about the proposed solution of submitting jobs using synchronous commands. The reason we have an asynchronous model at the moment is that the synchronous model requires us to run an ongoing process on the head node for each currently active job. If there are many active jobs, that adds up to a heavy burden on the cluster head node, and indeed in practice we have had a whole set of bugs/problems where people hit resource limits (file handle limits being the main one, but not the only one). My experience is that cluster admins take a dim view of people who run hundreds of processes on the head node. It's worth noting that they also take a dim view of people putting a big burden on the cluster manager by polling for the status of hundreds of jobs continuously, so it's an issue either way; it is just more manageable in the asynchronous case because we can rate limit the polling. I am curious what others in this thread think about this issue.
There is a separate but related stream around implementing DRMAA support, which in theory might address many of these concerns all at once, without resorting to a synchronous model for submitting the jobs.
Well, I administer a few clusters (though smaller ones), and I sure do prefer -- at this time -- that Bpipe simply blocks: that is just one blocking process per pipeline instead of possibly hundreds of polls.
SGE has a way to inspect the result of a job without messing with the job at all (e.g. qacct -j), but that requires the cluster to be configured so that this information is still there. I think the way where Bpipe simply blocks until completion is the simplest one.
Regarding the evolution into a more general solution: I am merely pointing out that there seems to be a lot of duplication in the way executors work (especially since local and cluster executors are treated somewhat differently; I fail to see why that is "a good thing").
I personally think that we should just go forward with this for LSF and SGE, as it actually fixes a few bugs where Bpipe simply breaks -- #69, #142, #151 (that last one was closed by @gdevenyi in favor of this one, and I believe the intention was to go forward and get a working Bpipe again).
Of course DRMAA would be nice, but is there any hope it can be done in the foreseeable future?
> that is just one blocking process per pipeline instead of possibly hundreds of polls.
Just trying to make sure I understand what is proposed here: if a blocking qsub is used, from how I understand it there will not be 1 blocking process per pipeline, but 1 blocking process per parallel path in the pipeline. Some people do execute pipelines that have hundreds of parallel jobs. So we could be talking about hundreds of blocking processes per pipeline. Or am I misunderstanding this?
NB: as a temporary workaround, I committed 3948dfd in the new branch to execute commands in a subshell as originally suggested by @serverhorror. If someone would test this on an SGE cluster that would be very helpful, and then I would merge it to mainline while we continue this discussion.
Thanks everyone!
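For anyone wanting to test: the generated script would then look roughly like this sketch (not the actual commit; the layout follows the cmd.sh example earlier in the thread):
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -terse
#$ -o .bpipe/commandtmp/1/cmd.out
#$ -e .bpipe/commandtmp/1/cmd.err
# Run the user command in a subshell: a premature exit (e.g. from set -e)
# terminates only the subshell, not this wrapper script.
(
set -e
echo "Hello World"
)
# The wrapper keeps running, so the exit file is always written.
echo $? > .bpipe/commandtmp/1/cmd.exit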
Just a note that a candidate fix for this issue is committed in the new branch sge_fixes.
To save confusion: there are a couple of other unrelated fixes to SGE that I checked into master as well; these are also included in the sge_fixes branch.
It would be awesome if someone is up for testing this out! I have tried it out with a toy StarCluster SGE environment but I am not sure how well it generalises to a real life cluster.
The "fix" such as it is, works as follows:
I understand this is not the favored solution mentioned above (having Bpipe run a blocking process). I'm incredibly reluctant to do that because I still can't see how it won't result in potentially hundreds of Bpipe processes on the head node, and I really don't want Bpipe to get a bad name for hogging resources on cluster login nodes.
Thanks @ssadedin that looks interesting, will test!
@ssadedin, after initial testing it looks like the template version works for me. I have some changes I'd like to make to the template; expect a pull request at some point.
So after playing around with the template version, it works well for me.
I would like to make some changes regarding:
However in both cases these options are still "hard coded" in the executor file, from what I can tell because they're "optional" and are inserted in ${additional_options}. This doesn't work well because it means I can't change the defaults for these without digging into the source.
It looks like based on http://docs.groovy-lang.org/latest/html/api/groovy/text/SimpleTemplateEngine.html that the template system can do simple logic, which looks ideal for doing something like
PSEUDOCODE for template line <% if (config.procs) print '#\$ -pe smp config.procs ' else print '' %>
Or... more importantly, allow me to specify my memory variable <% if (config.memory) print '#\$ -l h_vmem config.memory ' else print '' %>
Does this seem sensible, or am I looking at the wrong version of groovy?
Indeed, something like that should work, except you need $ next to the variables and double quotes:
<% if (config.procs) print "#\$ -pe smp $config.procs" else print '' %>
This should also be expressible as
<%= config.procs ? "#\$ -pe smp $config.procs " : '' %>
I'm not completely sure how the newlines will work out in there; just possibly you need to throw in a "\n". Hopefully SGE doesn't mind the #$ options having some extra newlines between them?
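Putting those together, the relevant lines of the default template might read like this (a sketch; whether h_vmem or virtual_free is right is site-specific, as discussed above):
<%= config.procs ? "#\$ -pe smp $config.procs\n" : '' %>
<%= config.memory ? "#\$ -l h_vmem=$config.memory\n" : '' %>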
I think this makes sense and I'm happy to put it into the default template.
Thanks for the testing and feedback!
Something else is bugging me a little bit. The SGE procs being expressed as "orte <number>" actually breaks the portability of the pipeline config, since other executors obviously don't understand the "orte" (or "smp" or whatever parallel environment is used). So I am wondering if we could change that to be two options?
sge_pe = "smp" procs = 1
Then the "procs" is portable between clusters, and for most users I think putting one global "sge_pe" setting at the top of their config will be all they have to do to port a bpipe.config to SGE.
Any thoughts about that as a change? I think we could shield it from being a breaking change by making Bpipe auto-parse the old form (procs="orte 1" => sge_pe="orte", procs=1).
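The auto-parsing could be a few lines in the executor, along these lines (a Groovy sketch of the idea only; the config property access is hypothetical):
// Accept the legacy combined form procs="orte 1" and split it into
// sge_pe + procs, so existing bpipe.config files keep working.
def procsValue = config.procs?.toString()?.trim()
if (procsValue?.contains(' ')) {
    def (pe, n) = procsValue.tokenize(' ')
    config.sge_pe = pe
    config.procs = n.toInteger()
}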
If we have a template why would we need any configuration at all as bpipe parameters?
I think there's no way to satisfy all the SGE configurations, so why not just let it be defined by the user -- a plain qsub foo.sh works for almost any script if the script doesn't contain SGE directives; anything else is "special site configuration" in my book. So why even bother and think about sane defaults for the world?
What @ssadedin is referring to is the procs setting in the pipeline configuration. On "local" configs it's just a number, but on SGE you have to specify the parallel environment as well. Personally, I'd like to be able to switch a pipeline between local and cluster with a minimum of config changes. His proposed changes sound like a good idea.
@ssadedin followup re: parallel environment specification.
The current Sge executor requests SLOTS=N if a parallel environment is not specified (which is often the case for a single threaded job). SGE rejects these jobs because you must use a PE, even if SLOTS=1.
I got around this by commenting out the additional_options line in that test so that jobs that don't explicitly have a PE specified get submitted without slots.
> The current Sge executor requests SLOTS=N if a parallel environment is not specified (which is often the case for a single threaded job). SGE rejects these jobs because you must use a PE, even if SLOTS=1.
This behaviour was in the original implementation, which was coded by a contributor (Paolo). Since I don't have real experience with SGE I didn't want to disturb it, as I assumed it serves a useful purpose. Yet like you, I find that as it is, a naive script that only sets "procs=n" simply doesn't work with an SGE cluster in its default setup due to this setting. So arguably this is something worth changing. Perhaps it was specific to Paolo's environment and not typical of most people's clusters.
In other words, I think I am arguing that we should default sge_pe to "orte" and only do the SLOTS=N behavior if the user specifically sets sge_pe = blank?
Ok, I pushed yet another change to the sge_fixes branch. Now if you specify in bpipe.config:
sge_pe="smp" procs=1
Then it will send these as a parallel environment: "-pe smp 1". In the off chance that anybody is using slots directly, it still does that if you ONLY specify procs. But if you put sge_pe as a global default in your bpipe.config it can now apply to your whole pipeline. And it should still support the combined form of procs ("procs='smp 1'"), though I would consider that deprecated.
Additionally, I took the opportunity to remove the additional_options and now all the options are constructed inside the template. So it is a bit more customizable now.
Thanks @ssadedin these new changes work as expected. I love the templating system, this is absolutely the way forward for implementing LSF, Torque, SLURM, Condor etc in a much more flexible way with far less code.
I have one other request regarding the parallel environment changes. Right now, if a job is single threaded it still gets #$ -pe smp 1 added to its job submission. For my cluster this is okay, as my SMP parallel environment is fully available across the same slots as the rest of the cluster, but SGE has the flexibility to restrict PEs to machines or subsets of a cluster, which other environments might use. If I'm truly running a single-threaded job, I'd like to avoid requesting a PE. Is that an option in the code? Or might the template engine support an if config.procs > 1 construct? I can't find appropriate docs on the templating language to check.
Thanks for the testing @gdevenyi - so great to have someone who can try all this out on a real cluster and give constructive suggestions.
> Or might the template engine support an if config.procs > 1 construct?
That sounds like a great idea. I also like that it means that a user just trying out Bpipe can try it on their cluster without specifying a PE at all. Only when they specify procs>1 will they have to figure that part out. I'll add it in.
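For reference, the guarded line could look like this in the template (a sketch using the sge_pe and procs keys from above, and assuming procs has already been parsed to a number):
<%= config.procs > 1 ? "#\$ -pe $config.sge_pe $config.procs\n" : '' %>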
Oh yes of course! Sometimes clusters don't have a PE at all, or have a PE named differently than smp. Currently all jobs would fail for this config as well :)
Calling @serverhorror: does this new style of cmd.sh wrapper with a separate command file fix your deadlock issues?
Calling @tucano: does this new template method give you the flexibility you need to work on your SGE config?
Hello,
we just found a situation where Bpipe reproducibly hangs and basically deadlocks.
bpipe.config:
What happens is that Bpipe creates a script to be submitted to SGE (and possibly other PBS systems) that looks like the following:
Because the script exits before the exit file can be written, Bpipe never detects that the script has exited and will hang indefinitely. The problem is that it is now impossible to use existing scripts that have a set -e somewhere in them, because they will simply deadlock the whole execution. A workaround while this bug is fixed is to wrap all calls within exec in a subshell, as can be seen in the deadLockWorkaround statement. I suggest choosing a different path rather than injecting arbitrary code into a script.
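For reference, the workaround in a pipeline stage looks roughly like this (Bpipe DSL sketch; deadLockWorkaround is the stage name mentioned above, the wrapped script name is hypothetical):
deadLockWorkaround = {
    // Wrapping the call in a subshell keeps a premature exit (set -e)
    // inside the wrapped script from killing the generated SGE script.
    exec "( bash ./script_with_set_e.sh )"
}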