nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Handle customised sbatch returning the jobid by itself #190

Closed joshuabhk closed 8 years ago

joshuabhk commented 8 years ago

Hi,

I tried to run nextflow on the NIH biowulf cluster, which uses slurm. It works well with the local executor, but it fails to run with the slurm executor. Could you help fix the problem?

Thank you!

Bong-Hyun

[kimb8@biowulf nextflow]$ nextflow run output_file.nf
N E X T F L O W ~ version 0.20.1
Launching output_file.nf
[warm up] executor > slurm
Error executing process > 'splitLetters (1)'

Caused by: Invalid SLURM submit response: 20418660

Command executed:

sbatch .command.run

Command exit status: 0

Command output: 20418660

Work dir: /gpfs/gsfs4/users/CCBR/user/kimb8/nextflow/work/39/f1af1c47c8ee86d5a1d434026ffd36

Tip: view the complete command output by changing to the process work dir and entering the command: 'cat .command.out'

[kimb8@biowulf nextflow]$ sbatch --version
slurm 15.08.10

joshuabhk commented 8 years ago

My trivial nextflow pipeline is the following.

[kimb8@biowulf nextflow]$ cat output_file.nf

#!/bin/env nextflow

process splitLetters {
    executor 'slurm'
    queue 'ccr'
    time '1h'
    module 'R'
    memory '16 GB'
    cpus 4

    output:
    file 'chunk_*' into letters

    '''
    printf 'Hola' | split -b 1 - chunk_
    '''
}

letters.subscribe { println "File: ${it.name} ==> ${it.text}" }

pditommaso commented 8 years ago

You can do the following:

Change into the task working directory reported in the error message.

There you will find the '.command.run' file that is just a script used to run your SLURM job.

You should be able to launch it using the following command:

sbatch .command.run

Then you should be able to debug why SLURM is failing.

Note: the SLURM directives are at the top of the .command.run file. Likely there's something there that is not valid for your cluster configuration.
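
The debugging steps above can be sketched in shell. This is only a minimal sketch and not part of nextflow: the helper name `show_sbatch_directives` is hypothetical, and it relies on the sbatch convention that directive lines start with `#SBATCH` and are only honored before the first executable line of the script.

```shell
# Hypothetical helper: print the #SBATCH directives that sbatch would honor
# in a job script such as .command.run.
show_sbatch_directives() {
    awk '
        /^#SBATCH/           { print; next }  # a directive: show it
        /^#/ || /^[ \t]*$/   { next }         # other comments and blank lines: skip
                             { exit }         # first command line: directives end here
    ' "$1"
}
```

Run from the task work dir, `show_sbatch_directives .command.run` lists the options to compare against the cluster's queues and limits before resubmitting with `sbatch .command.run`.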

pditommaso commented 8 years ago

The user reported:

Thank you very much for your help! It looks like the job was actually submitted, but the standard output of sbatch is different from what nextflow expects.

The following is my command line input and output from the sbatch.

[kimb8@biowulf 786dec80181b63b8dcec95a2fc686b]$ sbatch .command.run
20430969

It might be a version issue, or the sbatch on our biowulf2 cluster might produce slightly modified output. In any case, can you help me run nextflow on our cluster?

pditommaso commented 8 years ago

Yes, nextflow expects that SLURM returns the string Submitted batch job <JOBID>.

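It would be quite easy to adapt it to handle your output. The relevant code is this method: https://github.com/nextflow-io/nextflow/blob/master/src/main/groovy/nextflow/executor/SlurmExecutor.groovy#L98-L108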

However I'm wondering if there's any option that could be used to restore the default output in your cluster configuration. Could you try to investigate with your sysadmins?
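
For illustration, accepting both response formats comes down to a small pattern match. The following is a minimal shell sketch, not the Groovy code in SlurmExecutor; the function name `parse_job_id` is hypothetical, and it assumes job IDs are purely numeric, as in the output shown in this thread.

```shell
# Hypothetical helper: extract the job ID from the text printed by sbatch.
# Accepts the stock "Submitted batch job <JOBID>" message as well as a bare
# numeric job ID, as printed by the wrapper discussed here.
parse_job_id() {
    # Print the digits from the first line matching either form.
    _job_id=$(printf '%s\n' "$1" |
        sed -n -E 's/^[[:space:]]*(Submitted batch job )?([0-9]+)[[:space:]]*$/\2/p' |
        head -n 1)
    if [ -z "$_job_id" ]; then
        # Same wording as the error reported at the top of this thread.
        echo "Invalid SLURM submit response: $1" >&2
        return 1
    fi
    printf '%s\n' "$_job_id"
}
```

A caller could then write `jobid=$(parse_job_id "$(sbatch .command.run)")` and get the same result on either kind of cluster.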

joshuabhk commented 8 years ago

Thank you! I already submitted a ticket about it, but somehow they are not that responsive. But at least I can set up my pipeline through this temporary patch!

pditommaso commented 8 years ago

Does this mean they have fixed the issue? How did they patch it?

joshuabhk commented 8 years ago

No, I asked the admin to look into it. But he has not responded yet.

joshuabhk commented 8 years ago

I am planning to compile the temporary fix by myself today.

pditommaso commented 8 years ago

That sounds good. If it can handle both the standard SLURM output and yours, I could consider a pull request for it.

wresch commented 8 years ago

non-responsive sysadmin here - sbatch on our system is a wrapper around SLURM sbatch hence the failure. Not particularly familiar with groovy but looks easy enough to patch.

joshuabhk commented 8 years ago

Hi Wolfgang,

You are actually the most responsive sys admin I've ever met. :) Thanks a lot for your hard work!

Best,

Bong-Hyun

pditommaso commented 8 years ago

@wresch is there a specific reason to drop the default sbatch stdout message? Couldn't it be done in an optional manner?

wresch commented 8 years ago

@pditommaso Just outputting the jobid is a convenience that makes things like

jobid1=$(sbatch ...)
sbatch --dependency=afterany:$jobid1 ...

trivial.

We could optionally restore the default output, but I'd rather adapt the tool to the cluster than the other way around.

OT: I have a couple of questions about the install - is the best place for that the google group?

Thanks for your help, btw
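
For comparison, on an unmodified SLURM install the dependency chaining shown above can be approximated with sbatch's `--parsable` flag, which prints only the job ID, followed by `;` and the cluster name when one is present. The sketch below is an assumption-laden illustration, not the biowulf wrapper: the helper name `job_id_from_parsable` and the script names in the usage comment are hypothetical.

```shell
# Hypothetical helper: reduce `sbatch --parsable` output to the job ID.
# --parsable prints "<jobid>" or "<jobid>;<cluster>".
job_id_from_parsable() {
    # Drop everything from the first ';' onwards, if there is one.
    printf '%s\n' "${1%%;*}"
}

# Usage sketch (needs a SLURM cluster, so it is left commented out;
# first.sh and second.sh are placeholder script names):
#   jobid1=$(job_id_from_parsable "$(sbatch --parsable first.sh)")
#   sbatch --dependency=afterany:"$jobid1" second.sh
```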

pditommaso commented 8 years ago

If so, IMHO to maintain compatibility with other tools (not just nextflow) you could give your wrapper a different name, e.g. qbatch (just saying). That way a user would have the option to choose which one to use.

If that's not an option, I can consider adding support for your output format.

wresch commented 8 years ago

That's not an option. The wrapper does other things we depend on and this would be too confusing for our hundreds of users. I'm fine patching this locally if you'd rather not take the pull request.

pditommaso commented 8 years ago

OK. For now, I would suggest you patch it locally (let me know if you need help with the build procedure).

Then, let me think a bit if I can find a better alternative to this patch.

wresch commented 8 years ago

Sounds good. I have it working locally now. Thanks for your help

joshuabhk commented 8 years ago

Thank you both so much for your help! I will present nextflow to my colleagues so that they can use it.

pditommaso commented 8 years ago

Fixed in the latest snapshot. You may want to give it a try by running:

NXF_VER=0.20.2-SNAPSHOT nextflow run ..