nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Add executor for OAR batch scheduler #708

Closed · Tintest closed this issue 5 years ago

Tintest commented 6 years ago

Hello,

As discussed by email, I would like to be able to use Nextflow with the OAR batch scheduler.

We made some changes on the OAR_executor branch with @bzizou, without any success so far. I hope you will be able to help us.

Thank you.

pditommaso commented 6 years ago

This is definitely possible. The first thing to do is to identify the command lines required to:

  1. submit a job execution
  2. check the current job/queue status
  3. kill one or more jobs.

Please describe below in this thread an example command line for each of the above points, including the exact output it produces. Then I will be able to advise you on how to continue.

julienthevenon commented 6 years ago

Thanks for your answer,

  1. submit a job execution: oarsub (http://oar.imag.fr/docs/latest/user/commands/oarsub.html)

A useful addition would be support for the oarsub --array option, to parallelize job submissions over multiple samples and compute filenames:

--array-param-file Submit a parametric array job. Each non-empty line of "FILE" defines the parameters for a subjob. All the subjobs have the same characteristics (script, requirements) and can be identified by an environment variable $OAR_ARRAY_INDEX. '#' is the comment sign.

pditommaso commented 6 years ago

Please format commands using Markdown code formatting to improve the readability of the text.

Then, please provide concrete examples for those commands and, above all, the exact output they produce. That's important because the executor needs to parse the output text to extract the relevant information.

Finally, let me clarify that I'm very happy to accept a pull request for this feature, and I'm ready to advise you on how to implement it. However, I won't be able to implement it myself because I cannot test it.

julienthevenon commented 6 years ago

Sorry, this is my first post on GitHub. I tried the following; please tell me if it is correct for you.

First we write the command to be submitted to oarsub or wrap it up in a shell script: nano test.sh echo blabla sleep 10 echo blibli

then submit the job to oar with parameters: oarsub --resource /nodes=1/core=1,walltime=00:01:10 --directory `pwd` --name bloblo --project epimed "bash test.sh" oar echoes in the shell: [ADMISSION RULE] Modify resource description with type constraints [PROJECT] Adding project constraints: (team='epimed' or team='ciment' or team='visu') OAR_JOB_ID=7157387

jobs generate 2 files: stderr and stdout -rw-r--r-- 1 ju4667th l-iab 0 May 21 18:31 OAR.bloblo.7157387.stderr -rw-r--r-- 1 ju4667th l-iab 14 May 21 18:31 OAR.bloblo.7157387.stdout more OAR.bloblo.7157387.stdout blabla blibli

The oarstat command oarstat -j 7157387 Job id Name User Submission Date S Queue ---------- -------------- -------------- ------------------- - ---------- 7157387 bloblo ju4667th 2018-05-21 18:31:24 T default The T is for Terminated. oarstat -fj 7157387 Job_Id: 7157387 job_array_id = 7157387 job_array_index = 1 name = bloblo project = epimed owner = ju4667th state = Terminated wanted_resources = -l "{type = 'default'}/network_address=1/core=1,walltime=0:1:10" types = dependencies = assigned_resources = 805 assigned_hostnames = luke41 queue = default command = bash test.sh exit_code = 0 (0,0,0) launchingDirectory = /home/ju4667th/analyses/sandbox stdout_file = OAR.bloblo.7157387.stdout stderr_file = OAR.bloblo.7157387.stderr jobType = PASSIVE properties = ((desktop_computing = 'NO') AND (team='epimed' or team='ciment' or team='visu')) AND visu = 'NO' reservation = None walltime = 0:1:10 submissionTime = 2018-05-21 18:31:24 startTime = 2018-05-21 18:31:36 stopTime = 2018-05-21 18:31:47 cpuset_name = ju4667th_7157387 initial_request = oarsub --resource /nodes=1/core=1,walltime=00:01:10 --directory /home/ju4667th/analyses/sandbox --name bloblo --project epimed bash test.sh message = R=1,W=0:1:10,J=B,N=bloblo,P=epimed (Karma=0.001,quota_ok) scheduledStart = no prediction resubmit_job_id = 0 events = 2018-05-21 18:31:48> SWITCH_INTO_TERMINATE_STATE:[bipbip 7157387] Ask to change the job state

pditommaso commented 6 years ago

Please review the GitHub Markdown guide on how to quote code.

Try to format it as below:

Command X:

oarsub --xx --yy

Output:

foo bar
julienthevenon commented 6 years ago

thanks, another try:

First we write the command to be submitted to oarsub or wrap it up in a shell script:

nano test.sh
echo blabla
sleep 10
echo blibli

then submit the job to oar with parameters:

oarsub --resource /nodes=1/core=1,walltime=00:01:10 --directory `pwd` --name bloblo --project epimed "bash test.sh"

oar echoes in the shell:

[ADMISSION RULE] Modify resource description with type constraints
[PROJECT] Adding project constraints: (team='epimed' or team='ciment' or team='visu')
OAR_JOB_ID=7157387

jobs generate 2 files: stderr and stdout

-rw-r--r-- 1 ju4667th l-iab 0 May 21 18:31 OAR.bloblo.7157387.stderr
-rw-r--r-- 1 ju4667th l-iab 14 May 21 18:31 OAR.bloblo.7157387.stdout

output

more OAR.bloblo.7157387.stdout
blabla
blibli

The oarstat command to show a job:

oarstat -j 7157387
Job id Name User Submission Date S Queue
---------- -------------- -------------- ------------------- - ----------
7157387 bloblo ju4667th 2018-05-21 18:31:24 T default

The oarstat -fj command shows the full details of a job:

oarstat -fj 7157387

output

Job_Id: 7157387
job_array_id = 7157387
job_array_index = 1
name = bloblo
project = epimed
owner = ju4667th
state = Terminated
wanted_resources = -l "{type = 'default'}/network_address=1/core=1,walltime=0:1:10"
types =
dependencies =
assigned_resources = 805
assigned_hostnames = luke41
queue = default
command = bash test.sh
exit_code = 0 (0,0,0)
launchingDirectory = /home/ju4667th/analyses/sandbox
stdout_file = OAR.bloblo.7157387.stdout
stderr_file = OAR.bloblo.7157387.stderr
jobType = PASSIVE
properties = ((desktop_computing = 'NO') AND (team='epimed' or team='ciment' or team='visu')) AND visu = 'NO'
reservation = None
walltime = 0:1:10
submissionTime = 2018-05-21 18:31:24
startTime = 2018-05-21 18:31:36
stopTime = 2018-05-21 18:31:47
cpuset_name = ju4667th_7157387
initial_request = oarsub --resource /nodes=1/core=1,walltime=00:01:10 --directory /home/ju4667th/analyses/sandbox --name bloblo --project epimed bash test.sh
message = R=1,W=0:1:10,J=B,N=bloblo,P=epimed (Karma=0.001,quota_ok)
scheduledStart = no prediction
resubmit_job_id = 0
events =
2018-05-21 18:31:48> SWITCH_INTO_TERMINATE_STATE:[bipbip 7157387] Ask to change the job state
pditommaso commented 6 years ago

Great, much better!

How do you kill a job? Also, is it possible to define the job submission directives in the script header as, for example, with PBS (shown below)?

#!/bin/bash
#PBS -A <account_no>               (only for account based usernames)
#PBS -l walltime=1:00:00
#PBS -l select=1:ncpus=1 
#
./my_application
julienthevenon commented 6 years ago

Here is the oardel command:

$oarsub --resource /nodes=1/core=1,walltime=00:01:10 --directory `pwd` --name bloblo --project epimed "bash test.sh"
[ADMISSION RULE] Modify resource description with type constraints
[PROJECT] Adding project constraints: (team='epimed' or team='ciment' or team='visu')
OAR_JOB_ID=7159343
$ oardel 7159343
Deleting the job = 7159343 ...REGISTERED.
The job(s) [ 7159343 ] will be deleted in a near future.

Yes, it is possible to include directives in the header with the -S option:

nano test2.sh
#! /bin/bash
#OAR -n bloblo
#OAR -l nodes=1,core=1,walltime=00:01:00
#OAR --project epimed

echo blabla
sleep 10
echo blibli

launched as

oarsub -S "./test2.sh"
bzizou commented 6 years ago

Hi, you can see our first try here: https://github.com/nextflow-io/nextflow/compare/master...bzizou:OAR_executor

The problem we have now is that OAR requires the submitted batch script to be executable (+x mode). We haven't found a clean way to change the file mode before submission.

pditommaso commented 6 years ago

Very good. My suggestion is that the OAR executor specify the job requirements as meta directives in the script header.

The Nextflow executor mechanism creates two files for each job: .command.sh is the task command as provided by the user in the process definition, and .command.run is the launcher script that manages the execution with the OAR batch scheduler.

To implement support for OAR, follow these steps:

  1. Fork this project and clone it on your computer.
  2. Create a new class nextflow.executor.OarExecutor that extends AbstractGridExecutor.
  3. Register the new executor type in the executorsMap as oar.
  4. Implement the OarExecutor methods using the appropriate OAR commands and directives. You can use the SgeExecutor as a reference.
  5. Implement the unit tests in the class OarExecutorTest, see SgeExecutorTest as an example.
  6. To set up the development environment and compile and run the project, see the README file.
  7. Commit your changes and open a pull request if you want to contribute the executor to the main code base.

Following these steps, the implementation should be straightforward with a basic knowledge of Groovy or Java. I'm happy to help or discuss any problem or detail further as the implementation progresses.
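
To give an idea of the overall shape, here is a rough, untested sketch of what the class could look like; the method names mirror the other grid executors (e.g. SgeExecutor) and should be double-checked against AbstractGridExecutor, while the OAR commands are taken from the examples posted above:

package nextflow.executor

import java.nio.file.Path
import nextflow.processor.TaskRun

// Rough shape of the new executor (step 2 above); the concrete
// method bodies are worked out in the rest of this thread.
class OarExecutor extends AbstractGridExecutor {

    // scheduler directives written in the .command.run header
    // (job name, launch dir, stdout/stderr files, queue, resources...)
    protected List<String> getDirectives(TaskRun task, List<String> result) {
        // ...
        return result
    }

    // OAR directives are comment lines starting with this token
    protected String getHeaderToken() { '#OAR' }

    // command line used to submit the .command.run script
    List<String> getSubmitCommandLine(TaskRun task, Path scriptFile) {
        // ... something based on `oarsub`
    }

    // extract the job id from the text printed by oarsub (OAR_JOB_ID=...)
    def parseJobId(String text) {
        // ...
    }

    // jobs are cancelled with `oardel <job id>`
    protected List<String> getKillCommand() { ['oardel'] }

    // poll the queue with `oarstat` and map its one-letter state column
    // (e.g. T = Terminated) to the QueueStatus values
    protected List<String> queueStatusCommand(Object queue) { /* ... */ }
    protected Map<String, QueueStatus> parseQueueStatus(String text) { /* ... */ }
}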

pditommaso commented 6 years ago

you can see our first try here

Nice! Please open a pull request, so I can comment on the code.

pditommaso commented 6 years ago

The problem we have now is that OAR requires the submitted batch script to be executable (+x mode).

In the getSubmitCommandLine you can change the script permissions as shown below:

scriptFile.setPermissions(7,0,0)
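
That is, something along these lines (an untested sketch; the remaining oarsub options are placeholders to be adapted):

List<String> getSubmitCommandLine(TaskRun task, Path scriptFile) {
    // OAR refuses to run the wrapper script unless it is executable (+x)
    scriptFile.setPermissions(7,0,0)
    return [ 'oarsub', '-S', scriptFile.getName() ]   // placeholder options
}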
Tintest commented 6 years ago

Hello Paolo,

With @bzizou we added the setPermissions call and modified the parseJobId function to work correctly with OAR.
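
For reference, the parseJobId change is roughly the following (simplified; it just picks up the OAR_JOB_ID=<id> line that oarsub prints, as shown earlier in this thread):

def parseJobId(String text) {
    for( String line : text.readLines() ) {
        // oarsub prints e.g. `OAR_JOB_ID=7157387` on standard output
        if( line.startsWith('OAR_JOB_ID=') )
            return line.substring('OAR_JOB_ID='.length()).trim()
    }
    throw new IllegalStateException("Invalid OAR submit response:\n$text")
}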

There is no error left in the .nextflow.log file, but the corresponding .command.log file contains: /bin/bash: .command.run: command not found

I tried to hardcode a ./ in front of .command.run to make it "executable", but I got the same result.

Here is the OAR output for a Nextflow process:

oarstat -fj 7224138
Job_Id: 7224138
    job_array_id = 7224138
    job_array_index = 1
    name = nf-fastq2sorted
    project = epimed
    owner = tintest
    state = Terminated
    wanted_resources = -l "{type = 'default'}/resource_id=1,walltime=2:0:0"
    types =
    dependencies =
    assigned_resources = 737
    assigned_hostnames = luke37
    queue = default
    command = .command.run
    exit_code = 32512 (127,0,0)
    launchingDirectory = /home/tintest/PROJECTS/Test_nextflow_OAR
    stdout_file = /home/tintest/PROJECTS/Test_nextflow_OAR/work/1a/5b6f49cf53fb1f0e866f68cfb4e5ea/.command.log
    stderr_file = /home/tintest/PROJECTS/Test_nextflow_OAR/work/1a/5b6f49cf53fb1f0e866f68cfb4e5ea/.command.log
    jobType = PASSIVE
    properties = ((desktop_computing = 'NO') AND (team='epimed' or team='ciment' or team='visu')) AND visu = 'NO'
    reservation = None
    walltime = 2:0:0
    submissionTime = 2018-05-23 13:51:03
    startTime = 2018-05-23 13:51:11
    stopTime = 2018-05-23 13:51:12
    cpuset_name = tintest_7224138
    initial_request = oarsub -S -n nf-fastq2sorted .command.run; #OAR -n nf-fastq2sorte; #OAR -O /home/tintest/PROJECTS/Test_nextflow_OAR/work/1a/5b6f49cf53fb1f0e866f68cfb4e5ea/.command.lo; #OAR -E /home/tintest/PROJECTS/Test_nextflow_OAR/work/1a/5b6f49cf53fb1f0e866f68cfb4e5ea/.command.lo; #OAR -q defaul; #OAR --project epime
    message = R=1,W=2:0:0,J=B,N=nf-fastq2sorted,P=epimed (Karma=0.000,quota_ok)
    scheduledStart = no prediction
    resubmit_job_id = 0
    events =
2018-05-23 13:51:12> SWITCH_INTO_TERMINATE_STATE:[bipbip 7224138] Ask to change the job state

Thank you.

pditommaso commented 6 years ago

I guess this is the problem:

launchingDirectory = /home/tintest/PROJECTS/Test_nextflow_OAR

It should be

/home/tintest/PROJECTS/Test_nextflow_OAR/work/1a/5b6f49cf53fb1f0e866f68cfb4e5ea/

Have you specified the work directory with the --directory option in the directives?

Tintest commented 6 years ago

No, I have not specified a hardcoded --directory.

It's strange because OAR gets the .command.log path right, so the launchingDirectory should be right as well, shouldn't it?

Tintest commented 6 years ago

In the OAR manual : -d, --directory=<dir> Specify the directory where to launch the command (default is current directory)

So we have to specify it. Where should I do that in the executor code?

pditommaso commented 6 years ago

As shown here.
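
That is, the -d option can be emitted together with the other scheduler options in the getDirectives method, roughly along these lines (a sketch only; the option names follow the OAR examples above, and the exact code belongs in your branch):

protected List<String> getDirectives(TaskRun task, List<String> result) {
    result << '-d' << task.workDir.toString()                           // launch directory
    result << '-n' << getJobNameFor(task)                               // job name
    result << '-O' << task.workDir.resolve('.command.log').toString()   // stdout file
    result << '-E' << task.workDir.resolve('.command.log').toString()   // stderr file
    if( task.config.queue )
        result << '-q' << task.config.queue.toString()
    return result
}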

Tintest commented 6 years ago

So I did specify the -d option, but I still get the same error: it does not find .command.run, which is really strange.

So I decided to try to force it by specifying the workDir in the oarsub command line, by doing:

[ 'oarsub', '-S', '-n', getJobNameFor(task), task.workDir + '/' + scriptFile.getName() ]

Which is dirty... but, I don't know why, the slash is just completely ignored. The resulting command is: oarsub -S -n nf-fastq2sorted /home/tintest/PROJECTS/Test_nextflow_OAR/work/95/4e71c80e755f57a5922b66cb220c96.command.run

Any idea for a cleaner solution?

Thank you.

pditommaso commented 6 years ago

This path looks broken: /home/tintest/PROJECTS/Test_nextflow_OAR/work/95/4e71c80e755f57a5922b66cb220c96.command.run

There should be a / before .command.run.

Also, the -d option should work. Make sure it's included in the .command.run header. Note that you can debug the script by simply changing into the task work dir and submitting the job manually, i.e.

cd /home/tintest/PROJECTS/Test_nextflow_OAR/work/95/4e71c80e755f57a5922b66cb220c96
oarsub -S .command.run
Tintest commented 6 years ago

Hello,

I know the path looks broken; I may have expressed myself poorly. I tried to add a / to the path within the getSubmitCommandLine function, by doing this: [ 'oarsub', '-S', '-n', getJobNameFor(task), task.workDir + '/' + scriptFile.getName() ] but it's like the / is ignored:

oarsub -S -n nf-fastq2sorted /home/tintest/PROJECTS/Test_nextflow_OAR/work/95/4e71c80e755f57a5922b66cb220c96.command.run

Anyway, I removed the / and here is the header of a .command.run with the -d option set up:

#OAR -n nf-fastq2sorted
#OAR -O /home/tintest/PROJECTS/Test_nextflow_OAR/work/2d/579ec983a80bee7f4b61067bf55044/.command.log
#OAR -E /home/tintest/PROJECTS/Test_nextflow_OAR/work/2d/579ec983a80bee7f4b61067bf55044/.command.log
#OAR -d /home/tintest/PROJECTS/Test_nextflow_OAR/work/2d/579ec983a80bee7f4b61067bf55044
#OAR -q default
#OAR --project epimed
cd /home/tintest/PROJECTS/Test_nextflow_OAR/work/2d/579ec983a80bee7f4b61067bf55044

# NEXTFLOW TASK: fastq2sortedbam (1)

Everything looks fine to me, but I still get: /bin/bash: .command.run: command not found

Thank you.

pditommaso commented 6 years ago

Have you tried to run the job using just the command oarsub -S .command.run from a shell terminal?

Tintest commented 6 years ago

I just did. oarsub -S .command.run does not seem to work, but oarsub -S ./.command.run does (the job is now waiting).

pditommaso commented 6 years ago

Oh, so it looks like an OAR quirk. Anyhow, if the latter works, the following should work as well:

    List<String> getSubmitCommandLine(TaskRun task, Path scriptFile ) {
        return [ 'oarsub', '-S', "./${scriptFile.getName()}"  ]
    }

(the job name is already specified with a directive in the script header, therefore the -n option shouldn't be needed here)

Tintest commented 6 years ago

OK! It seems to work; now I'm getting some errors related to my poor adaptation of my code to this new cluster.

Thank you a thousand times! I cannot tell you right now that everything is working, but it's definitely a great improvement. I'll let you know if I run into other problems; if everything is working in a few days I will open a pull request 👍

Thank you again.

pditommaso commented 6 years ago

Any progress on this?

Tintest commented 6 years ago

I've been busy on a side project since the beginning of the week. I should get back to OAR next week :)

pditommaso commented 6 years ago

Great! No hurry, just curious about the status of this.

Tintest commented 6 years ago

Hello,

I'm back on OAR.

So the job is correctly scheduled, but Nextflow is passing the OAR options as a single string, given the following syntax in my Nextflow config file:

process {
  executor='oar'
  queue='default'
  clusterOptions = '--project epimed -l /core=16,walltime=00:30:00'
}

But OAR expects a separate string for each parameter. Could you tell me how to fix that?

Thank you.

pditommaso commented 6 years ago

Push your code and link it here, please.

Tintest commented 6 years ago

Here it is : https://github.com/Tintest/nextflow/tree/OAR_executor

pditommaso commented 6 years ago

The clusterOptions is added here.

As you can see, it's added exactly as specified by the user. What you need to do is split that string into tokens and add each of them to the result list.

To split the string you can use the splitter helper, which takes care of keeping together values enclosed within quote characters (however, you need to verify that this is compatible with the syntax expected by OAR).

Another option is to provide clusterOptions as a List object, so you can just add its items to the result list.
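
In other words, where the options are appended (result being the list of tokens being built), something along these lines (a sketch only; a plain whitespace split stands in for the real splitter helper):

def opts = task.config.clusterOptions
if( opts instanceof Collection ) {
    // option 2: the user provides a list, e.g. clusterOptions = ['--project', 'epimed']
    opts.each { result << it.toString() }
}
else if( opts ) {
    // option 1: split the string into separate tokens; a naive whitespace split
    // is used here, the real helper also keeps quoted values together
    result.addAll( opts.toString().tokenize(' ') )
}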

Tintest commented 6 years ago

Here we are,

I finally had some time to finalize this work. I ended up using the .tokenize() function with a semicolon separator, because the OAR syntax is quite complicated and the semicolon seems to be a "banned" character in it.
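
Concretely, the relevant part looks roughly like this (simplified; the actual code is in the branch linked below):

// split the user supplied clusterOptions on ';' so that each OAR option
// becomes its own directive, e.g.
// clusterOptions = '--project epimed;-l /core=16,walltime=00:30:00'
if( task.config.clusterOptions ) {
    task.config.clusterOptions.toString().tokenize(';').each { opt ->
        result << opt.trim()
    }
}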

I removed some of the pre-existing options; everything will be specified through clusterOptions, which is more convenient for me.

Tell me if it seems ok for you (https://github.com/Tintest/nextflow/tree/OAR_executor).

I have never done a pull request before; which branch should I choose? Thank you for your help.

pditommaso commented 6 years ago

Nice! It's quite easy: push the latest changes, then on your GitHub fork page you will find a big "Create a new pull request" button. As simple as that.

Tintest commented 6 years ago

Yes, quite easy indeed, but I can only select bzizou's nextflow fork. Is that OK? And then will he have to do a pull request as well?

pditommaso commented 6 years ago

Oh, that's because you have forked another fork, not the main project.

However, when you open the pull request there's a combo box; select base fork: nextflow-io/nextflow, base: master.

Tintest commented 6 years ago

Done !

pditommaso commented 6 years ago

Well, I'm not sure who you have sent it to :)

It isn't in the NF repo: https://github.com/nextflow-io/nextflow/pulls

pditommaso commented 6 years ago

It looks like you have created it in your own fork: https://github.com/Tintest/nextflow/pull/1

Tintest commented 6 years ago

It should be OK now... It was more difficult than expected because my branch was behind and I'm a GitHub newbie :)

pditommaso commented 5 years ago

As for #766, I'm closing this because it looks stalled.