nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io

task.workDir is null before the within-quotes script code #4435

Closed · Jay-uu closed 11 months ago

Jay-uu commented 12 months ago

Bug report

Hi! Thanks for all your work making Nextflow! I'm developing a pipeline for genome analysis and have encountered a need to do local checks within a process before an if/else script block. My issue is related to #2628 and #3962, but I hope it's different enough to warrant opening a new issue.

It seems to me that task.workDir isn't initialised until after one of the script sections starts. This causes a problem when I want to inspect the input before choosing which code to run.

Expected behavior and actual behavior

Expected behaviour: when a script block is defined, any code within it is executed in, or at least has access to, task.workDir.
Actual behaviour: if a conditional script block is used, any Groovy code outside the quotes is executed in the launchDir, at a point where the workDir doesn't exist yet.

Steps to reproduce the problem


process mOTUs_to_pangenome {
    debug true
    input:
    path(mOTU_dir)
    shell:
    bin_list = files(task.workDir/mOTU_dir+"/*.fa") // I want this to read the files from the input, but task.workDir is null
    c = bin_list.size()
    single_bin = bin_list[0]
    if( c > 1 )
    '''
    #bash code
    echo "hey how come your mom lets you have two files?"
    '''

    else
    '''
    #!/usr/bin/env python
    #python code!
    print("ah, just one file I see")
    '''
}

workflow {
    pg_dir = Channel.fromPath("/home/jay/c__Cyanobacteriia_mOTU_0", type: "dir", checkIfExists: true)
    mOTUs_to_pangenome(pg_dir)
}

Program output


ERROR ~ Error executing process > 'mOTUs_to_pangenome (1)'

Caused by:
  Cannot invoke method div() on null object -- Check script 'issue.nf' at line: 6


Additional context

taskDir_null.nextflow.log

bentsherman commented 12 months ago

I think this is because the task work dir depends on the task hash, which is not computed until after the task script is evaluated.

But you don't need to do this because mOTU_dir is already a path inside the task directory, so just resolve against that variable:

bin_list = files(mOTU_dir.resolve("*.fa"))

Jay-uu commented 11 months ago

Thanks so much! I saw some mentions about using resolve in other places but couldn't figure out the syntax.

Edit: I tried it, but bin_list comes up empty. From the log file I can see that it doesn't look for the input in the right place: nextflow.Nextflow - No such file or directory: c__Cyanobacteriia_mOTU_0/ -- Skipping visit (log.nextflow.log attached)

If I move the input directory so that it's present in the launchDir it runs, but since I want this process in the middle of a pipeline, that is of course not a valid solution.

Jay-uu commented 11 months ago

Hi again! To make it easier for you to test, I'm sending the updated code. The input directory is located in a different directory than the pipeline script and contains a number of .fa files. Trying to resolve against the input variable no longer causes an error, but it also does not correctly read the input files when they're not present in the launch directory. The outputs below show (1) what happens when the input directory is in a different location than the launch directory and (2) when it is in the same location as the launch directory.

process mOTUs_to_pangenome {
    debug true
    input:
    path(mOTU_dir)
    shell:
    println("Checking number of bins")
    bin_list = files(mOTU_dir.resolve("*.fa")) //Runs but no result. Log file: No such file or directory: <input name>/ -- Skipping visit
    //bin_list = files(task.workDir.resolve("*.fa")) //Causes error: Cannot invoke method resolve() on null object
    println("Present bins:")
    println(bin_list)
    c = bin_list.size()
    single_bin = bin_list[0]
    println("single_bin variable is:")
    println(single_bin)

    if( c > 1 )
    '''
    #bash code
    echo "hey how come your mom lets you have two files?"
    '''

    else
    '''
    #!/usr/bin/env python
    #python code!
    print("ah, just one file I see")
    '''
}

workflow {
    pg_dir = Channel.fromPath("/home/jay/c__Cyanobacteriia_mOTU_0", type: "dir", checkIfExists: true)
    mOTUs_to_pangenome(pg_dir)
}

Program output when input is in a different directory than the pipeline script:

N E X T F L O W  ~  version 23.04.3
Launching `issue.nf` [distraught_mandelbrot] DSL2 - revision: 177c3cbb16
executor >  local (1)
[f5/40124d] process > mOTUs_to_pangenome (1) [100%] 1 of 1 āœ”
Checking number of bins
Present bins:
[]
single_bin variable is:
null
ah, just one file I see

Program output when the input is in the same directory as the pipeline script:

N E X T F L O W  ~  version 23.04.3
Launching `issue.nf` [desperate_monod] DSL2 - revision: 177c3cbb16
executor >  local (1)
[26/4b7ad4] process > mOTUs_to_pangenome (1) [100%] 1 of 1 āœ”
Checking number of bins
Present bins:
[c__Cyanobacteriia_mOTU_0/mock1.maxbin.006.fasta.contigs.fa, c__Cyanobacteriia_mOTU_0/test.fa]
single_bin variable is:
c__Cyanobacteriia_mOTU_0/mock1.maxbin.006.fasta.contigs.fa
hey how come your mom lets you have two files?

bentsherman commented 11 months ago

I think a better approach here would be something like this:

process mOTUs_to_pangenome {
    debug true
    input:
    path(bin_list, arity: '1..*')
    shell:
    println("Present bins:")
    println(bin_list)
    c = bin_list.size()
    single_bin = bin_list[0]
    println("single_bin variable is:")
    println(single_bin)

    if( c > 1 )
    '''
    #bash code
    echo "hey how come your mom lets you have two files?"
    '''

    else
    '''
    #!/usr/bin/env python
    #python code!
    print("ah, just one file I see")
    '''
}

workflow {
    pg_files = Channel.fromPath("/home/jay/c__Cyanobacteriia_mOTU_0/*.fa", checkIfExists: true).collect() // gather all matching files into a single list
    mOTUs_to_pangenome(pg_files)
}

Basically, perform the glob outside of the process and collect all matching files into a single list. I also used the new arity option so that the input is always presented as a list, even when only one file matches.

bentsherman commented 11 months ago

You can take this further by moving the rest of the shell code into channel logic in the workflow, then you could e.g. have two different processes for the different cases, but I will leave that as an exercise for the reader šŸ˜„
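
For illustration, here is a minimal sketch of that idea using the branch operator (the process names and branch labels are made up, and it assumes the same input path as above):

process many_bins {
    debug true
    input:
    path(bins, arity: '2..*')
    script:
    '''
    echo "hey how come your mom lets you have two files?"
    '''
}

process one_bin {
    debug true
    input:
    path(bin, arity: '1')
    script:
    '''
    #!/usr/bin/env python
    print("ah, just one file I see")
    '''
}

workflow {
    branched = Channel.fromPath("/home/jay/c__Cyanobacteriia_mOTU_0/*.fa", checkIfExists: true)
        .collect()
        .branch {
            multi: it.size() > 1
            single: true
        }
    many_bins(branched.multi)
    one_bin(branched.single)
}

The size check now lives in the workflow, so neither process needs any Groovy code before its script string.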

Jay-uu commented 11 months ago

Thanks for your suggestions! I might decide to modularize the pipeline in the future and have multiple workflows, which would make the workflow check + multiple processes solution a little brittle, but it definitely works for now. I'll also keep an eye out for when the arity option makes it into a stable release. Thank you for taking the time to help!

bentsherman commented 11 months ago

BTW the arity feature is available in 23.10.0
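
For reference, a quick sketch of the arity values (illustrative variable names; requires 23.10.0 or later):

input:
path(one_file, arity: '1')      // exactly one file
path(two_files, arity: '2')     // exactly two files, staged as a list
path(bin_list, arity: '1..*')   // one or more files, always presented as a list

If the number of staged files doesn't match the declared arity, the task fails with an error.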