nextflow-io / nf-hack18

Nextflow workshop 2018 -- Training pages -> https://nextflow-io.github.io/nf-hack18/

[HACK TOPIC] Process unit testing #7

Open ewels opened 5 years ago

ewels commented 5 years ago

Notes from our discussion about how a new generic unit testing module could work:

Present: @ewels, @fstrozzi, @LukeGoodsell, @micans

Testing would essentially act as a wrapper, with three steps:

  1. Set up the required files with the storeDir
    • Needs some kind of nextflow syntax to be able to define these test files
  2. Run nextflow with a super lenient caching method (file name only?) {new nf feature} so that it skips all of the upstream steps
    • Super lenient caching will mean that we can have empty files for all upstream steps except the penultimate process
    • Don't have staged cached files for the process that we're interested in. Nextflow will run this process for real.
    • Stop the workflow as soon as this process is done {new nf feature}
      • Squash the output channel
      • Required in case something goes wrong and the output filenames change. If this happens then the cache won't be valid and all downstream steps will run again.
  3. Check the results from that process

This tool could then be run, specifying which process should be tested. It could then be run in parallel for each process of interest.
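
A minimal sketch of how such a wrapper might be organised in Python, purely for illustration. The names `run_process_test` and `run_all` are hypothetical, and `run_pipeline` is a placeholder callable standing in for the Nextflow invocation, since the lenient caching and stop-after-process features described above do not exist yet.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_process_test(process, fixtures_dir, run_pipeline, check):
    """Run the three steps for one process and return the check result."""
    with tempfile.TemporaryDirectory() as tmp:
        # Step 1: stage the prepared upstream files as the storeDir.
        store_dir = Path(tmp) / "store"
        shutil.copytree(fixtures_dir, store_dir)
        # Step 2: run the pipeline so that only `process` executes for real.
        # `run_pipeline` is a placeholder for the actual Nextflow call.
        output_dir = run_pipeline(process, store_dir)
        # Step 3: check the results from that process.
        return check(Path(output_dir))


def run_all(processes, fixtures_root, run_pipeline, check):
    """Test each process of interest in parallel; returns {process: result}."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(
                run_process_test, name, Path(fixtures_root) / name, run_pipeline, check
            )
            for name in processes
        }
        return {name: future.result() for name, future in futures.items()}
```

Passing the runner and the checker in as callables keeps the staging logic testable without Nextflow installed; a real tool would presumably replace `run_pipeline` with a subprocess call.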

ewels commented 5 years ago

x-ref https://github.com/nf-core/tools/issues/209

LukeGoodsell commented 5 years ago

My understanding of the user story:

  1. A developer writes a workflow and wishes to add unit tests for one or more processes.

  2. She prepares one or more sets of inputs for the processes being tested.

  3. She then writes a test file that would look something like:

    // First test
    process firstTest {
        // List of processes that are allowed to run
        runProcesses 'bqsr bwa'
    
        // A list of upstream process outputs that should be used
        upstreamOutputs {
            'fastqc': ['testdata/firstTest/input/data_R1.fastq.gz', 'testdata/firstTest/input/data_R2.fastq.gz'],
            'multiqc': ['testdata/firstTest/input/multiqc.html']
        }
    
        // Code to run to test the output
        test:
        '''
        #!/usr/bin/env python
    
        # Insert test code here
        '''
    }

  4. The developer can then either run the test manually or incorporate it into CI, with a command like:

     nextflow test tests/firstTest.nft

LukeGoodsell commented 5 years ago

I have a few suggested ideas for consideration:

  1. Along with a superlenient hash method, we could implement a new process executor, none, that will cause the pipeline to fail if a process is launched that has it as an executor. This will allow us to use a Nextflow config file to assign this as the executor for all processes that should not be executed and thus prevent accidental run-away execution.

  2. Above I suggested a Nextflow-based test script, but I think we can re-use an existing test framework. For example, we could extend Python's unittest classes to create an nfunittest class. This would then have a setup step that generates a nextflow config file and runs the pipeline according to the test specification, and then runs tests of the data in Python. For example:

    import nfunittest
    
    class FirstTest(nfunittest.TestCase):
        def setUp(self):
            self.runProcesses = ['bqsr', 'bwa']
            # Map each upstream process to the output files it should appear to have produced
            self.upstreamOutputs = {
                'fastqc': ['testdata/firstTest/input/data_R1.fastq.gz', 'testdata/firstTest/input/data_R2.fastq.gz'],
                'multiqc': ['testdata/firstTest/input/multiqc.html']
            }
    
        def test_bwa(self):
            out_file = self.outputDir + '/data.bam'
            ...

    Such test scripts can make use of the many existing testing and profiling tools and methodologies, rather than requiring us to write something new.

  3. We would probably keep the nftest tools/code in a separate repo since it's not written in Groovy.

LukeGoodsell commented 5 years ago

I also don't think we're going to be able to get the correct hash name for each process, so I'd prefer injecting a storeDir for all processes via a Nextflow config file. We would, however, need a way for the process name to be included in the storeDir directive, and I can't see a way to do that currently with config files. That might need to be added to Nextflow.

The user would then have to create empty/dummy files for all preceding, unused processes in the form:

testdata/my_test/input/[PROCESS_NAME]/[OUTPUT_FILENAME]

The contents of testdata/my_test/input would then be copied to a temporary working directory, which would be injected as the storeDir (plus process name) for each process via a config file.
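
A rough sketch of that staging step in Python, under two assumptions: that the Nextflow version in use supports `withName` process selectors in config files, and that `storeDir` can be set from config like other process directives. Writing one selector block per staged process sidesteps the need to interpolate the process name into a single directive. The function names `stage_inputs` and `store_dir_config` are hypothetical.

```python
import shutil
from pathlib import Path


def stage_inputs(test_input_dir, work_dir):
    """Copy <test_input_dir>/<PROCESS_NAME>/<OUTPUT_FILENAME> into work_dir."""
    staged = Path(work_dir) / "store"
    shutil.copytree(test_input_dir, staged)
    return staged


def store_dir_config(staged):
    """Build config text giving each staged process its own storeDir."""
    lines = ["process {"]
    # One sub-directory per process name, as in the layout described above.
    for process_dir in sorted(p for p in staged.iterdir() if p.is_dir()):
        lines.append(f"    withName: '{process_dir.name}' {{")
        lines.append(f"        storeDir = '{process_dir.as_posix()}'")
        lines.append("    }")
    lines.append("}")
    return "\n".join(lines) + "\n"
```

The returned text could be written to a temporary file and passed to Nextflow with the `-c` option.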

All processes that aren't to be executed should then have the none executor (mentioned above), and the test script will then run Nextflow. This will:

  1. Allow selective testing of specific processes, using controlled inputs.
  2. Prevent other processes from running.
  3. Require minimal changes to Nextflow.
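
For illustration, the executor part of that config could be generated along these lines. Note that the `none` executor is only proposed in this thread and does not exist in Nextflow, the function name `blocked_executor_config` is hypothetical, and the sketch assumes the full list of process names is known up front.

```python
def blocked_executor_config(all_processes, run_processes, executor="none"):
    """Build config text assigning `executor` to every process not under test.

    `executor` defaults to the proposed (not yet existing) 'none' executor.
    """
    blocked = sorted(set(all_processes) - set(run_processes))
    lines = ["process {"]
    for name in blocked:
        lines.append(f"    withName: '{name}' {{")
        lines.append(f"        executor = '{executor}'")
        lines.append("    }")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Example: only bqsr and bwa are allowed to run.
print(blocked_executor_config(["fastqc", "multiqc", "bwa", "bqsr"], ["bqsr", "bwa"]))
```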

Thoughts?

ewels commented 5 years ago

Great! Makes a lot of sense :+1:

One nitpick: I love the executor: none idea but maybe nextflow should exit successfully instead of with a failure? This would be more helpful for the test exit status check.

Is there a comparable unit testing framework in Java? Nextflow already has unit tests, so I guess there must be. I think it would be nice to keep this inside Nextflow rather than in a separate program, if possible.

Phil

ewels commented 5 years ago

Also: instead of telling downstream processes not to run, it could be better to squash the output channels of the selected process that will run. Then we don’t need to know the shape of the DAG before writing the config - nextflow can just pick one process at a time and squash its output channels.

Note that I think executor: none could still be a generally useful thing to have. It would make it easy for people to write a custom config script that selectively disables parts of other people’s pipelines, for example. At the moment we have tonnes of when: !params.skipProcessFoo conditions in a few pipelines, which could be removed with this.

piotr-faba-ardigen commented 5 years ago

Has anyone given any thought to how this looks with DSL-2 on the table? I'd like to be able to unit test a process in a module.

ewels commented 5 years ago

There was quite a bit of discussion around this at the 2019 meeting. However, I've not seen any working examples yet.

sfehrmann commented 1 year ago

just cross-referencing as this popped up in the same search https://code.askimed.com/nf-test/getting-started/

Disclaimer: I had at best a 5 min glimpse at nf-test

ewels commented 1 year ago

Thanks @sfehrmann! This GitHub issue is documenting a Nextflow user meeting from 4 years ago, it's not an issue for active development :) nf-test didn't exist at the time, but you're absolutely right that it's a great tool 👍🏻 So good for future googlers..

schultzm commented 1 year ago

There's also this: https://github.com/LUMC/pytest-workflow