nextflow-io / patterns

A curated collection of Nextflow implementation patterns
http://nextflow-io.github.io/patterns/
MIT License

Adds pattern describing how to publish files from channel using collectFile #12

Closed winni2k closed 1 year ago

pditommaso commented 6 years ago

Thanks for this contribution. However, my understanding is that this is a variation of the collect-into-file pattern with the storeDir parameter added, isn't it?

In that case I would just add a note to the collect-into-file pattern.

winni2k commented 6 years ago

Maybe. If so, then the problem statement of collect-into-file is too narrow and would need adjusting. I think a code example would also be helpful.

All I can say is that as a new user of nextflow, I did not find that the collect-into-file pattern, or any pattern, helped me figure out how to publish files from a channel into a directory 🤷‍♀️

Perhaps an example in the documentation showing the use of collectFile with storeDir might be sufficient?

winni2k commented 6 years ago

After reading a bit more around the storeDir directive, it looks like what I really want is a publishDir method so that I could write something like unzipped_ch.publishDir('unzipped_files'). Could that be done?

pditommaso commented 6 years ago

Sorry for the late reply. I had a summer break.

I'm still a bit lost in this thread. My understanding is that you want a process's outputs to be stored in a specific directory. The best way to achieve that is to use the publishDir directive in the process definition, optionally with a pattern to publish only specific files.

However, in this PR you are suggesting the collectFile operator. It could be used for that, but it is intended more for collecting multiple files into a single one.
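
For illustration, the publishDir approach described above might look roughly like the sketch below. It uses the DSL1 syntax seen elsewhere in this thread, which newer Nextflow releases no longer accept. The process name `unzip` and the `gunzip` command are hypothetical; `unzipped_ch` and `unzipped_files` are borrowed from the earlier comment, and `mode` and `pattern` are standard publishDir options.

```groovy
// Hypothetical input channel, matching the glob used elsewhere in this thread
reads_ch = Channel.fromPath('reads/*_1.fq.gz')

process unzip {
    // Copy only outputs matching the glob into unzipped_files/
    publishDir 'unzipped_files', mode: 'copy', pattern: '*.fq'

    input:
    file gz from reads_ch

    output:
    file '*.fq' into unzipped_ch

    script:
    """
    gunzip -c ${gz} > ${gz.baseName}
    """
}
```

Here publishing is tied to the process that creates the files, which is why it does not cover files produced only by a chain of channel operators.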

winni2k commented 6 years ago

First of all: I am a new user of nextflow, so it's likely that I am confused.

What I would like to do is apply arbitrary transformations to the files in a channel, and then just publish the result to a directory without building a process.

I think this works:

Channel.fromPath('reads/*_1.fq.gz')
    .filter()
    .another_transformation()
    .set {final_output_channel}

process dump_final_output_channel {
    publishDir 'my_results'
    input: 
    file output_file from final_output_channel

    "echo ignore this message"
}

But I find this cleaner:

Channel.fromPath('reads/*_1.fq.gz')
    .filter()
    .another_transformation()
    .collectFile(storeDir: 'my_results')

I would find this even cleaner:

Channel.fromPath('reads/*_1.fq.gz')
    .filter()
    .another_transformation()
    .storeDir('my_results')

pditommaso commented 6 years ago

I see. In that case, the second snippet is the way to go:

Channel.fromPath('reads/*_1.fq.gz')
    .filter()
    .another_transformation()
    .collectFile(storeDir: 'my_results')

In this case, if you want to contribute a pattern, you need to clarify that you want to store the result of a chain of operators (not the direct output of a process).

winni2k commented 6 years ago

Excellent! I think we're getting somewhere. Let me revise this PR...

winni2k commented 4 years ago

@jimhavrilla that's great! I forgot that this PR was still open!

winni2k commented 4 years ago

Actually, @jimhavrilla could you perhaps post a snippet of the code that you ended up writing? Perhaps we could use it in this pattern.

jimhavrilla commented 4 years ago

Sure, super rough, but:

    process filter {
        // Process directives take a value without '='
        executor 'sge'
        queue 'all.q'
        clusterOptions '-V -cwd -l virtual_free=8G -S /bin/bash'

        tag "${chr}"

        input:
        set chr, sample, bgen from bgen_ch

        output:
        file "${chr}filter.pgen" into pgen_ch
        file "${chr}filter.pvar" into pvar_ch
        file "${chr}filter.psam" into psam_ch

        shell:
        '''
        plink2 --bgen !{bgen} ref-first --sample !{sample} --keep keep.txt --remove remove.txt --make-pgen --out !{chr}filter --memory 8000
        '''
    }

    pgen_ch
        .collectFile(storeDir: storepath)
    pvar_ch
        .collectFile(storeDir: storepath)
    psam_ch
        .collectFile(storeDir: storepath)

dstrib commented 4 years ago

Quick question: why is the Channel.collectFile() parameter called "storeDir" and not "publishDir"? Among process directives, "publishDir" is for saving output files, and "storeDir" is for permanent caching of files. I was able to accomplish my goal of "publishing" my files from collectFile using the "storeDir" parameter, but it took me a while to figure out that this was the parameter to use, because I did not expect "storeDir" to be used for final output files.

bentsherman commented 1 year ago

Since the storeDir option is documented in the Nextflow docs, and since we're trying to move away from publishing outputs in the dataflow logic, I will close this one. We are working on a better way to define workflow inputs and outputs; until then, I think the Nextflow docs are sufficient.