Closed winni2k closed 1 year ago
Maybe. If so, then the problem statement of collect-into-file is too narrow and would need adjusting. I think a code example would also be helpful.
All I can say is that as a new user of nextflow, I did not find that the collect-into-file
pattern, or any pattern, helped me figure out how to publish files from a channel into a directory 🤷♀️
Perhaps an example in the documentation showing the use of collectFile with storeDir might be sufficient?
After reading a bit more around the storeDir directive, it looks like what I really want is a publishDir method so that I could write something like unzipped_ch.publishDir('unzipped_files'). Could that be done?
Sorry for the late reply. I had a summer break.
I'm still a bit lost in this thread. My understanding is that you want a process's outputs to be stored in a specific directory. The best way to achieve that is to use the publishDir directive in the process definition, optionally using a pattern to filter only specific files.
However, in this PR you are suggesting the collectFile operator. It could be used for that, but it is intended more for collecting multiple files into a single one.
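For illustration, a minimal sketch of the publishDir suggestion above, written in the DSL1 syntax used elsewhere in this thread. The process name, channel names, file paths, and the gunzip step are hypothetical and only serve to show the directive with a pattern filter:

```groovy
// Hypothetical input channel; the glob is an assumption for this example
zipped_ch = Channel.fromPath('reads/*_1.fq.gz')

process unzip {
    // Copy only outputs matching the glob '*.fq' into unzipped_files/
    publishDir 'unzipped_files', mode: 'copy', pattern: '*.fq'

    input:
    file zipped from zipped_ch

    output:
    // baseName drops the last extension, e.g. sample_1.fq.gz -> sample_1.fq
    file "${zipped.baseName}" into unzipped_ch

    script:
    """
    gunzip -c ${zipped} > ${zipped.baseName}
    """
}
```

With this directive the files are published as a side effect of the process itself, so no extra "dump" process or operator is needed when the files come straight from a process.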
First of all: I am a new user of nextflow, so it's likely that I am confused.
What I would like to do is apply arbitrary transformations to the files in a channel, and then just publish the result to a directory without building a process.
I think this works:
Channel.fromPath('reads/*_1.fq.gz')
    .filter()
    .another_transformation()
    .set { final_output_channel }

process dump_final_output_channel {
    publishDir 'my_results'

    input:
    file output_file from final_output_channel

    "echo ignore this message"
}
But I find this cleaner:
Channel.fromPath('reads/*_1.fq.gz')
    .filter()
    .another_transformation()
    .collectFile(storeDir: 'my_results')
I would find this even cleaner:
Channel.fromPath('reads/*_1.fq.gz')
    .filter()
    .another_transformation()
    .storeDir('my_results')
I see. In that case, the second snippet is the way to go:
Channel.fromPath('reads/*_1.fq.gz')
    .filter()
    .another_transformation()
    .collectFile(storeDir: 'my_results')
In this case, if you want to contribute a pattern, you need to clarify that you want to store the result of a chain of operators (not the direct output of a process).
Excellent! I think we're getting somewhere. Let me revise this PR...
@jimhavrilla that's great! I forgot that this PR was still open!
Actually, @jimhavrilla could you perhaps post a snippet of the code that you ended up writing? Perhaps we could use it in this pattern.
Sure, super rough, but:
process filter {
    // Directives take the form `name value`, not `name = value`
    executor 'sge'
    queue 'all.q'
    clusterOptions '-V -cwd -l virtual_free=8G -S /bin/bash'
    tag "${chr}"

    input:
    set chr, sample, bgen from bgen_ch

    output:
    file "${chr}filter.pgen" into pgen_ch
    file "${chr}filter.pvar" into pvar_ch
    file "${chr}filter.psam" into psam_ch

    shell:
    '''
    plink2 --bgen !{bgen} ref-first --sample !{sample} --keep keep.txt \
        --remove remove.txt --make-pgen --out !{chr}filter --memory 8000
    '''
}

// Write the files from each output channel into storepath
pgen_ch.collectFile(storeDir: storepath)
pvar_ch.collectFile(storeDir: storepath)
psam_ch.collectFile(storeDir: storepath)
Quick question, why is the Channel.collectFile() parameter called "storeDir" and not "publishDir"? In process directives, the "publishDir" process directive is for saving output files, and the "storeDir" directive is for permanent caching of files. I was able to accomplish my goal of "publishing" my files from collectFile using the "storeDir" parameter, but it took me a bit to figure out to use this parameter because I was expecting "storeDir" not to be used for final output of files.
Since the storeDir option is documented in the Nextflow docs, and since we're trying to move away from publishing outputs in the dataflow logic, I will close this one. We are working on a better way to define workflow inputs and outputs; until then, I think the Nextflow docs are sufficient.
Thanks for this contribution. However, my understanding is that this is a variation of the collect-into-file pattern with the storeDir parameter added, isn't it? In that case I would just add a note to the collect-into-file pattern.