cjfields opened 3 years ago
I've time-boxed a day for figuring out how I'd do it. Here is a plan with a few commands that might be helpful to someone if a migration is attempted.
The logic is duplicated across a few files and it's not obvious how to put them together. Handily, there are comments in the code for each step, and the execution flows from top to bottom.
This searches for those comments and records the lines they are on, so the files can be split into logical chunks:
cat <( grep -n 'env nextflow\| * Step' main.nf pacbio.nf loop.nf | perl -pe 's{#!/usr/bin/env nextflow}{ * Step 0: Start}' ) <( wc -l main.nf pacbio.nf loop.nf | grep -v total | perl -pe 's{ (\d+) (.*)}{$2:$1: * Step 11: End}' )
main.nf:1: * Step 0: Start
main.nf:183: * Step 1: Filter and trim (run per sample?)
main.nf:467: * Step 2: Learn error rates (run on all samples)
main.nf:564: * Step 3: Dereplication, Sample Inference, Merge Pairs
main.nf:570: * Step 4: Construct sequence table
main.nf:755: * Step 8: Remove chimeras
main.nf:799: * Step 9: Taxonomic assignment
main.nf:992: * Step 8.5: Rename ASVs
main.nf:1181: * Step 10: Align and construct phylogenetic tree
main.nf:1188: * Step 10a: Alignment
main.nf:1259: * Step 10b: Construct phylogenetic tree
main.nf:1386: * Step 10: Track reads
pacbio.nf:1: * Step 0: Start
pacbio.nf:173: * Step 1: Filter and trim (run per sample?)
pacbio.nf:331: * Step 2: Learn error rates (run on all samples)
pacbio.nf:377: * Step 3: Dereplication, Sample Inference, Merge Pairs
pacbio.nf:383: * Step 4: Construct sequence table
pacbio.nf:449: * Step 8: Remove chimeras
pacbio.nf:484: * Step 9: Taxonomic assignment
pacbio.nf:673: * Step 8.5: Rename ASVs
pacbio.nf:845: * Step 10: Align and construct phylogenetic tree
pacbio.nf:852: * Step 10a: Alignment
pacbio.nf:923: * Step 10b: Construct phylogenetic tree
pacbio.nf:1050: * Step 10: Track reads
loop.nf:1: * Step 0: Start
loop.nf:291: * Step 2: Learn error rates (run on all samples)
loop.nf:336: * Step 3: Dereplication, Sample Inference, Merge Pairs
loop.nf:342: * Step 4: Construct sequence table
loop.nf:405: * Step 8: Remove chimeras
loop.nf:440: * Step 9: Taxonomic assignment
loop.nf:629: * Step 8.5: Rename ASVs
loop.nf:803: * Step 10: Align and construct phylogenetic tree
loop.nf:810: * Step 10a: Alignment
loop.nf:881: * Step 10b: Construct phylogenetic tree
loop.nf:1008: * Step 10: Track reads
main.nf:1637: * Step 11: End
pacbio.nf:1289: * Step 11: End
loop.nf:1249: * Step 11: End
Here's how to pick the section from each file corresponding to filtering and trimming, based on the line numbers above:
cat \
<( perl -ne 'print if $. >=183 && $. < 467' main.nf ) \
<( perl -ne 'print if $. >=173 && $. < 331' pacbio.nf ) \
<( perl -ne 'print if $. >=1 && $. < 291' loop.nf ) \
> filterAndTrim.nf
Remove all the `from` clauses from the processes. Replace all the `into` clauses with `emit:`, naming the output after the first `into` channel.
In vim:
%s/ from.*//
%! perl -pe 'if(/ into/){s{ into}{, emit:}; s{(.*emit:.*?),.*}{$1};}'
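To make the effect concrete, here is roughly what those substitutions do to a single process. This is a hypothetical process - the names, channels, and command are made up, not taken from the pipeline:

```nextflow
// Before (DSL1): channels are bound inside the process with `from` and `into`:
//
//     input:
//     tuple val(pairId), file(reads) from trimmedReadPairs
//
//     output:
//     tuple val(pairId), file("*.filt.fastq.gz") into filteredReads, filteredReadsQC

// After (DSL2): no channel bindings; the first `into` channel becomes the emit name
process filterReads {
    input:
    tuple val(pairId), file(reads)

    output:
    tuple val(pairId), file("*.filt.fastq.gz"), emit: filteredReads

    script:
    """
    some_filter_tool ${reads}    # placeholder command
    """
}
```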
Go through the new file manually: remove commented-out code and any `if()` blocks, leaving just the process definitions.
Since `from` and `into` are gone in DSL2, how processes connect up is defined elsewhere, in a `workflow` block. Do that, just for the module.
I found it helpful to have the new common file open next to the originals that had the flow, and to search for process names, since those didn't change. For the 16S workflow, I ended up with:
workflow filterAndTrim16sPaired {
    take: reads

    main:
    readsPaired = Channel.fromFilePairs( reads )
    // QC on the raw reads
    runFastQC(readsPaired) | runMultiQC
    filterAndTrim(readsPaired)
    // QC again on the filtered/trimmed reads
    runFastQC_postfilterandtrim(filterAndTrim.out.filteredReadsforQC) | runMultiQC_postfilterandtrim
    mergeTrimmedTable(filterAndTrim.out.trimTracking)

    emit:
    filterAndTrim.out
    mergeTrimmedTable.out.trimmedReadTracking
}
Since nothing really changes in the individual processes, any breakages should be quite obvious when testing modules one at a time.
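For example, each new module can be exercised on its own with a small throwaway harness along these lines (a sketch - the file name, module path, and parameter are assumptions):

```nextflow
// test_filterAndTrim.nf -- hypothetical harness for running one module in isolation
nextflow.enable.dsl = 2

include { filterAndTrim16sPaired } from './modules/filterAndTrim.nf'

workflow {
    // e.g. nextflow run test_filterAndTrim.nf --reads 'test/*_R{1,2}.fastq.gz'
    filterAndTrim16sPaired( params.reads )
}
```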
The new `main.nf` should have just the imports, and put them into workflows corresponding to usage, e.g. 16sPaired, ITSPaired, Loop, PacBio. Different workflows would then be executed with an `-entry` command-line argument.
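As a rough sketch of what that top-level file could look like (module paths and workflow names here are illustrative assumptions, not the actual layout):

```nextflow
// main.nf -- a minimal sketch; the real file would include every module
nextflow.enable.dsl = 2

include { filterAndTrim16sPaired } from './modules/filterAndTrim.nf'
include { learnErrors            } from './modules/learnErrors.nf'
// ...includes for chimera removal, taxonomy, tree building, read tracking...

// One named workflow per usage, selected at run time with -entry
workflow SixteenSPaired {
    filterAndTrim16sPaired( params.reads )
    // ...wire the remaining modules together here...
}

// workflow PacBio { ... }, workflow Loop { ... }, etc.
```

which would then be run with something like `nextflow run main.nf -entry SixteenSPaired --reads '*_R{1,2}.fastq.gz'`.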
I've set up a preliminary project to start organizing this. I'd like to start on it in the next few months, on a branch of course, as we have several workflows with similar steps. We should also think about some of the newer considerations with the nf-core DSL2 base and configs, in particular the process-specific configs that allow custom arguments.
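For reference, the nf-core-style pattern being referred to is roughly per-process options set in the config - a sketch, with a made-up process name and arguments:

```groovy
// conf/modules.config -- hypothetical per-process options in the nf-core DSL2 style
process {
    withName: 'FILTERANDTRIM' {
        ext.args   = '--truncLen 240,160 --maxEE 2,2'                     // custom tool arguments (made up)
        publishDir = [ path: "${params.outdir}/filtered", mode: 'copy' ]  // per-process output location
    }
}
```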
We're seeing some fragmentation across workflows due to switches in technology (PacBio, Shoreline, Loop, etc.), and a migration to DSL2 would help with that tremendously. This is a simple tracker to plot a course forward and note tickets that would benefit from it.