h3abionet / TADA

TADA - Targeted Amplicon Diversity Analysis - a DADA2-focused Nextflow workflow for any targeted amplicon region
MIT License

DSL2 migration #14

Open cjfields opened 3 years ago

cjfields commented 3 years ago

We're seeing some fragmentation across our workflows due to switches in technology (PacBio, Shoreline, Loop, etc.), and a migration to DSL2 would help tremendously. This is a simple tracker to plot a course forward and note the tickets that would benefit from it.

wbazant commented 3 years ago

A possible migration plan

Introduction

I've time-boxed a day for figuring out how I'd do it. Here is a plan with a few commands that might be helpful to someone if a migration is attempted.

Inspect overall structure

The logic is duplicated across a few files and it's not obvious how to put them together. Handily, there are comments in the code for each step, and the execution flows from top to bottom.

This searches for the comments and notes down the lines they are at, so the files can be split into logical chunks:

cat \
  <( grep -n 'env nextflow\| * Step' main.nf pacbio.nf loop.nf \
     | perl -pe 's{#!/usr/bin/env nextflow}{ * Step 0: Start}' ) \
  <( wc -l main.nf pacbio.nf loop.nf \
     | grep -v total \
     | perl -pe 's{  (\d+) (.*)}{$2:$1:  * Step 11: End}' )

main.nf:1: * Step 0: Start
main.nf:183: * Step 1: Filter and trim (run per sample?)
main.nf:467: * Step 2: Learn error rates (run on all samples)
main.nf:564: * Step 3: Dereplication, Sample Inference, Merge Pairs
main.nf:570: * Step 4: Construct sequence table
main.nf:755: * Step 8: Remove chimeras
main.nf:799: * Step 9: Taxonomic assignment
main.nf:992: * Step 8.5: Rename ASVs
main.nf:1181: * Step 10: Align and construct phylogenetic tree
main.nf:1188: * Step 10a: Alignment
main.nf:1259:     * Step 10b: Construct phylogenetic tree
main.nf:1386: * Step 10: Track reads
pacbio.nf:1: * Step 0: Start
pacbio.nf:173: * Step 1: Filter and trim (run per sample?)
pacbio.nf:331: * Step 2: Learn error rates (run on all samples)
pacbio.nf:377: * Step 3: Dereplication, Sample Inference, Merge Pairs
pacbio.nf:383: * Step 4: Construct sequence table
pacbio.nf:449: * Step 8: Remove chimeras
pacbio.nf:484: * Step 9: Taxonomic assignment
pacbio.nf:673: * Step 8.5: Rename ASVs
pacbio.nf:845: * Step 10: Align and construct phylogenetic tree
pacbio.nf:852: * Step 10a: Alignment
pacbio.nf:923:     * Step 10b: Construct phylogenetic tree
pacbio.nf:1050: * Step 10: Track reads
loop.nf:1: * Step 0: Start
loop.nf:291: * Step 2: Learn error rates (run on all samples)
loop.nf:336: * Step 3: Dereplication, Sample Inference, Merge Pairs
loop.nf:342: * Step 4: Construct sequence table
loop.nf:405: * Step 8: Remove chimeras
loop.nf:440: * Step 9: Taxonomic assignment
loop.nf:629: * Step 8.5: Rename ASVs
loop.nf:803: * Step 10: Align and construct phylogenetic tree
loop.nf:810: * Step 10a: Alignment
loop.nf:881:     * Step 10b: Construct phylogenetic tree
loop.nf:1008: * Step 10: Track reads
main.nf:1637:  * Step 11: End
pacbio.nf:1289:  * Step 11: End
loop.nf:1249:  * Step 11: End

Split into modules

Here's how to pick a section from each file corresponding to filtering and trimming, based on line numbers above:

cat \
  <( perl -ne 'print if $. >=183 && $. < 467' main.nf   ) \
  <( perl -ne 'print if $. >=173 && $. < 331' pacbio.nf ) \
  <( perl -ne 'print if $. >=1 && $. < 291' loop.nf ) \
 > filterAndTrim.nf

Edit each module

Remove DSL1 bits

Remove every "from" clause from the processes. Replace each "into" with "emit:", naming the output after the first channel listed in the "into". In vim:

%s/ from.*//

%! perl -pe 'if(/ into/){s{ into}{, emit:}; s{(.*emit:.*?),.*}{$1};}'

Clean up

Go through the new file manually. Remove commented-out code and any if() blocks - leave just the process definitions.

Try to recreate each flow through the code

Since "from" and "into" are gone in DSL2, how processes connect up is defined elsewhere - in a workflow block. Write that wiring, just for the module.

I found it helpful to have two files open next to each other - the new common module and the original file that had the flow - and to search for process names, since those didn't change. For the 16S workflow, I ended up with:

workflow filterAndTrim16sPaired {
  take: reads
  main:

  readsPaired = Channel.fromFilePairs( reads )
  runFastQC(readsPaired) | runMultiQC
  filterAndTrim(readsPaired)
  runFastQC_postfilterandtrim(filterAndTrim.out.filteredReadsforQC) | runMultiQC_postfilterandtrim
  mergeTrimmedTable(filterAndTrim.out.trimTracking)

  emit:
  filtered = filterAndTrim.out
  trimmedReadTracking = mergeTrimmedTable.out.trimmedReadTracking
}

Test each module

Since nothing really changes in the individual processes, any breakages should be quite obvious when testing modules one at a time.

Put everything together

The new main.nf should contain just the module imports, composed into workflows corresponding to usage, e.g. 16sPaired, ITSPaired, Loop, PacBio. Different workflows would then be executed with an -entry command-line argument.
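As a rough sketch of what that top-level file could look like (module paths and workflow names here are illustrative, not the actual TADA layout; note that workflow names used with -entry must be valid identifiers, so they can't start with a digit like "16sPaired"):

```nextflow
nextflow.enable.dsl = 2

// Hypothetical module layout - one file per logical chunk split out above
include { filterAndTrim16sPaired } from './modules/filterAndTrim'
include { learnErrors            } from './modules/learnErrors'

// One entry workflow per technology / amplicon variant
workflow SixteenSPaired {
    filterAndTrim16sPaired(params.reads)
    learnErrors(filterAndTrim16sPaired.out.filtered)
}

workflow PacBio {
    // PacBio-specific composition of the shared modules goes here
}
```

Each variant would then be run with something like: nextflow run main.nf -entry SixteenSPaired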

cjfields commented 3 years ago

I've set up a preliminary project to start some organization on this. I'd like to start on something in the next few months, on a branch of course, as we have several workflows with similar steps. We should also think about some of the newer considerations with the nf-core DSL2 base and configs, in particular process-specific configs that allow some custom arguments.