sanger-tol / readmapping

Nextflow DSL2 pipeline to align short and long reads to genome assembly. This workflow is part of the Tree of Life production suite.
https://pipelines.tol.sanger.ac.uk/readmapping
MIT License

Added branched handling of ULI inputs in filter_pacbio #115

Open tkchafin opened 2 months ago

tkchafin commented 2 months ago

Ultra low-input (ULI) libraries (tracked in the "library" samplesheet column) will now be run through pbmarkdup. Note that nothing is removed in the test file, but I have marked the PB CRAM as "uli" to trigger the test.

Closes https://github.com/sanger-tol/readmapping/issues/72
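To illustrate the branching described in this PR, here is a minimal Python sketch of routing samplesheet rows by library type. It is not the pipeline's Nextflow code: the helper name `split_by_library`, the example samplesheet contents, and the case-insensitive match are assumptions made for this example. Only the "library" column and the "uli" value come from the description above.

```python
import csv
import io


def split_by_library(rows, uli_label="uli"):
    """Partition samplesheet rows into ULI and non-ULI groups.

    Hypothetical helper: compares the "library" column case-insensitively
    against `uli_label`; rows with a missing or empty value are non-ULI.
    """
    uli, other = [], []
    for row in rows:
        library = (row.get("library") or "").strip().lower()
        (uli if library == uli_label else other).append(row)
    return uli, other


# Made-up samplesheet content, for illustration only.
SAMPLESHEET = """sample,datatype,datafile,library
s1,pacbio,reads1.cram,uli
s2,pacbio,reads2.cram,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLESHEET)))
uli_rows, other_rows = split_by_library(rows)
# In this sketch, uli_rows would be sent through pbmarkdup,
# while other_rows would skip that step.
```

Rows with an empty "library" value fall into the non-ULI group here, so existing samplesheets would behave as before under this assumption.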


muffato commented 2 months ago

Shane first runs lima on a database of ULI adapters, and then pbmarkdup

for ULI data, we need to run an extra lima to trim the ULI adapter sequence

https://github.com/sanger-tol/tol-workflows/blob/main/wr/wr-import-pacbio-ccs#L323-L338

Do we need lima here too?

tkchafin commented 2 months ago

For Sanger data, this will already have been done (actually, mark/rm duplicates is done as well), so technically I think we can treat ULI reads the same as LI/other prep types for production purposes.

For full ULI support for external data, special handling of adapter trimming makes sense, although the pipeline as-is generally assumes that most read filtering/QC has been done prior to running. Maybe we could think about adding an optional subworkflow to take in raw data?

tkchafin commented 2 months ago

@reichan1998 Can you review? I am tracking the lima/adapter removal suggestion in a separate ticket on pre-alignment QC, but for now we can merge the pbmarkdup integration if it is all working.