Potentially serious problem with read group definitions

uclahs-cds / pipeline-align-DNA

Nextflow pipeline to align paired DNA FASTQs and sort, mark duplicates, and index resulting alignment

https://uclahs-cds.github.io/pipeline-align-DNA/

GNU General Public License v2.0

4 stars 1 forks source link

Potentially serious problem with read group definitions #308

Open sorelfitzgibbon opened 5 months ago

sorelfitzgibbon commented 5 months ago

https://github.com/uclahs-cds/pipeline-align-DNA/blob/76c63a162c81b26bc1aaf714bbd35f03c491579b/main.nf#L71

Library, sample and lane are not sufficient to identify unique read groups. The same sample from the same library can be sequenced on two different machine runs and have the same lane. In this case the pipeline yields an error, but the user is likely to attempt to get around it by renaming the libraries to distinct names. This can lead to massive numbers of false positives, dependent on background duplication rate, due to failure to mark duplicates across the two runs.

sorelfitzgibbon commented 5 months ago

Additionally, for my metapipeline run with this situation it only failed after having aligned the normal and 4 of 5 tumor FASTQs, whereas I assume it would be trivial to check for this early in the run.

yashpatel6 commented 5 months ago

This is a good point; with some discussion, we'll want to add something like:

Try to automatically extract the run identifier from the FASTQ header
If it fails, print out a warning and tack on a counter to the lane to distinguish the files

tyamaguchi-ucla commented 5 months ago

Yes, this is what dataset registration is trying to mitigate (only internally). There are many different FASTQ formats and it's hard to accommodate everything but it would be useful to add some checks/automation. Also, this pipeline adds .Seq# to the RG to make it unique so it's unlikely to cause serious issues as long as the library information is correctly provided.

yashpatel6 commented 5 months ago

Generally, yes, the error align-DNA ends up running into is a filename collision since the initial alignments are named based on a combination of library and lane

tyamaguchi-ucla commented 5 months ago

I think you meant this line - https://github.com/uclahs-cds/pipeline-align-DNA/blob/a6769c332aeaade8a2128d402e1e0bc34638eda2/module/align_DNA_BWA_MEM2.nf#L61 (also for HISAT2)

Then, yes, this should be updated. Internally registered FASTQs should be fine as they should have unique lane info for each library.