Open sorelfitzgibbon opened 7 months ago
Additionally, for my metapipeline
run with this situation it only failed after having aligned the normal
and 4 of 5 tumor
FASTQs, whereas I assume it would be trivial to check for this early in the run.
This is a good point; with some discussion, we'll want to add something like:
lane
to distinguish the filesYes, this is what dataset registration is trying to mitigate (only internally). There are many different FASTQ formats and it's hard to accommodate everything but it would be useful to add some checks/automation. Also, this pipeline adds .Seq#
to the RG
to make it unique so it's unlikely to cause serious issues as long as the library information is correctly provided.
Generally, yes, the error align-DNA ends up running into is a filename collision since the initial alignments are named based on a combination of library and lane
I think you meant this line - https://github.com/uclahs-cds/pipeline-align-DNA/blob/a6769c332aeaade8a2128d402e1e0bc34638eda2/module/align_DNA_BWA_MEM2.nf#L61 (also for HISAT2)
Then, yes, this should be updated. Internally registered FASTQs should be fine as they should have unique lane info for each library.
https://github.com/uclahs-cds/pipeline-align-DNA/blob/76c63a162c81b26bc1aaf714bbd35f03c491579b/main.nf#L71
Library
,sample
andlane
are not sufficient to identify unique read groups. The same sample from the same library can be sequenced on two different machine runs and have the same lane. In this case the pipeline yields an error, but the user is likely to attempt to get around it by renaming the libraries to distinct names. This can lead to massive numbers of false positives, dependent on background duplication rate, due to failure to mark duplicates across the two runs.