markowetzlab / scDNAseq-workflow

calling absolute copy numbers on single cell DNA sequencing data - stagging repo for public release
4 stars 2 forks source link

Instruction/control of read duplicate marking #21

Closed TBradley27 closed 6 months ago

TBradley27 commented 7 months ago

Hello,

At the minute, I don't think there is any instruction in the documentation, or any control logic in the pipeline to regulate/validate input aligned bam files with respect to whether reads have been marked as duplicates or not

The problem this causes is that a user may pass a bam file through the pipeline in which duplicates have not been marked, which may give them misleading results. Controlling for this could become an additional step to validate user bam files.

If duplicate marking is handled internally, it would be useful if this was documented somewhere

leachim commented 7 months ago

Thanks for bringing that up. I have added a note about this to the documentation. It might be nice to add an example to the FAQ about processing fastq files with Samtools to sort/mark duplicates. Do you know how to mark duplicates with samtools? in my pipeline I use picard if I remember correctly.

TBradley27 commented 6 months ago

Thanks, and apologies for the late reply.

Marking duplicates with samtools can be done using the following command:

samtools markdup -@ {threads} {input.bam} {output.bam}

For what it is worth, I have thought about it a bit more, and think it is very difficult to try and validate this internally without doing unnecessary/redundant work. Because you can't distinguish between a bam in which duplicates have not been marked, and a bam in which duplicates have been marked and subsequently removed. So simply trying to detect the presence of any marked duplicates in the bam would not be appropriate. So I guess it is the responsibility of the user to make sure they follow the instructions in the documentation