gt1 / biobambam2

Tools for early stage alignment file processing

bammarkduplicates vs bammarkduplicates2 documentation #55

Open jeffbhasin opened 6 years ago

jeffbhasin commented 6 years ago

Hello German, I was looking for documentation about the difference between bammarkduplicates and bammarkduplicates2 and did not see any on the help pages for those respective tools. Is there some description of this difference?

Kind regards, Jeff

gt1 commented 6 years ago

Hello Jeff,

The difference between these two tools is purely technical; the output should be equivalent. bammarkduplicates2 was designed to work more in memory than on disk.

Best, German

jeffbhasin commented 6 years ago

Hello German, thank you for the information. I ran bammarkduplicates2 and Picard Tools MarkDuplicates on some of the same RNA-seq samples, and the two programs did not produce the same duplicate calls. Is this expected? Is there a difference in how biobambam and Picard call duplicates?

Thanks, Jeff

gt1 commented 6 years ago

Hello Jeff,

Both bammarkduplicates2 and Picard's MarkDuplicates mark duplicates by finding read pairs that map to the reference in the same way. Within a set of pairs mapping in the same way, the one pair not marked as a duplicate is selected using a score computed from the base qualities of the reads involved. This score can be identical for several pairs, in which case any of them could be the "best" one. The two tools may resolve such ties differently, so different output is possible in this case. Different output in other cases may be a bug, so if you encounter it, please report it.

Best, German
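
The tie case described above can be illustrated with a small sketch. It is not the code of bammarkduplicates2 or Picard. The record layout, the `key` tuple, the plain sum-of-qualities score and the `prefer_last` switch are all assumptions made for illustration. It only shows that two otherwise identical procedures can mark different pairs when scores tie, and agree when they do not.

```python
from collections import defaultdict


def score(pair):
    # Assumed scoring: plain sum of base qualities over both reads of the pair.
    return sum(pair["quals1"]) + sum(pair["quals2"])


def mark_duplicates(pairs, prefer_last=False):
    """Return the names of pairs marked as duplicates.

    Pairs with the same (hypothetical) mapping key form one group; the
    highest-scoring pair in each group is kept. On a tie, the first pair seen
    is kept unless prefer_last is True.
    """
    groups = defaultdict(list)
    for p in pairs:
        groups[p["key"]].append(p)

    duplicates = set()
    for group in groups.values():
        best = group[0]
        for p in group[1:]:
            s, b = score(p), score(best)
            if s > b or (s == b and prefer_last):
                best = p
        duplicates.update(p["name"] for p in group if p is not best)
    return duplicates


# Hypothetical key: (chromosome, pos1, strand1, pos2, strand2)
k1 = ("chr1", 100, "+", 300, "-")
k2 = ("chr1", 500, "+", 700, "-")
k3 = ("chr2", 50, "+", 250, "-")

pairs = [
    {"name": "A", "key": k1, "quals1": [30, 30], "quals2": [20, 20]},  # score 100
    {"name": "B", "key": k1, "quals1": [25, 25], "quals2": [25, 25]},  # score 100 (tie with A)
    {"name": "C", "key": k2, "quals1": [10, 10], "quals2": [10, 10]},  # score 40
    {"name": "D", "key": k2, "quals1": [30, 30], "quals2": [30, 30]},  # score 120
    {"name": "E", "key": k3, "quals1": [30, 30], "quals2": [30, 30]},  # no duplicate partner
]

first_wins = mark_duplicates(pairs)
last_wins = mark_duplicates(pairs, prefer_last=True)

print(sorted(first_wins))              # pairs marked when ties keep the first pair
print(sorted(last_wins))               # pairs marked when ties keep the last pair
print(sorted(first_wins ^ last_wins))  # pairs on which the two runs disagree
```

Both runs mark the same number of duplicates and agree on the untied group (C is marked, D is kept); they differ only on which of the tied pairs A and B is marked. When comparing the output of two tools, disagreements of this kind, confined to groups with equal scores, are the expected ones.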