lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
482 stars 133 forks source link

Missing read from cmpbams output? #151

Closed lckarssen closed 4 years ago

lckarssen commented 4 years ago

Subject of the issue

When using cmpbams it looks like the output misses some reads. For example, when comparing a CRAM file to itself, I see the read phase lists:

[INFO][CompareBams]. Completed. N=102,908,280. That took:8 minutes

for both files. This is consistent with the output of samtools idxstats. However, if I count number of output lines I get from cmpbams I get 102380853.

Your environment

Steps to reproduce

compbams was run like this:

java -jar ~/tmp/jvarkit/dist/cmpbams.jar -c -F  -R /path/to/reference.fa \
    /path/to/file.cram \
    /path/to/file.cram | wc -l

This happened both for a comparison of two full CRAM files and for a comparison of part of a chromosome.

Expected behaviour

The total number of reads reported by samtools idxstats and/or the read phase of cmpbams should be identical.

Actual behaviour

The two numbers of reads differ (the number of lines in the cmpbams output is lower).

lindenb commented 4 years ago

secondary+supplementary+unmapped are discarded.

lindenb commented 4 years ago

idea: use comm to find the missing reads ?

lckarssen commented 4 years ago

Thanks for the explanation about the missing reads. I had a look at the code and I think unmapped reads are included by default as well (see the if-block starting at line 620).

lindenb commented 4 years ago

As yes, sorry, I wrote this yearsss ago :-)