vibansal / HapCUT2

software tools for haplotype assembly from sequence data
BSD 2-Clause "Simplified" License

calculate_haplotype_statistics.py slightly differs from block headers #124

Open magnitov opened 2 years ago

magnitov commented 2 years ago

Hi @vibansal, I have a question about the calculate_haplotype_statistics.py script. I noticed that the phased count and num snps max blk values reported by the script differ from the values in the BLOCK headers of my .hap file. For instance, if I sum the number of phased SNVs across all blocks and check the number of SNVs in the largest block of the .hap file, I get slightly different counts from the script output.

If I sum the phased field over all blocks, I get 189701. The header of my largest block is as follows:

BLOCK: offset: 12 len: 189252 phased: 188348 SPAN: 248704444 fragments 663113
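For reference, the header totals can be reproduced with a short script along the lines of the sketch below. It assumes every block header has the same layout as the line above and that the "largest block" is the one with the highest phased value. The name summarize_block_headers is made up for this illustration and is not part of HapCUT2.

```python
import re

# Matches a block header of the form shown in this issue, e.g.
# BLOCK: offset: 12 len: 189252 phased: 188348 SPAN: 248704444 fragments 663113
# and captures the integer after "phased:".
HEADER_RE = re.compile(r"^BLOCK:.*?\bphased:\s*(\d+)")


def summarize_block_headers(lines):
    """Return (sum of 'phased' over all headers, largest 'phased' value).

    `lines` is any iterable of strings, such as an open .hap file.
    Lines that are not block headers are ignored.
    Hypothetical helper for illustration only.
    """
    phased_counts = []
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            phased_counts.append(int(match.group(1)))
    if not phased_counts:
        return 0, 0
    return sum(phased_counts), max(phased_counts)
```

It can be called on an open file, for example `summarize_block_headers(open("sample.hap"))`, where sample.hap is a placeholder file name.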

However, calculate_haplotype_statistics.py, run with the -i option, reports the following numbers:

phased count: 188484 num snps max blk: 188057

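That is 1217 fewer phased SNVs in total (189701 vs. 188484) and 291 fewer in the largest block (188348 vs. 188057). Is there some kind of filter implemented in the script that causes this?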

Best, Mikhail