odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License
61 stars 9 forks source link

phase_rare might be filling INFO/AC incorrectly #9

Closed 23andme-jaredo closed 1 year ago

23andme-jaredo commented 1 year ago

For example try:

../SHAPEIT5_phase_common --input 10k/msprime.nodup.bcf --filter-maf 0.001  --output 10k/msprime.common.phased.bcf --region  1:4000001-5000000 --thread 8 
bcftools index 10k/msprime.common.phased.bcf --threads 8

#chunk1
../SHAPEIT5_phase_rare --input-plain 10k/msprime.nodup.bcf --scaffold 10k/msprime.common.phased.bcf --output 10k/msprime.rare.chr1_1_1500000.bcf --scaffold-region 1:4000000-5000000 --input-region 1:4500000-4600000 --thread 8
bcftools index 10k/msprime.rare.chr1_1_1500000.bcf --threads 8

I am seeing:

[jaredo@download1 test]$ bcftools view -H -G 10k/msprime.rare.chr1_1_1500000.bcf  | head
1   4500006 .   0   1   .   .   AC=446
1   4500015 .   0   1   .   .   AC=2
1   4500042 .   0   1   .   .   AC=8
1   4500111 .   0   1   .   .   AC=1
1   4500157 .   0   1   .   .   AC=1
1   4500174 .   0   1   .   .   AC=1
1   4500191 .   0   1   .   .   AC=4
1   4500388 .   0   1   .   .   AC=1
1   4500394 .   0   1   .   .   AC=1
1   4500474 .   0   1   .   .   AC=1
[jaredo@download1 test]$ bcftools +fill-tags 10k/msprime.rare.chr1_1_1500000.bcf -- -t AC | bcftools view -H -G | head
1   4500006 .   0   1   .   .   AC=223
1   4500015 .   0   1   .   .   AC=2
1   4500042 .   0   1   .   .   AC=8
1   4500111 .   0   1   .   .   AC=1
1   4500157 .   0   1   .   .   AC=1
1   4500174 .   0   1   .   .   AC=1
1   4500191 .   0   1   .   .   AC=4
1   4500388 .   0   1   .   .   AC=1
1   4500394 .   0   1   .   .   AC=1
1   4500474 .   0   1   .   .   AC=1

I think what is happening is that the variants in the scaffold are having their allele count doubled.

Maybe it is some interaction here:

$ find phase_rare/ -name '*.cpp' -exec grep -Hn -i count_alt {} \;
phase_rare//src/io/haplotype_writer.cpp:88:         int count_alt = 0;
phase_rare//src/io/haplotype_writer.cpp:96:                 count_alt += 2 * major_allele;
phase_rare//src/io/haplotype_writer.cpp:104:                    count_alt -= 2 * major_allele;
phase_rare//src/io/haplotype_writer.cpp:105:                    count_alt += a0+a1;
phase_rare//src/io/haplotype_writer.cpp:112:                    count_alt += a0+a1;
phase_rare//src/io/haplotype_writer.cpp:115:                    count_alt += a0+a1;
phase_rare//src/io/haplotype_writer.cpp:119:            bcf_update_info_int32(hdr, rec, "AC", &count_alt, 1);
23andme-jaredo commented 1 year ago

PS. I couldn't see any issue with actual genotypes - the re-computed allele counts are consistent with the original input BCF. So this appears mainly cosmetic.

zyhuang commented 1 year ago

It looks like the issue impacts only the common ones, and removing one of the count_alt on these two lines fixed it. https://github.com/odelaneau/shapeit5/blob/main/phase_rare/src/io/haplotype_writer.cpp#L112-L115

                    count_alt += a0+a1;
                    genotypes[2*i+0] = bcf_gt_phased(a0);
                    genotypes[2*i+1] = bcf_gt_phased(a1);
                    count_alt += a0+a1;
odelaneau commented 1 year ago

Thanks for reporting this. I added the fix.

23andme-jaredo commented 1 year ago

thanks!