odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License

Silent imputation of missing variants #51

Open JosephLalli opened 1 year ago

JosephLalli commented 1 year ago

Hello,

As an experiment, I've tried taking the VCF of the draft human pangenome, unphasing it with bcftools +setGT, and then rephasing it using different reference panels. I am using shapeit5's switch tool to assess phasing accuracy, with the rephased dataset as the estimated VCF and the original pangenome VCF as the verification VCF. The advantage of this method is that it eliminates genotyping as a source of error, since the input VCF was generated from the verification VCF.

Switch, however, is detecting sporadic genotyping errors. These errors occur at sites with a ./1 genotype.

These are variants with a combined indel/SNP, like so:

Ref:
AATCGTCTGTC
Sample:
AA------GTC
AATTGTCTGTC

After using bcftools norm to split multiallelic sites, the pangenome VCF represents this region as:

chr1 2 ATTGTCT A 1|0
chr1 4 T C .|1

Shapeit seems to be interpreting the .|1 call as a missing allele, and imputing the genotype at the site as 0/0. I don't think that is the intended behavior.
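To find every site affected by this before phasing, something like the following could list genotypes where exactly one allele is missing. This is only a minimal sketch in plain Python: it assumes an uncompressed, tab-delimited VCF read as text lines, and the function name `half_missing_sites` and the sample name `HG0001` are made up for illustration.

```python
def half_missing_sites(vcf_lines):
    """Yield (chrom, pos, sample, gt) for genotypes with some, but not all, alleles missing."""
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        for i, entry in enumerate(fields[9:]):
            gt = entry.split(":")[gt_idx]
            # Treat phased and unphased separators the same way
            alleles = gt.replace("|", "/").split("/")
            n_missing = alleles.count(".")
            if 0 < n_missing < len(alleles):
                name = samples[i] if i < len(samples) else f"sample{i}"
                yield fields[0], int(fields[1]), name, gt


# Toy input built from the two records above (hypothetical sample name)
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG0001",
    "chr1\t2\t.\tATTGTCT\tA\t.\t.\t.\tGT\t1|0",
    "chr1\t4\t.\tT\tC\t.\t.\t.\tGT\t.|1",
]
for hit in half_missing_sites(example):
    print(hit)
```

On the toy input, only the chr1:4 record is reported, since it is the only genotype with a single missing allele.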

What is the best way to handle sites like this? If I used the atomize option of bcftools norm, I could represent the deletion allele as "*". Would shapeit recognize this?

Thanks!

hudja commented 1 year ago

Hello, a similar question here. Is it possible to force shapeit not to impute missing variants?

JosephLalli commented 1 year ago

To clarify, I understand that imputation is a necessary part of the phasing process. I think it would be helpful if the SHAPEIT team provided one small feature, and some guidance:

1) A FORMAT tag to indicate when a missing allele has been imputed (ideally with some sort of imputation quality metric, but I understand that would be difficult if such a metric isn't already being measured internally)

2) A recommended method of handling sites when one allele is in an already represented deletion. This is often encountered when one allele is a structural variant. If there is no recommended method, in general how would you recommend we use shapeit to phase structural variants?
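In the meantime, a possible workaround for point 1 is to compare genotypes before and after phasing and flag the sites where a missing allele was filled in. The sketch below is not a SHAPEIT5 feature, just a post-hoc check. It assumes the genotypes for one sample have already been loaded into dictionaries keyed by (chrom, pos, ref, alt); the names `flag_imputed` and `split_gt` are hypothetical.

```python
def split_gt(gt):
    """Split a GT string such as './1' or '0|1' into its alleles."""
    return gt.replace("|", "/").split("/")


def flag_imputed(input_gts, phased_gts):
    """Return {site: (input_gt, phased_gt)} for sites where a missing allele was filled in.

    Both arguments map (chrom, pos, ref, alt) to a GT string for a single sample.
    """
    flagged = {}
    for site, before in input_gts.items():
        after = phased_gts.get(site)
        if after is None:
            continue  # site absent from the phased output
        if "." in split_gt(before) and "." not in split_gt(after):
            flagged[site] = (before, after)
    return flagged


# Toy genotypes matching the example in this issue
before = {("chr1", 2, "ATTGTCT", "A"): "0/1", ("chr1", 4, "T", "C"): "./1"}
after = {("chr1", 2, "ATTGTCT", "A"): "1|0", ("chr1", 4, "T", "C"): "0|0"}
print(flag_imputed(before, after))
```

The flagged sites could then be reset to missing or excluded before running the switch comparison, depending on what the analysis needs.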