statgen / Minimac4

The imputed vcf has far more genotypes than the target phased vcf and missed rsID. #60

Closed huww1998 closed 1 year ago

huww1998 commented 1 year ago

My target VCF: [image]. Imputed VCF: [image].

Before running Minimac v4.1.2, I phased my target VCF with SHAPEIT2. I also used --all-typed-sites, but it does not seem to have any effect. My target VCF (chromosome 1 only) has 19,020 variants, but the imputed VCF unexpectedly has 3,752,110 variants. My reference panel is the downloaded 1000 Genomes Phase 3 (version 5) panel; for chromosome 1 I used 1.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz. The command I ran is shown below.

[image: command]

I also tried to subset the imputed VCF using the chr:pos information from my target VCF, but that returned 33,019 variants rather than the 19,020 in the original target VCF. This result confuses me, and I don't know what the problem is.

Another problem is that rsIDs are missing from the imputed VCF. Should I set --sites, --min-r2, or other parameters to solve these problems?

jonathonl commented 1 year ago

Can you clarify what you believe is wrong with the results you are getting? 3,752,110 variants seems reasonable for 1000g chromosome 1. Note that, in addition to imputing missing genotypes, Minimac also imputes variants that exist in the reference panel but not in the target VCF.
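To see how a record count like this breaks down, one could tally the imputed records by their INFO flags. The following is a minimal sketch, assuming the output marks records with TYPED, TYPED_ONLY, and IMPUTED flags in the INFO column; the helper name `tally_info_flags` and the sample records are made up for illustration and are not part of Minimac4.

```python
from collections import Counter

def tally_info_flags(vcf_lines, flags=("TYPED", "TYPED_ONLY", "IMPUTED")):
    """Count VCF records carrying each INFO flag (hypothetical helper)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        # INFO is the 8th column; keep only the key of each key=value entry
        info_keys = {entry.split("=", 1)[0] for entry in fields[7].split(";")}
        counts["total"] += 1
        for flag in flags:
            if flag in info_keys:
                counts[flag] += 1
    return counts

# Made-up records, for illustration only
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t1:1000\tA\tG\t.\tPASS\tAF=0.1;R2=0.9;IMPUTED",
    "1\t2000\t1:2000\tC\tT\t.\tPASS\tAF=0.2;R2=1;ER2=0.95;TYPED",
    "1\t3000\t1:3000\tG\tA\t.\tPASS\tAF=0.3;TYPED_ONLY",
]
counts = tally_info_flags(example)
print(dict(counts))
```

On a real file, the lines could be read from the decompressed VCF (for example with `gzip.open(path, "rt")`). The records without a typed flag are the ones that come only from the reference panel.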

jonathonl commented 1 year ago

The ID column in the imputed results comes from the reference panel. The reference panel you are using isn't annotated with rsIDs; it uses {chrom}:{pos} as the identifier instead.
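If rsIDs are needed downstream, they would have to be added after imputation from a separate annotation source. The sketch below shows the idea only, assuming a lookup table keyed by chromosome, position, and both alleles; `assign_rsids`, the lookup table, and the rsID strings are hypothetical placeholders, not real identifiers.

```python
def assign_rsids(records, rsid_lookup):
    """Swap in an rsID when the exact (chrom, pos, ref, alt) key is known.

    Records without a match keep their existing {chrom}:{pos} style ID.
    """
    annotated = []
    for chrom, pos, variant_id, ref, alt in records:
        new_id = rsid_lookup.get((chrom, pos, ref, alt), variant_id)
        annotated.append((chrom, pos, new_id, ref, alt))
    return annotated

# Made-up data: the rsID strings are placeholders, not real identifiers
rsid_lookup = {("1", 1000, "A", "G"): "rs_placeholder_1"}
records = [
    ("1", 1000, "1:1000", "A", "G"),   # present in the lookup
    ("1", 1000, "1:1000", "A", "AT"),  # same position, different allele
]
annotated = assign_rsids(records, rsid_lookup)
print(annotated)
```

Keying on the alleles as well as the position avoids giving the same rsID to two different variants that share a position. For real files, a tool such as bcftools annotate with an rsID-annotated VCF is likely the more practical route.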

huww1998 commented 1 year ago

Sorry, I had always assumed the imputed VCF would include only the variants in the target VCF. If Minimac also imputes variants that exist only in the reference panel, then the imputed VCF is probably right. I also understand now why the rsIDs don't come from the target VCF. Thank you very much for your reply. However, I still can't recover the same number of variants from the imputed VCF when I subset it with a TXT file of chr/pos values from my target VCF (I used bcftools v1.7). If I want an imputed VCF containing only the variants that exist in the target VCF, am I on the right track?

jonathonl commented 1 year ago

You can achieve this by running bcftools view -i "INFO/TYPED=1" imputed.vcf.gz -Oz -o imputed.typed_variants.vcf.gz.

There can be multiple variant records for a given position, which is why filtering by chrom:position doesn't work.
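The effect of multiple records per position can be illustrated with a small sketch. It assumes variants are represented as (chrom, pos, ref, alt) tuples; the function names and the data are made up for illustration.

```python
def subset_by_position(imputed, target):
    """Keep imputed records whose (chrom, pos) appears in the target."""
    positions = {(chrom, pos) for chrom, pos, _ref, _alt in target}
    return [rec for rec in imputed if (rec[0], rec[1]) in positions]

def subset_by_alleles(imputed, target):
    """Keep imputed records whose full (chrom, pos, ref, alt) key matches."""
    keys = set(target)
    return [rec for rec in imputed if rec in keys]

# Made-up variants: two target sites, five imputed records
target = [("1", 1000, "A", "G"), ("1", 2000, "C", "T")]
imputed = [
    ("1", 1000, "A", "G"),
    ("1", 1000, "A", "AT"),  # reference-panel indel at a typed position
    ("1", 1500, "G", "C"),   # reference-panel-only position
    ("1", 2000, "C", "T"),
    ("1", 2000, "C", "G"),   # second alternate allele at a typed position
]

by_pos = subset_by_position(imputed, target)
by_alleles = subset_by_alleles(imputed, target)
print(len(by_pos), len(by_alleles))
```

Matching on position alone returns more records than the target contains, which is consistent with getting 33,019 records from 19,020 target variants. Matching on the full allele key is stricter, but it can still miss sites whose alleles are represented differently in the two files, so filtering on the TYPED flag as suggested above is the more direct approach.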

huww1998 commented 1 year ago

Ok. Thank you very much for your guidance. It's very helpful for me.