high quality batch level variants resulting in low quality merged variant.

batzler commented 5 days ago

We are finding examples where batch level R2 > 0.3 for all batches but the output R2 from hds-util is < 0.3 and thus the variants do not end up in the output data when using the -m 0.3 option. While this is a very small number of variants (just a few out of 22K HLA imputed variants) we are curious how this can happen in the calculation of the merged R2. Thanks for any explanation you can provide.

Example

hds-util -f GT,HDS,GP,DS -m 0.3 data1.dose.vcf.gz data2.dose.vcf.gz data3.dose.vcf.gz data4.dose.vcf.gz -O vcf.gz > merged.dose.vcf.gz

data1 AA_B_167_31323991_exon3_SW 6 31324002 AA_B_167_31323991_exon3_SW A T . .IMPUTED;AF=0.999851;MAF=0.00014919;AVG_CS=0.999931;R2=0.543911

data2 [batzler@mforgehn5 TAPphase1QCimp]$ zcat v4/HLA/HLAoutput/2/chr6.info.gz | grep AA_B_167_31323991_exon3_SW 6 31324002 AA_B_167_31323991_exon3_SW A T . .IMPUTED;AF=0.99989;MAF=0.000109911;AVG_CS=0.999933;R2=0.402355

data3 6 31324002 AA_B_167_31323991_exon3_SW A T . .IMPUTED;AF=0.999909;MAF=9.11355e-05;AVG_CS=0.999944;R2=0.384272

data4 6 31324002 AA_B_167_31323991_exon3_SW A T . .IMPUTED;AF=0.999825;MAF=0.000174999;AVG_CS=0.999938;R2=0.646454

merged CHROM POS SNP REF ALT N ALT_Frq MAF minRsq minEmpRsq vtype RSQ

1 6 31324002 AA_B_167_31323991_exon3_SW A T 4 1.00 0.000131 0.384 NA AA NA Thanks

jonathonl commented 5 days ago

Can you please proved the INFO fields for the merged variant record as you did for data[1-4]?

batzler commented 5 days ago

Apologies for the confusion. When I ran using "-m 0.3" the hds-util did not provide the output for this variant. I just ran the merge for this variant separately removing the -m filter and this is the result.

6 31324002 AA_B_167_31323991_exon3_SW A T . . IMPUTED;AF=0.99989;MAF=0.00010 9732;AVG_CS=0.999958;R2=0.174244

jonathonl commented 4 days ago

Thanks.

The R2 field is an estimated r-squared based on the mean the HDS values. It's plausible for the chunked data to "fit" their respective means better than the merged data. This is also plausible in empirical r-squared. In an extreme example where the x is unimodal and the y is bimodal due to two underlying distributions with different means, the r-squared of the combined data could be much worse than if you were to compute r-squared on the two underlying distributions separately.

I just double checked that the R2 INFO field was being computed correctly by separating an imputed VCF into two sample sets and then merging back together with hds-util. This will give equivalent results (barring precision errors).

statgen / hds-util

high quality batch level variants resulting in low quality merged variant. #4