odelaneau / GLIMPSE

Low Coverage Calling of Genotypes
MIT License

Per-sample imputation quality score? #155

Open biona001 opened 1 year ago

biona001 commented 1 year ago

Hello,

Thanks for the awesome software!

Is there any way to measure imputation quality on a per-sample basis from the output of GLIMPSE1? I have already filtered out all variants with INFO scores below 0.7, but I wonder whether some samples might be poorly imputed (e.g. because their ancestry does not match the reference panel well) and should therefore be discarded.

srubinacci commented 1 year ago

Hi Benjamin,

Nice e-meeting you - I recall reading about the imputation tool you developed, very nice and original work!

A couple of pointers here. First, I think it is a bit unclear how the INFO score should be interpreted in the context of lc-WGS imputed from large reference panels. In practice, I almost never filter my data on INFO score (for association testing, for example); I simply filter on MAF. Of course, this probably depends on the application. Other studies in the context of aDNA have proposed a filter based directly on GP, e.g. (https://www.nature.com/articles/s41598-020-75387-w), but this can introduce a pattern of missingness that differs between samples. To solve this, they run SNP array imputation on the filtered sites. While the reasoning behind this is sound, I don't think that throwing away read-level information will do better than imputing from the reads, and indeed we proceeded differently in our work on aDNA data from very diverse populations (https://www.biorxiv.org/content/10.1101/2022.07.19.500636v1). Finally, a recent study on non-human data suggested that the INFO score might not work very well in the case of lc-WGS: https://gsejournal.biomedcentral.com/articles/10.1186/s12711-023-00809-y. I can only give you pointers, as I don't think a definitive answer has emerged yet.

Going to your question: I am not aware of any sample-level filters available so far in this context, but the patterns of GP and heterozygosity are indeed good indicators of how well the imputation ran. However, these of course depend on the original sequencing coverage, which introduces some complexity.
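As a rough illustration of what such per-sample summaries could look like, here is a minimal sketch. It assumes the GP triplets have already been parsed from the imputed VCF/BCF into a nested list `gp[variant][sample] = (p00, p01, p11)`; the function name and the two summary statistics are hypothetical choices for illustration, not something GLIMPSE provides.

```python
from statistics import mean


def per_sample_summaries(gp):
    """Summarise genotype probabilities per sample.

    gp[v][s] is assumed to be a (p00, p01, p11) triplet for
    variant v and sample s (hypothetical in-memory layout).
    """
    n_variants = len(gp)
    n_samples = len(gp[0])
    summaries = []
    for s in range(n_samples):
        probs = [gp[v][s] for v in range(n_variants)]
        summaries.append({
            # Average confidence of the most likely genotype.
            "mean_max_gp": mean(max(p) for p in probs),
            # Average posterior probability of being heterozygous.
            "expected_het": mean(p[1] for p in probs),
        })
    return summaries


# Toy input: 2 variants x 2 samples.
toy_gp = [
    [(1.0, 0.0, 0.0), (0.4, 0.3, 0.3)],
    [(0.0, 1.0, 0.0), (0.3, 0.4, 0.3)],
]
toy_summaries = per_sample_summaries(toy_gp)
```

Because both quantities depend on sequencing coverage, they are probably most informative when compared between samples of similar depth.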

Hope this helps,

Simone

biona001 commented 1 year ago

Hi Simone,

Very nice to e-meet you too, I'm flattered to hear you've read my paper.

Just to confirm: after imputation, are you suggesting that I filter variants based on MAF regardless of their INFO scores? I thought the INFO score indicates imputation quality, so why wouldn't you remove poorly imputed variants? My intended application is indeed association testing.

Based on your tips, for a sample-level filter, maybe something like an averaged Gini-style index could be useful, e.g. Gini(SNP) = sum_i p_i^2, where p_i is the genotype probability for 0/0, 0/1, and 1/1. The best score for a variant is then 1 (all probability on one genotype) and the worst is 1/3 (uniform probabilities). I could average this score over all variants within a subject and apply some arbitrary cutoff. I'm not sure yet, but I'll give it a try.
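A minimal sketch of that idea, assuming the GP triplets are already loaded as `gp[variant][sample] = (p00, p01, p11)`. The function names and the cutoff value are placeholders for illustration only.

```python
def variant_score(p):
    """Sum of squared genotype probabilities: 1 is best, 1/3 is worst."""
    return sum(x * x for x in p)


def per_sample_score(gp):
    """Average the per-variant score over all variants for each sample."""
    n_variants = len(gp)
    n_samples = len(gp[0])
    return [
        sum(variant_score(gp[v][s]) for v in range(n_variants)) / n_variants
        for s in range(n_samples)
    ]


def flag_samples(scores, cutoff):
    """Return indices of samples whose average score is below the cutoff."""
    return [i for i, score in enumerate(scores) if score < cutoff]


# Toy input: 2 variants x 2 samples.
toy_gp = [
    [(1.0, 0.0, 0.0), (0.4, 0.3, 0.3)],
    [(0.0, 1.0, 0.0), (0.3, 0.4, 0.3)],
]
scores = per_sample_score(toy_gp)
flagged = flag_samples(scores, cutoff=0.5)  # cutoff is arbitrary
```

The cutoff would need to be chosen empirically, and since GP sharpness depends on sequencing coverage, the scores are probably only comparable between samples of similar depth.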

ben