Imputing population heterogeneous datasets

albaicans commented 3 years ago

Hello,

We are trying to optimize our phasing and imputation pipeline, which uses Eagle and Minimac4 with the 1000 Genomes reference panel, and we would like your suggestions on the best strategy for imputing heterogeneous datasets.

Our dataset contains individuals with different ancestries in different proportions: most of the individuals have a European ancestry but we also have a smaller group of admixed European-African and even smaller groups of East Asian and African, as well as several individuals with admixed American ancestry. We are interested in using the imputed data in an association analysis including all ancestries.

Our first approach was to phase and impute all samples together, but we found that the imputation accuracy (based on R-squared distributions and alternate allele dosages) was not as good as with homogeneous datasets. We ran some tests imputing the different ancestries separately (still using the whole reference panel) and got better results for the populations with a large sample size (European and admixed European-African), but for the populations with a small sample size the picture is not clear. The overall accuracy of the variants that pass the quality filter (R-squared > 0.3) is higher when we impute them alone than when we impute them together with all the other samples, but we lose about half of the variants, probably because low MAF translates into low R-squared.

Based on your experience and your knowledge of the imputation algorithm and the calculation of the accuracy, what’s the best approach when phasing/imputing heterogeneous datasets? It looks like we are getting better results when imputing different populations separately but we are not sure how much a small sample size (let’s say 15 individuals) can affect the imputation result and the accuracy estimation.

Thank you in advance!

Best,

Alba

yukt commented 3 years ago

Hi Alba,

Minimac4 imputes each individual haplotype independently, so the imputation result of one sample will not be affected by other samples. The two approaches you mentioned should give you exactly the same results except for R-squared itself.

The R-squared output by minimac4 is an estimate of the imputation accuracy and is calculated from the imputed dosages of all input samples: R-squared = var(HDS) / (p(1 - p)), where p = mean(HDS) and HDS is the vector of haplotype dosages of the input samples at the marker. The only difference between the two approaches you mentioned is how the R-squared is calculated: the first approach calculates it over the haplotype dosages of all samples, while the second is equivalent to splitting that vector into pieces according to the ancestry of the samples and calculating a separate R-squared for each piece.

Therefore, the actual imputation accuracy will be the same no matter which approach you take, but the R-squared can be different.
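To make the point about the R-squared estimate concrete, here is a minimal Python sketch of the formula above. It assumes the population variance (dividing by n). The function name and the dosage values are invented for illustration and are not Minimac4 code or output.

```python
from statistics import fmean, pvariance


def estimated_rsq(hds):
    """Estimated R-squared = var(HDS) / (p * (1 - p)), with p = mean(HDS).

    Assumption of this sketch: population variance (divide by n).
    """
    p = fmean(hds)
    if p <= 0.0 or p >= 1.0:
        # Monomorphic in this sample set: the ratio is undefined.
        # Returning 0.0 is an arbitrary choice made for this sketch.
        return 0.0
    return pvariance(hds) / (p * (1.0 - p))


# Made-up haplotype dosages at one marker for two hypothetical groups.
group_a = [0.9, 0.1, 0.8, 0.2, 0.9, 0.1]  # allele common in this group
group_b = [0.1, 0.0, 0.1, 0.0]            # allele rare, dosages near zero

rsq_pooled = estimated_rsq(group_a + group_b)
rsq_a = estimated_rsq(group_a)
rsq_b = estimated_rsq(group_b)

print(f"pooled: {rsq_pooled:.3f}  group A: {rsq_a:.3f}  group B: {rsq_b:.3f}")
```

With these made-up values the dosages are identical in both scenarios, yet the marker passes an R-squared > 0.3 filter when the groups are pooled and fails it when group B is evaluated alone. That is the kind of variant loss a small, low-MAF subgroup can show even though the imputed dosages have not changed.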

Best,

Ketian


albaicans commented 3 years ago

Hi Ketian,

Thank you very much for your reply. I also thought the difference would only be in the estimated R-squared, so I compared the dosages of the samples of interest under the two strategies (imputed alone or together with the other samples). As a measure of uncertainty, I calculated each dosage's distance to the closest integer, took the median distance for each variant, and then plotted the distribution of the medians. I got better accuracy (more variants with a median distance close to 0) when the samples had been imputed alone. I realize there could be slight differences between the results of different imputation runs, but the imputation of homogeneous populations was always more accurate in this sense.

I'm thinking that the difference could also come from the phasing step. We used Eagle with the 1000 Genomes reference panel for phasing before imputation, and we tested the whole phasing-imputation pipeline with the same sets of samples. Do you know whether the phasing result can be affected by which samples are phased together, and whether this translates into different imputation results?

Sorry, you might not be the right person to ask this. Thank you!

Best,

Alba
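The uncertainty measure described in this comment can be sketched in a few lines of Python. This is only an illustration of the idea: the function names, variant IDs, and dosage values are hypothetical, and genotype dosages are assumed to lie between 0 and 2.

```python
from statistics import median


def distance_to_nearest_integer(dosage):
    """Distance from a genotype dosage (0..2) to the closest integer genotype."""
    return abs(dosage - round(dosage))


def median_distance_per_variant(dosages_by_variant):
    """Map each variant ID to the median distance across its samples."""
    return {
        variant: median(distance_to_nearest_integer(d) for d in dosages)
        for variant, dosages in dosages_by_variant.items()
    }


# Hypothetical dosages: variant ID -> one dosage per sample.
example = {
    "var1": [0.02, 1.98, 1.01, 0.00],  # dosages close to 0, 1 or 2
    "var2": [0.45, 1.30, 0.70, 1.55],  # dosages far from any integer
}

medians = median_distance_per_variant(example)
for variant, value in medians.items():
    print(variant, round(value, 3))
```

Note that this quantity reflects how close to certain the imputed dosages are, not whether they agree with the true genotypes, so it is a proxy for accuracy rather than a direct measurement of it.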

yukt commented 3 years ago

Thank you for your clarification. Eagle may augment the reference panel with inferred target haplotypes. I believe this feature is triggered by default when the number of target samples is larger than half of the reference sample size, so if your sample size >= 1252 when phasing with 1000G, the results could be affected by samples phased together, which may decrease the accuracy for non-European samples (given that your samples are dominantly European). You could turn this feature off by setting --pbwtIters 1 when running eagle2. However, I am not the right person to ask about phasing. You may need to consult and confirm these details with the author of Eagle2.
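As a rough sketch of the suggestion above, the following Python snippet builds (but does not run) an Eagle2 command line with `--pbwtIters 1`. Only `--pbwtIters` comes from this thread. The other option names are assumed to match the Eagle2 documentation and should be checked against the manual for your version, and all file names are hypothetical. The helper encoding the "half of the reference size" rule simply restates the belief expressed in this reply and has not been verified against Eagle itself.

```python
import shlex


def reference_may_be_augmented(n_target, n_reference):
    """Rule of thumb quoted in the reply above (not verified against Eagle)."""
    return 2 * n_target >= n_reference


def eagle_command(vcf_ref, vcf_target, genetic_map, out_prefix, pbwt_iters=1):
    """Build an Eagle2 argument list; nothing is executed here."""
    return [
        "eagle",
        "--vcfRef", vcf_ref,              # assumed option name, check the manual
        "--vcfTarget", vcf_target,        # assumed option name, check the manual
        "--geneticMapFile", genetic_map,  # assumed option name, check the manual
        "--outPrefix", out_prefix,        # assumed option name, check the manual
        "--pbwtIters", str(pbwt_iters),   # option named in this thread
    ]


# Hypothetical file names for a single chromosome.
cmd = eagle_command(
    "ref.chr20.bcf",
    "target.chr20.vcf.gz",
    "genetic_map.txt.gz",
    "target.chr20.phased",
)
print(shlex.join(cmd))
print(reference_may_be_augmented(1252, 2504))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues once the options have been confirmed.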

Best,

Ketian

albaicans commented 3 years ago

Thanks a lot, this was very useful.

Best,

Alba

albaicans commented 2 years ago

Hi again,

I just wanted to post a follow-up on this issue in case someone reads it in the future. As stated by the Eagle2 author, the phasing algorithm will indeed produce different results depending on which samples are phased together, but the general rule of thumb is that phasing samples together tends to be no worse (and usually better) than separating samples by ancestry.

After finding a mistake in my code, I reran the tests comparing phasing all samples together with phasing them by ancestry cluster and got similar results, with no significant increase in imputation accuracy when phasing separately. Consequently, we ended up phasing all samples together.

Sorry for the confusion!

Alba

Shicheng-Guo commented 2 years ago

I was wondering whether you could share a bash or demo script showing how to run eagle2 + minimac4 for phasing and imputation? Thanks.

statgen / Minimac4

Imputing population heterogeneous datasets #44