odelaneau / GLIMPSE

Low Coverage Calling of Genotypes
MIT License

Impute samples together or individually #69

Closed by PengZhangJHU 4 months ago

PengZhangJHU commented 2 years ago

Hi Simone,

You showed imputation of one sample in the tutorial for chr22. For chrX, you showed imputation of two samples, where you merged the GL calls (step3) for the two samples before the imputation step (step5). Do you always recommend imputing all the test samples together when possible? Or does it make little difference to imputation accuracy if each sample is imputed individually? Thank you!

srubinacci commented 2 years ago

Hi,

For efficiency reasons, I would impute many samples at the same time, as this amortises the constant costs of creating internal structures and reading files. Batches of >= 100 samples seem to come close to optimising the process. We showed this in Supp. Figure 11B (https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-020-00756-0/MediaObjects/41588_2020_756_MOESM1_ESM.pdf).

It is also true that pooling samples allows GLIMPSE to benefit from the joint model: the other target samples effectively "increase" the size of the reference panel (if the reference panel is not big and the targets are all from the same population). We show this to some extent in Supp. Figure 2.

Therefore, my recommendation is to use more samples, if you can, mainly for efficiency reasons. You likely won't see much difference in terms of accuracy, though.
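
The amortisation argument can be illustrated with a toy cost model. This is only a sketch: the function name and the numbers are hypothetical and are not GLIMPSE measurements. It assumes each run pays one fixed cost (building internal structures, reading files) plus a constant cost per sample.

```python
def per_sample_cost(batch_size: int, fixed_cost: float, marginal_cost: float) -> float:
    """Average cost per sample when one run imputes `batch_size` samples.

    fixed_cost    -- cost paid once per run (hypothetical, e.g. seconds)
    marginal_cost -- extra cost for each additional sample (hypothetical)
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_cost / batch_size + marginal_cost


# Hypothetical numbers, chosen only to show the shape of the curve.
FIXED = 600.0
MARGINAL = 30.0

for n in (1, 10, 100, 1000):
    cost = per_sample_cost(n, FIXED, MARGINAL)
    print(f"batch={n:>4}  cost per sample={cost:.1f}")
```

Under these assumed numbers the per-sample cost falls quickly up to about 100 samples and flattens afterwards, which is the qualitative behaviour described above.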

Hope this helps,

Simone

PengZhangJHU commented 2 years ago

Thank you Simone, this is very helpful. Right now I have the pipeline set up to run each sample separately in parallel. I will try the batch run later when I have projects with more samples.

biona001 commented 11 months ago

Sorry to revive this thread.

For the GLIMPSE2 model, is it still true that including more samples per batch results in better overall runtime? I was under the impression that GLIMPSE2 is designed to perform imputation sample by sample. Is this correct?

ben

srubinacci commented 11 months ago

Hi,

Indeed, GLIMPSE2 imputes each sample against the reference panel, but you still have to account for some "fixed" costs involved in processing the reference panel in each iteration (e.g. PBWT). In Supp. Fig. 6C of the GLIMPSE2 paper we show the scaling with the number of target samples. A batch size of ~100 samples seems to be fairly optimal.

Hope this helps

Simone

NahlaHu commented 7 months ago

Hello,

We intend to use GLIMPSE for validation of low-coverage sequence data. Our species of interest carries introgressed alleles, which means we have triallelic SNPs. Will this tool work in this case?

ssinh10 commented 5 months ago

@srubinacci Hi, sorry for reviving this thread, but I have a very similar question. Can we use the "Merging genotype likelihoods of multiple individuals" steps from https://odelaneau.github.io/GLIMPSE/glimpse1/tutorial_b38.html#run_likelihoods?

I was under the impression that GLIMPSE2 can only be used for sample-by-sample imputation.

I am working on more than 100 samples. Would you suggest following the chr22 tutorial for GLIMPSE2, or is it better to create a merged sample file with GLs before imputing? Thank you for any insights and clarification you can provide.

srubinacci commented 5 months ago

@NahlaHu GLIMPSE would handle triallelic SNPs by requiring you to split them into biallelic records. Of course, this might not be the best behaviour for your use case.
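
A minimal sketch of what splitting a multiallelic site into biallelic records means at the site level. The function name is hypothetical, and the sketch deliberately ignores genotype and likelihood fields, which also have to be rewritten in a real VCF. In practice a dedicated tool such as `bcftools norm` with its `-m` option is normally used for this.

```python
from typing import List, Tuple

# (CHROM, POS, REF, ALT) for a single biallelic record
Site = Tuple[str, int, str, str]


def split_multiallelic(chrom: str, pos: int, ref: str, alt: str) -> List[Site]:
    """Turn one site with comma-separated ALT alleles into one record per ALT.

    Site-level only: per-sample fields (GT, GL, PL, ...) are not handled here.
    """
    return [(chrom, pos, ref, allele) for allele in alt.split(",")]


# A triallelic SNP (REF=A, ALT=C,T) becomes two biallelic records.
for record in split_multiallelic("chr1", 1000, "A", "C,T"):
    print(record)
```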

@ssinh10 GLIMPSE2 imputes only against a reference panel (no other imputed sample is used to improve imputation of the others). However, as I mentioned above, for computational reasons, due to the processing of the reference panel (and here I am mainly talking about very large, deeply sequenced panels, such as the UK Biobank), you might benefit from pooling your samples. Also, by pooling samples you will get a single VCF/BCF file as output, which can be convenient.

The optimal value depends on the time spent processing the reference panel, PBWT computation, etc. My rule of thumb is to pool at least 100 samples together if possible.
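
One way to apply this rule of thumb is to group the sample list into batches before merging and imputing. The helper below is a hypothetical sketch in plain Python; it only groups sample IDs and does not call GLIMPSE. The default of 100 comes from the rule of thumb above.

```python
from typing import List, Sequence


def make_batches(samples: Sequence[str], batch_size: int = 100) -> List[List[str]]:
    """Split sample IDs into consecutive batches of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(samples[start:start + batch_size])
        for start in range(0, len(samples), batch_size)
    ]


# Hypothetical cohort of 250 samples.
sample_ids = [f"sample_{i:03d}" for i in range(250)]
for index, batch in enumerate(make_batches(sample_ids)):
    print(f"batch {index}: {len(batch)} samples")
```

Note that the last batch can be smaller than the target size. Since the advice is to pool at least 100 samples, a small remainder may be better folded into the previous batch.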

A small note on this: you are looking at a tutorial for GLIMPSE1. GLIMPSE1 actually uses all samples for the imputation (contrary to GLIMPSE2).

ssinh10 commented 5 months ago

Thanks @srubinacci, I understand that the link is for GLIMPSE1. I was wondering if I can pool samples and work in GLIMPSE2. From your answer, it looks like I can create per-sample GLs, merge them, and then impute them in GLIMPSE2 as well, knowing that GLIMPSE2 only imputes against the reference panel.

Thank you for your help.

Zepeng-Mu commented 4 months ago

I wonder what the INFO scores will look like if I only impute one sample at a time. If I remember correctly (I could be wrong), when I tried GLIMPSE1 on just one sample, the INFO scores were just 0 and 1.

srubinacci commented 4 months ago

@ssinh10 yes, you can absolutely use more than one sample in your VCF using GLIMPSE2!

@Zepeng-Mu the INFO score is meant to be a valid statistic for a large number of samples. A description is given in Marchini and Howie 2010, Supplementary S3: https://www.nature.com/articles/nrg2796#Sec9

You might want to merge your cohort and then recompute the INFO score (e.g. using bcftools +impute-info).
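
To see why the statistic is uninformative for a single sample, here is a minimal sketch of an IMPUTE-style INFO score computed from genotype posteriors. It assumes the commonly cited formulation, 1 - sum_i(f_i - e_i^2) / (2N * theta * (1 - theta)), with e_i the expected dosage and theta the estimated allele frequency. Whether this matches the exact quantity reported by GLIMPSE or by the bcftools plugin is an assumption, and the function name is hypothetical.

```python
from typing import Iterable, Tuple


def info_score(posteriors: Iterable[Tuple[float, float, float]]) -> float:
    """IMPUTE-style INFO score at one site.

    posteriors -- per-sample (P(0/0), P(0/1), P(1/1)) genotype probabilities.
    Returns 1.0 by convention when the estimated allele frequency is 0 or 1.
    """
    gps = list(posteriors)
    n = len(gps)
    if n == 0:
        raise ValueError("need at least one sample")

    dosage_sum = 0.0    # sum of expected dosages e_i
    variance_sum = 0.0  # sum of per-sample dosage variances f_i - e_i^2
    for _p0, p1, p2 in gps:
        e = p1 + 2.0 * p2
        f = p1 + 4.0 * p2
        dosage_sum += e
        variance_sum += f - e * e

    theta = dosage_sum / (2.0 * n)
    if theta <= 0.0 or theta >= 1.0:
        return 1.0
    return 1.0 - variance_sum / (2.0 * n * theta * (1.0 - theta))


# One confidently called sample versus one completely uncertain sample.
print(info_score([(0.0, 1.0, 0.0)]))
print(info_score([(0.25, 0.5, 0.25)]))
```

With N = 1 the allele frequency is estimated from that one sample's dosage alone, so the score says little about imputation quality; it becomes meaningful only when computed across a merged cohort.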

Best,

Simone