privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Question: Do we need bigsnpr imputation in UK biobank imputation data? #222

jasmine9764 closed this issue 3 years ago

jasmine9764 commented 3 years ago

Dear Florian, Thank you for developing such an amazing tool. We found that there are still missing variants in the UKB imputed data, and that it is possible to impute genotyped variants with bigsnpr, so we are wondering:

  1. Do you have experience imputing the UKB imputed data (72M SNPs after QC)? Would this method improve our predictive accuracy?

  2. How much CPU, computational power, and time would be needed?

We look forward to learning from your experience. Thank you in advance!

privefl commented 3 years ago

Hi,

  1. I have never found any missing values in the UKBB BGEN data; are these rare variants? Could you share one of these variants (<chr>_<pos>_<a0>_<a1>) please?

  2. I have only imputed genotyped data. I am not sure about imputing data that has already been imputed. I guess it would depend on many things, including the number of missing values (which is why I need to check some of them).

  3. Again, depends on the number of (variants with) missing values.
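To make "the number of (variants with) missing values" concrete, here is a minimal base-R sketch on a toy genotype matrix. The matrix `G` and its values are made up for illustration and are not UKBB data. For a real file-backed genotype matrix, bigstatsr's `big_counts()` would probably be the more suitable tool, but that is an assumption about the workflow rather than something stated in this thread.

```r
# Toy genotype matrix (4 individuals x 3 variants), values are illustrative only
G <- matrix(c(0, 1, 2, NA,    # snp1: one missing value
              1, 0, 2, 0,     # snp2: no missing values
              2, 2, NA, NA),  # snp3: two missing values
            nrow = 4,
            dimnames = list(NULL, c("snp1", "snp2", "snp3")))

# Number of missing values per variant (column)
n_missing <- colSums(is.na(G))

# Number of variants with at least one missing value
n_variants_with_missing <- sum(n_missing > 0)

# Overall proportion of missing genotypes
prop_missing <- mean(is.na(G))

print(n_missing)
print(n_variants_with_missing)
print(prop_missing)
```

The cost of any imputation step would be expected to scale with these counts, which is why they are worth computing first.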

xscapex commented 3 years ago

Hi Florian,

The following are some of the SNPs that I have found missing in the UKBB BGEN data:

21:9411602_T_C  21:9411645_A_G 21:9411785_G_T

Thanks

privefl commented 3 years ago

This is the code I have just tried: [image]

I don't see any missing values in my data. What could be different?

privefl commented 3 years ago

Please verify that the UKBB files have been downloaded properly, e.g. by comparing the md5sum returned by tools::md5sum("ukb_imp_chr21_v3.bgen") with the ones listed there: https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=997.
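A minimal sketch of that checksum comparison is below. The helper `check_md5()` is hypothetical (it is not part of bigsnpr), and the demonstration runs on a small temporary file so that it can be executed without the UKBB data. For the real check, the expected value would have to be copied from the UKBB page linked above.

```r
# Hypothetical helper (not part of bigsnpr): compare a file's md5sum
# with an expected checksum string
check_md5 <- function(path, expected) {
  stopifnot(file.exists(path))
  observed <- unname(tools::md5sum(path))
  identical(tolower(observed), tolower(expected))
}

# Toy demonstration on a temporary file containing the bytes "hello"
tmp <- tempfile()
writeBin(charToRaw("hello"), tmp)
ok  <- check_md5(tmp, "5d41402abc4b2a76b9719d911017c592")  # matching checksum
bad <- check_md5(tmp, "00000000000000000000000000000000")  # deliberately wrong

# For the real file, something like (expected value taken from the UKBB page):
# check_md5("ukb_imp_chr21_v3.bgen", "<md5sum listed by UKBB>")
```

If the comparison returns FALSE for a real BGEN file, re-downloading that file would be the first thing to try.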