what(): Only 2 alleles allowed.

caimiao0714 commented 3 years ago

I'm trying to extract only a few SNPs associated with BMI (the SNP IDs from this Nature paper by Locke et al. (2015)) from the UKBB data. However, when I try to read these SNPs using bigsnpr::snp_readBGEN, I encounter this error message:

terminate called after throwing an instance of 'Rcpp::exception'
  what():  Only 2 alleles allowed.

Below is my code and

> pacman::p_load(dplyr, fst, data.table, bigsnpr, bigstatsr, bigreadr, doParallel)
> 
> 
> list_snp_id_BMI = readRDS('/fs/ess/PMIU0180/UKBB/Data/generated/genetics/list_snp_id_BMI.rds')
> 
> system.time(
+   rds <- bigsnpr::snp_readBGEN(
+     bgenfiles = glue::glue("/fs/ess/PMIU0180/UKBB/Data/Imputation BGEN/ukb_imp_chr{chr}_v3.bgen", chr = 1:22),
+     list_snp_id = list_snp_id_BMI,
+     backingfile = "/fs/ess/PMIU0180/UKBB/Data/generated/genetics/imp_bgen_BMI",
+     #ind_row = ind.indiv[sub2],
+     bgi_dir = "/fs/ess/PMIU0180/UKBB/Data/Imputation BGI",
+     ncores = 40
+   )
+ )
terminate called after throwing an instance of 'Rcpp::exception'
  what():  Only 2 alleles allowed.

Here is the 96 SNPs specified by Locke et al. (2015)

> list_snp_id_BMI
[[1]]
 [1] "1_47684677_T_G"  "1_49589847_A_G"  "1_50559820_C_T"  "1_72751185_T_C"  "1_75002193_G_A" 
 [6] "1_78446761_G_A"  "1_96924097_C_T"  "1_110154688_T_C" "1_177889480_A_G" "1_201784287_A_C"

[[2]]
 [1] "2_632348_A_G"    "2_25150296_A_G"  "2_26928811_G_A"  "2_59305625_T_C"  "2_63053048_G_A" 
 [6] "2_143043285_C_T" "2_164567689_T_C" "2_181550962_C_T" "2_208255518_A_G" "2_213413231_G_A"
[11] "2_219349752_C_T" "2_227092802_A_G"

[[3]]
[1] "3_25106437_A_G"  "3_61236462_C_T"  "3_81792112_C_A"  "3_85807590_T_G"  "3_141275436_G_T"
[6] "3_185824004_T_C"

[[4]]
[1] "4_45182527_A_G"  "4_77129568_C_G"  "4_103188709_C_T" "4_145659064_T_C"

[[5]]
[1] "5_75015242_T_G"  "5_153537893_G_T"

[[6]]
[1] "6_34563164_A_G"  "6_40348653_A_G"  "6_50845490_A_G"  "6_108977663_T_C" "6_120185665_C_T"
[6] "6_137675541_A_G" "6_163033350_A_G"

[[7]]
[1] "7_75163169_A_G" "7_76608143_C_T" "7_93197732_G_C" "7_95169514_G_T"

[[8]]
[1] "8_76806584_T_C" "8_81375457_C_T" "8_85079709_T_C"

[[9]]
[1] "9_15634326_T_C"  "9_28414339_A_G"  "9_111932342_C_T" "9_120378483_T_C" "9_129460914_A_G"

[[10]]
[1] "10_87410904_A_G"  "10_102395440_T_C" "10_104869038_T_C" "10_114758349_C_T"

[[11]]
[1] "11_8673939_C_G"   "11_27684517_A_G"  "11_43864278_T_C"  "11_47650993_C_T"  "11_115022404_A_G"

[[12]]
[1] "12_50247468_G_A"  "12_122781897_G_A"

[[13]]
[1] "13_54102206_G_A" "13_66205704_A_G" "13_79580919_G_A"

[[14]]
[1] "14_25928179_C_A" "14_29736838_C_A" "14_30515112_C_T" "14_79899454_C_T"

[[15]]
[1] "15_51748610_A_G" "15_68077168_T_C" "15_73093991_C_T"

[[16]]
[1] "16_3627358_C_T"  "16_19935389_G_A" "16_28333411_G_A" "16_28889486_C_A" "16_30015337_A_G"
[6] "16_31129895_A_G" "16_49062590_C_A" "16_53803574_T_A"

[[17]]
[1] "17_2005136_C_G"  "17_5283252_A_G"  "17_78615571_G_A"

[[18]]
[1] "18_21104888_C_T" "18_40147671_G_A" "18_56883319_T_G" "18_57829135_T_C"

[[19]]
[1] "19_18454825_A_G" "19_34309532_A_G" "19_45395619_A_G" "19_46202172_C_T" "19_47569003_G_A"

[[20]]
[1] "20_51087862_C_T"

[[21]]
[1] "21_40291740_T_C"

[[22]]
character(0)

Based on my visual inspection, all the SNPs do seem to have 2 alleles. I also checked the MD5 sums of all my .bgen and .bgen.bgi files and they are all correct.

Any suggestions/comments would be appreciated.

Thanks, Miao

privefl commented 3 years ago

Could you share this list with me using dput()?

caimiao0714 commented 3 years ago

Sure. Please refer to the code below.

list(c("1_47684677_T_G", "1_49589847_A_G", "1_50559820_C_T", 
"1_72751185_T_C", "1_75002193_G_A", "1_78446761_G_A", "1_96924097_C_T", 
"1_110154688_T_C", "1_177889480_A_G", "1_201784287_A_C"), c("2_632348_A_G", 
"2_25150296_A_G", "2_26928811_G_A", "2_59305625_T_C", "2_63053048_G_A", 
"2_143043285_C_T", "2_164567689_T_C", "2_181550962_C_T", "2_208255518_A_G", 
"2_213413231_G_A", "2_219349752_C_T", "2_227092802_A_G"), c("3_25106437_A_G", 
"3_61236462_C_T", "3_81792112_C_A", "3_85807590_T_G", "3_141275436_G_T", 
"3_185824004_T_C"), c("4_45182527_A_G", "4_77129568_C_G", "4_103188709_C_T", 
"4_145659064_T_C"), c("5_75015242_T_G", "5_153537893_G_T"), c("6_34563164_A_G", 
"6_40348653_A_G", "6_50845490_A_G", "6_108977663_T_C", "6_120185665_C_T", 
"6_137675541_A_G", "6_163033350_A_G"), c("7_75163169_A_G", "7_76608143_C_T", 
"7_93197732_G_C", "7_95169514_G_T"), c("8_76806584_T_C", "8_81375457_C_T", 
"8_85079709_T_C"), c("9_15634326_T_C", "9_28414339_A_G", "9_111932342_C_T", 
"9_120378483_T_C", "9_129460914_A_G"), c("10_87410904_A_G", "10_102395440_T_C", 
"10_104869038_T_C", "10_114758349_C_T"), c("11_8673939_C_G", 
"11_27684517_A_G", "11_43864278_T_C", "11_47650993_C_T", "11_115022404_A_G"
), c("12_50247468_G_A", "12_122781897_G_A"), c("13_54102206_G_A", 
"13_66205704_A_G", "13_79580919_G_A"), c("14_25928179_C_A", "14_29736838_C_A", 
"14_30515112_C_T", "14_79899454_C_T"), c("15_51748610_A_G", "15_68077168_T_C", 
"15_73093991_C_T"), c("16_3627358_C_T", "16_19935389_G_A", "16_28333411_G_A", 
"16_28889486_C_A", "16_30015337_A_G", "16_31129895_A_G", "16_49062590_C_A", 
"16_53803574_T_A"), c("17_2005136_C_G", "17_5283252_A_G", "17_78615571_G_A"
), c("18_21104888_C_T", "18_40147671_G_A", "18_56883319_T_G", 
"18_57829135_T_C"), c("19_18454825_A_G", "19_34309532_A_G", "19_45395619_A_G", 
"19_46202172_C_T", "19_47569003_G_A"), "20_51087862_C_T", "21_40291740_T_C", 
    character(0))

privefl commented 3 years ago

Thanks, I am able to run the following code without any issue:

caimiao0714 commented 3 years ago

Ok. Thank you! What is this tempfile() on the 5th line? In addition, you put the .bgen.bgi files in the .bgen folder right? I'm not sure if that makes a difference, but I'll try it. I also find out that I'm using bigsnpr 1.6.1, so I'll try to update the package and see if it makes a difference.

privefl commented 3 years ago

tempfile() just generates some file name to be placed in some temporary directory.
where you put the BGI files does not have any importance, and this is used before you get this issue (which is at the C++ level when reading the BGEN data, after having read the BGI data).
could you identify which chromosome is a problem? (by restricting bgenfiles and list_snp_id)
yes, you can try to update the package as well

caimiao0714 commented 3 years ago

Thank you!

I've run the snp_readBGEN for each chromosome one-by-one, and the issue seems to be in c10 and c11. And I checked the MD5 sum again and found that they are NOT consistent with the official website again.

Oh my lord, what's going on with my server??? I have another copy with the correct MD5 sum in another folder, and I copied them to this current folder, but these copied files have the wrong MD5 sum. Is it common to have this type of error when copying files from folders to folders? I'll close this issue soon.

Thanks, Miao

privefl commented 3 years ago

I don't know if it is common. So the issue is with "corrupted" files?

caimiao0714 commented 3 years ago

Ok. Yes, the issue is because the files were corrupted.

privefl commented 3 years ago

This is the fourth issue about this here. I'm starting to consider storing these md5sums in the package and checking these when the filename is UKBB.

caimiao0714 commented 3 years ago

Thank you! I agree. I think checking md5sums is important. I have an original copy of .bgen files downloaded from UKBB with the correct md5sums. However, I've met the same issue at least 5 times when I used copied files in another folder/project using different servers. These .bgen files seem to be especially vulnerable to copy and paste, even on the same hard drive.

privefl commented 3 years ago

Ok, I'll reopen this to remember to do it.

caimiao0714 commented 3 years ago

Ok. Thank you!

privefl commented 3 years ago

Is your problem because the file was not copied entirely? Then I think I could detect that in another way.

caimiao0714 commented 3 years ago

In my case, the copied files seem to have the same size and modified date time as the original files.

privefl commented 3 years ago

So, same size, but not the same data?

caimiao0714 commented 3 years ago

Yes. At least the numbers were the same at GB level. I've deleted those corrupted files, so I can't verify if the file sizes were exactly the same, but I still remember both of the two bgen files were 113 GBs for chr10.

privefl / bigsnpr

what(): Only 2 alleles allowed. #228