Closed caimiao0714 closed 2 years ago
Just to explain the issue, hoping that this could help you get the point more clearly.
I know that the issue arises when I create the list of SNP IDs for snp_readBGEN: I pasted allele 2 before allele 1 (list(map_bed[, paste0(chromosome, '_', physical.pos, '_', allele2, '_', allele1)])). But if I paste allele 1 before allele 2, I do not get any SNP data, because snp_readBGEN will not find any matching SNPs.
So the real issue is still that the reference alleles in the .bed files and in the .bgen files read by snp_readBGEN are different.
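To make the ID matching described above concrete, here is a minimal sketch in base R. It uses two of the chr22 variants from this thread as toy data; `map_bed` and `bgi_ids` are hand-typed stand-ins (an assumption for illustration), not the real files. It builds `<chr>_<pos>_<a1>_<a2>` strings in both allele orders and counts how many match the BGI-style IDs.

```r
# Toy stand-in for the .bim-derived map (alleles in the order PLINK wrote them).
map_bed <- data.frame(
  chromosome   = c("22", "22"),
  physical.pos = c(16051249L, 16053862L),
  allele1      = c("C", "T"),
  allele2      = c("T", "C"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the IDs implied by the .bgi index (allele order as in BGEN).
bgi_ids <- c("22_16051249_T_C", "22_16053862_C_T")

# Build candidate IDs in both allele orders.
ids_a1_a2 <- with(map_bed, paste(chromosome, physical.pos, allele1, allele2, sep = "_"))
ids_a2_a1 <- with(map_bed, paste(chromosome, physical.pos, allele2, allele1, sep = "_"))

# Count how many IDs are found under each order.
n_match_a1_a2 <- sum(ids_a1_a2 %in% bgi_ids)
n_match_a2_a1 <- sum(ids_a2_a1 %in% bgi_ids)
print(c(a1_a2 = n_match_a1_a2, a2_a1 = n_match_a2_a1))
```

With this toy data only the allele2-before-allele1 order matches, which mirrors the behaviour reported above. Checking both orders before calling snp_readBGEN is a cheap way to see how the two files label the same variants.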
I think it is a problem with the UKBB. I vaguely remember seeing this, where the alleles in BGI files were inverted compared to the ones in MFI files.
I checked the .mfi files. It seems that the reference alleles in the .mfi files are consistent with those in the .bgi files:
mfi22 = fread('/data1/ShareData/UKBB/Genetics/Imputation MAF+info/ukb_mfi_chr22_v3.txt')
mfi22[V2 %in% c('rs62224609', 'rs587646183', 'rs62224614', 'rs7286962', 'rs9604721',
'rs368226325', 'rs374914422', 'rs191117135', 'rs200607599', 'rs370652263')]
V1 V2 V3 V4 V5 V6 V7 V8
1: 22:16051249_T_C rs62224609 16051249 T C 0.1008420 C 0.969322
2: 22:16052463_T_C rs587646183 16052463 T C 0.0129251 C 0.260530
3: 22:16053862_C_T rs62224614 16053862 C T 0.1024930 T 0.973160
4: 22:16054454_C_T rs7286962 16054454 C T 0.1056720 T 0.949779
5: 22:16054713_C_T rs9604721 16054713 C T 0.0141655 T 0.422478
6: 22:51231220_A_G rs368226325 51231220 A G 0.0542445 G 0.868414
7: 22:51231754_C_T rs374914422 51231754 C T 0.0277678 T 0.771012
8: 22:51234799_G_A rs191117135 51234799 G A 0.0151589 A 0.780764
9: 22:51237364_A_G rs200607599 51237364 A G 0.0152464 G 0.517684
10: 22:51237712_G_A rs370652263 51237712 G A 0.0562941 A 0.860237
map_bgen # read from .bgen and .bgi files
chromosome marker.ID rsid physical.pos allele1 allele2 freq info
1: 22 22:16051249_T_C rs62224609 16051249 T C 0.10084166 0.9693216
2: 22 22:16052463_T_C rs587646183 16052463 T C 0.01292510 0.2605300
3: 22 22:16053862_C_T rs62224614 16053862 C T 0.10249263 0.9731603
4: 22 22:16054454_C_T rs7286962 16054454 C T 0.10567225 0.9497794
5: 22 22:16054713_C_T rs9604721 16054713 C T 0.01416555 0.4224776
---
83425: 22 22:51231220_A_G rs368226325 51231220 A G 0.05424451 0.8684137
83426: 22 22:51231754_C_T rs374914422 51231754 C T 0.02776781 0.7710121
83427: 22 22:51234799_G_A rs191117135 51234799 G A 0.01515887 0.7807641
83428: 22 22:51237364_A_G rs200607599 51237364 A G 0.01524641 0.5176840
83429: 22 22:51237712_G_A rs370652263 51237712 G A 0.05629411 0.8602371
What puzzles me is that plink only reads the .bgen and .sample files and never uses the .bgi files. How can it get SNP reference alleles that are completely opposite to those read by bigsnpr? Do the results above mean that the reference alleles in the .bgen files are different from those in the .bgi files?
snp_readBGEN() is just reporting the alleles that are read from the BGI files.
Also it might just be that PLINK bed files are storing the number of reference alleles while BGEN are storing the probabilities for alternative alleles.
Is it really a problem that these are inverted?
The problem is that I don't know which data set I should use for the true reference alleles. I need reference alleles to calculate polygenic risk scores (based on other studies' weights). So far, I'm still not sure which dataset/software can give me the correct reference alleles.
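For a polygenic score, one way to sidestep the "reference" label is to align on the effect allele instead: keep a variant's dosage when the external study's effect allele is the allele being counted, and recode it as 2 - dosage when it is the other allele. Below is a minimal sketch in base R with made-up dosages and weights. The column names `counted_allele`, `other_allele`, `effect_allele`, and `beta` are hypothetical, and strand flips and ambiguous A/T or C/G variants are not handled.

```r
# Made-up genotype dosages: rows are individuals, columns are variants.
# Each value counts copies of `counted_allele` for that variant.
dosage <- matrix(c(0, 1,
                   2, 1,
                   1, 0),
                 nrow = 3, byrow = TRUE)

# Hypothetical variant map and external weights.
variants <- data.frame(
  counted_allele = c("C", "C"),
  other_allele   = c("T", "T"),
  effect_allele  = c("C", "T"),   # as reported by the external study
  beta           = c(0.5, 0.2),
  stringsAsFactors = FALSE
)

same    <- variants$effect_allele == variants$counted_allele
flipped <- variants$effect_allele == variants$other_allele
stopifnot(all(same | flipped))    # anything else needs manual inspection

# Recode flipped variants so every column counts the effect allele.
aligned <- dosage
aligned[, flipped] <- 2 - aligned[, flipped]

# Score = weighted sum of effect-allele counts.
prs <- as.vector(aligned %*% variants$beta)
print(prs)
```

Under this approach it does not matter which allele a file calls allele1 or allele2, as long as you know which allele the dosages count and which allele the weights refer to.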
Sorry about this. I don't think I get your point on the difference between the reference allele info from BGEN (read by plink) and BGI (read by bigsnpr). Ideally, shouldn't they have exactly the same set of reference alleles?
Quote from Christopher Chang:
Reread the bigsnpr output: it only refers to "allele1" and "allele2", not "REF"/"ALT".
The original convention for plink .bim files was to store minor alleles in "allele1" and major alleles in "allele2". Since the reference allele is usually major, the plink2 convention is to store REF=allele2.
So this seems to be an issue of how the reference allele is defined in plink2. plink2 .bed data refer to allele 2 as the reference allele (major allele), while .bgi data and the bigsnpr package refer to allele 1 as the reference allele (major allele). I hope that my understanding is correct.
Probably. There is no convention in bigsnpr, it is just reading this information from either bim files or bgi files. You just need some external data (e.g. allele frequencies) to make sure which one is REF and ALT.
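The frequency check suggested above can be sketched in base R using three of the MFI rows posted earlier in this thread. Reading the MFI columns as V4 = first allele, V5 = second allele, V6 = minor allele frequency, and V7 = minor allele is an assumption about the file layout, and the column names below are made up for readability.

```r
# Three rows copied from the MFI output above, with assumed column meanings.
mfi <- data.frame(
  rsid         = c("rs62224609", "rs62224614", "rs191117135"),
  allele1      = c("T", "C", "G"),
  allele2      = c("C", "T", "A"),
  maf          = c(0.1008420, 0.1024930, 0.0151589),
  minor_allele = c("C", "T", "A"),
  stringsAsFactors = FALSE
)

# Which column holds the minor allele for each variant?
mfi$minor_is <- ifelse(mfi$minor_allele == mfi$allele1, "allele1",
                ifelse(mfi$minor_allele == mfi$allele2, "allele2", NA_character_))

# Frequency of allele2 implied by the MAF and the minor-allele label.
mfi$freq_allele2 <- ifelse(mfi$minor_is == "allele2", mfi$maf, 1 - mfi$maf)

print(mfi[, c("rsid", "allele1", "allele2", "minor_is", "freq_allele2")])
```

For these rows the minor allele sits in the second allele column, so the first allele is the major one in this sample. Note that major is not necessarily the same as the genome reference allele; confirming REF would still require a reference genome or another external source.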
Ok. Thanks a lot for the help.
I was trying to read the same UK Biobank data from plink-converted binary files and from the official .bgen files separately, and I was expecting identical results. However, I got completely opposite reference alleles. I will try to show the issue using chr22 as an example.
I converted the original .bgen files to binary .bed files using plink2 (see the output messages below). The ref-first argument was set according to this thread. I suspect this argument is causing the problem, but it makes sense to me (the first allele is the reference allele, isn't it?).
Then I read it into R using bigsnpr:
On the other hand, I read the chromosome directly from the .bgen and .bgi files:
However, when I compare the two files, the reference alleles do not seem to match. Actually, they are in perfectly opposite directions.
I'm not sure if I'm using plink2 in a wrong way or if I did not set up bigsnpr correctly. To my understanding, the reference alleles should be identical, right? Any suggestion or comment would be appreciated.
Thanks, Miao