privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
186 stars 44 forks source link

mismatched allele1 compared to `.bed` file converted using plink2 #339

Closed caimiao0714 closed 2 years ago

caimiao0714 commented 2 years ago

I was trying to read the same UK Biobank data using plink converted binary files and from official .bgen files separately, and I was expecting identical results. However, I got completely opposite reference alleles. I will try to show the issue using chr22 as an example.

I converted the original .bgen files to binary .bed files using plink2 (see the output messages below). The ref-first argument was set according to this thread. I suspect this argument is causing the problem, but it makes sense to me (the first allele is the reference allele, isn't it?).

(base) [caim@localhost Imputation BGEN]$ plink2 \
>         --bgen ukb_imp_chr22_v3.bgen ref-first \
>         --sample ../Imputation\ sample/ukb22828_c22_b0_v3_s487203.sample \
>         --make-bed \
>         -out ../plink_imp/plink2_ukb_imp_chr22 \
>         --maf 0.01 \
>         --geno 0.05 \
>         --hwe 0.000001 \
>         --mind 0.05 \
>         --threads 96
PLINK v2.00a3LM 64-bit Intel (3 May 2022)      www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ../plink_imp/plink2_ukb_imp_chr22.log.
Options in effect:
  --bgen ukb_imp_chr22_v3.bgen ref-first
  --geno 0.05
  --hwe 0.000001
  --maf 0.01
  --make-bed
  --mind 0.05
  --out ../plink_imp/plink2_ukb_imp_chr22
  --sample ../Imputation sample/ukb22828_c22_b0_v3_s487203.sample
  --threads 96

Start time: Mon May 23 03:26:45 2022
515404 MiB RAM detected; reserving 257702 MiB for main workspace.
Using up to 96 threads (change this with --threads).
--bgen: 1255683 variants detected, format v1.2.
487409 samples imported from .sample file to
../plink_imp/plink2_ukb_imp_chr22-temporary.psam .
--bgen: ../plink_imp/plink2_ukb_imp_chr22-temporary.pgen +
../plink_imp/plink2_ukb_imp_chr22-temporary.pvar written.
487409 samples (264247 females, 222956 males, 206 ambiguous; 487409 founders)
loaded from ../plink_imp/plink2_ukb_imp_chr22-temporary.psam.
1255683 variants loaded from ../plink_imp/plink2_ukb_imp_chr22-temporary.pvar.
Note: No phenotype data present.
Calculating sample missingness rates... done.
0 samples removed due to missing genotype data (--mind).
487409 samples (264247 females, 222956 males, 206 ambiguous; 487409 founders)
remaining after main filters.
Calculating allele frequencies... done.
--geno: 20913 variants removed due to missing genotype data.
--hwe: 111054 variants removed due to Hardy-Weinberg exact test (founders
only).
1040287 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
83429 variants remaining after main filters.
Writing ../plink_imp/plink2_ukb_imp_chr22.fam ... done.
Writing ../plink_imp/plink2_ukb_imp_chr22.bim ... done.
Writing ../plink_imp/plink2_ukb_imp_chr22.bed ... done.
End time: Mon May 23 03:30:49 2022

Then I read it into R using bigsnpr:

pacman::p_load(dplyr, data.table, bigsnpr)
d_bed = snp_readBed2(
  bedfile = '/data1/ShareData/UKBB/Genetics/plink_imp/plink2_ukb_imp_chr22.bed',
  backingfile = 'Data/chr22_bed',
  ncores = 1
)

# load .bed map file into R
map_bed = readRDS('Data/chr22_bed.rds') %>% 
  .[['map']] %>% 
  as.data.table()

On the other hand, I read the chromosome directly from .bgen and .bgi files:

d_bgen = snp_readBGEN(
  bgenfiles = '/data1/ShareData/UKBB/Genetics/Imputation BGEN/ukb_imp_chr22_v3.bgen',
  backingfile = 'Data/chr22_bgen',
  bgi_dir = "/data1/ShareData/UKBB/Genetics/Imputation BGI",
  list_snp_id = list(map_bed[,paste0(chromosome, '_', physical.pos, '_', allele2, '_', allele1)]),
  ncores = 1
)

# load .bgen map file into R
map_bgen = readRDS('Data/chr22_bgen.rds') %>% 
  .[['map']] %>% 
  as.data.table()

However, when I compare the two files, the reference allele does not seem to match. Actually they are in perfectly opposite direction.

> map_bed
       chromosome   marker.ID genetic.dist physical.pos allele1 allele2
    1:         22  rs62224609            0     16051249       C       T
    2:         22 rs587646183            0     16052463       C       T
    3:         22  rs62224614            0     16053862       T       C
    4:         22   rs7286962            0     16054454       T       C
    5:         22   rs9604721            0     16054713       T       C
   ---                                                                 
83425:         22 rs368226325            0     51231220       G       A
83426:         22 rs374914422            0     51231754       T       C
83427:         22 rs191117135            0     51234799       A       G
83428:         22 rs200607599            0     51237364       G       A
83429:         22 rs370652263            0     51237712       A       G
> map_bgen
       chromosome       marker.ID        rsid physical.pos allele1 allele2       freq      info
    1:         22 22:16051249_T_C  rs62224609     16051249       T       C 0.10084166 0.9693216
    2:         22 22:16052463_T_C rs587646183     16052463       T       C 0.01292510 0.2605300
    3:         22 22:16053862_C_T  rs62224614     16053862       C       T 0.10249263 0.9731603
    4:         22 22:16054454_C_T   rs7286962     16054454       C       T 0.10567225 0.9497794
    5:         22 22:16054713_C_T   rs9604721     16054713       C       T 0.01416555 0.4224776
   ---                                                                                         
83425:         22 22:51231220_A_G rs368226325     51231220       A       G 0.05424451 0.8684137
83426:         22 22:51231754_C_T rs374914422     51231754       C       T 0.02776781 0.7710121
83427:         22 22:51234799_G_A rs191117135     51234799       G       A 0.01515887 0.7807641
83428:         22 22:51237364_A_G rs200607599     51237364       A       G 0.01524641 0.5176840
83429:         22 22:51237712_G_A rs370652263     51237712       G       A 0.05629411 0.8602371

I'm not sure if I'm using plink2 in a wrong way or I did not set bigsnpr correctly. To my understanding, the reference allele should be identical right? Any suggestion or comment would be appreciated.

Thanks, Miao

caimiao0714 commented 2 years ago

Just to explain the issue, hoping that this could help you get the point more clearly.

I know that the issue is when I create the list of snp id in snp_readBGEN, I pasted allele 2 before allele 1 (list(map_bed[,paste0(chromosome, '_', physical.pos, '_', allele2, '_', allele1)])), but if I paste allele 1 before allele 2, I would not get any snp data because snp_readBGEN will not find any matching snps.

So the real issue is still that the reference alleles in .bed files and .bgen files read by snp_readBGEN are different.

privefl commented 2 years ago

I think it is a problem with the UKBB. I vaguely remember seeing this, where the alleles in BGI files were inverted compared to the ones in MFI files.

caimiao0714 commented 2 years ago

I checked the .mfi files. It seems that the reference alleles in .mfi files are consistent with those in .bgi files:

mfi22 = fread('/data1/ShareData/UKBB/Genetics/Imputation MAF+info/ukb_mfi_chr22_v3.txt')
mfi22[V2 %in% c('rs62224609', 'rs587646183', 'rs62224614', 'rs7286962', 'rs9604721', 
                'rs368226325', 'rs374914422', 'rs191117135', 'rs200607599', 'rs370652263')]
                 V1          V2       V3 V4 V5        V6 V7       V8
 1: 22:16051249_T_C  rs62224609 16051249  T  C 0.1008420  C 0.969322
 2: 22:16052463_T_C rs587646183 16052463  T  C 0.0129251  C 0.260530
 3: 22:16053862_C_T  rs62224614 16053862  C  T 0.1024930  T 0.973160
 4: 22:16054454_C_T   rs7286962 16054454  C  T 0.1056720  T 0.949779
 5: 22:16054713_C_T   rs9604721 16054713  C  T 0.0141655  T 0.422478
 6: 22:51231220_A_G rs368226325 51231220  A  G 0.0542445  G 0.868414
 7: 22:51231754_C_T rs374914422 51231754  C  T 0.0277678  T 0.771012
 8: 22:51234799_G_A rs191117135 51234799  G  A 0.0151589  A 0.780764
 9: 22:51237364_A_G rs200607599 51237364  A  G 0.0152464  G 0.517684
10: 22:51237712_G_A rs370652263 51237712  G  A 0.0562941  A 0.860237

map_bgen # read from .bgen and .bgi files
       chromosome       marker.ID        rsid physical.pos allele1 allele2       freq      info
    1:         22 22:16051249_T_C  rs62224609     16051249       T       C 0.10084166 0.9693216
    2:         22 22:16052463_T_C rs587646183     16052463       T       C 0.01292510 0.2605300
    3:         22 22:16053862_C_T  rs62224614     16053862       C       T 0.10249263 0.9731603
    4:         22 22:16054454_C_T   rs7286962     16054454       C       T 0.10567225 0.9497794
    5:         22 22:16054713_C_T   rs9604721     16054713       C       T 0.01416555 0.4224776
   ---                                                                                         
83425:         22 22:51231220_A_G rs368226325     51231220       A       G 0.05424451 0.8684137
83426:         22 22:51231754_C_T rs374914422     51231754       C       T 0.02776781 0.7710121
83427:         22 22:51234799_G_A rs191117135     51234799       G       A 0.01515887 0.7807641
83428:         22 22:51237364_A_G rs200607599     51237364       A       G 0.01524641 0.5176840
83429:         22 22:51237712_G_A rs370652263     51237712       G       A 0.05629411 0.8602371

What puzzles me is that plink only reads .bgen and .sample files and it never uses .bgi files. How can it get the snp reference alleles completely opposite of those read by bigsnpr? Does the results above mean the reference allele in .bgen files different from those in .bgi files?

privefl commented 2 years ago

snp_readBGEN() is just reporting the alleles that are read from the BGI files. Also it might just be that PLINK bed files are storing the number of reference alleles while BGEN are storing the probabilities for alternative alleles.

Is it really a problem that these are inverted?

caimiao0714 commented 2 years ago

The problem is that I don't know which data set should I use for the true reference alleles. I need reference alleles to calculate polygenic risk scores (based on other studies' weights). Till now, I'm still not sure which dataset/software could give me the correct reference alleles.

Sorry about this. I don't think I get your point on the difference between reference allele info from BGEN (read by plink) and BGI (read by bigsnpr). Ideally, shouldn't they have exactly the same set of reference alleles?

caimiao0714 commented 2 years ago

Quote from Christopher Chang:

Reread the bigsnpr output: it only refers to "allele1" and "allele2", not "REF"/"ALT".

The original convention for plink .bim files was to store minor alleles in "allele1" and major alleles in "allele2".  Since the reference allele is usually major, the plink2 convention is to store REF=allele2.

So this seems to an issue of the definition of reference allele in plink2. plink2 .bed data refer to allele 2 as the reference allele (major allele), while .bgi data and bigsnpr package refer to allele 1 as the reference allele (major allele). I hope that my understanding is correct.

privefl commented 2 years ago

Probably. There is no convention in bigsnpr, it is just reading this information from either bim files or bgi files. You just need some external data (e.g. allele frequencies) to make sure which one is REF and ALT.

caimiao0714 commented 2 years ago

Ok. Thanks a lot for the help.