privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
196 stars 44 forks source link

Different PGS for the same person #517

Closed trinetn closed 1 month ago

trinetn commented 1 month ago

Hi again I have one more question. I have been running the same piece of code but with different bigSNP objects. One object with ~12K individuals and one object with 3 individuals that I have extracted from my 12K individuals (I used Plink to do this on the original bed files). I have prepared the two objects the same way using snp_readBed() and restricting my SNPs to hapmap3+ SNPs. When I run my code with the same weights, I can see I get slight different PGSs for the same individuals.

The code that I run:

#--------------------------------------------------------------------------------------
#load libraries 
library(bigsnpr)
library(dplyr)

#Load weights and my weight SNPs
load("Beta_estimates.RData")
df_beta <- read.table(file="df_beta",header=T)

#Load big SNP object - I load one of the bigSNP object seen below
obj.bigSNP <- snp_attach("3ind_subset.rds")
obj.bigSNP <- snp_attach("All_12K.rds")

#A couple of variables are defined for the big SNP object
G   <- obj.bigSNP$genotypes

map <- dplyr::transmute(obj.bigSNP$map,
chr = chromosome, pos = physical.pos,a0 = allele2, a1 = allele1,rsid=marker.ID)
map_pgs <- df_beta[1:5]
map_pgs$beta <- final_beta_auto

#I identify variants that were used to estimate the weights and create a beta that reflects a potential allele flip
map_pgs2 <- snp_match(map_pgs, map,join_by_pos=F)

#I calculate the PGSs
big_prodVec(G, map_pgs2$beta,
                         ind.col = map_pgs2[["_NUM_ID_"]],ind.row=c(1,2))
#----------------------------------------------------------------------------------------------------

An example could be, when I run with the bigSNP object with all my individuals, I get an PGS around 0.0003 and when I use the other bigSNP object I get a PGS around -0.0005. When I check this persons genotype information, I can see around 1 mio SNPs have the same genotype(0,1or2) and same a1 in the two objects and around 80K SNPs have a different genotype and a1 in the two objects (ex. in one object the SNP has the genotype 0 and a1=A and in the other object the SNP has the genotype 2 and a1=G) . I can see these 80K SNPs are the SNPs that are "flipped" with the snp_match function so I think the code take cares of this difference between the two objects. Should I get the same PGS for the same individual in different bigSNP objects? Is something wrong in my code?

Bw TrineTN

privefl commented 1 month ago

Maybe PLINK change the minor allele when subsetting, not sure. Then I guess you can have a shift, which should be the same for all individuals.

trinetn commented 1 month ago

Thank you for your quick response I just check the difference for my three individuals (that I have in both objects) and indeed the difference for the PGSs are the same. I guess if I want to obtain the exact same PGSs, I need to go back to my Plink script and specify which allele should be the minor allele, is that correct?

privefl commented 1 month ago

After a bit of search, it seems you can use --keep-allele-order (cf. https://www.cog-genomics.org/plink/1.9/data#ax_allele).

trinetn commented 1 month ago

I have rerun Plink and LDpred2 and I now get the same PGSs for my individuals with the two different big SNP object.

Thank you for the tip!