privefl / paper-ldpred2

Paper discribing LDpred2
https://doi.org/10.1093/bioinformatics/btaa1029
16 stars 12 forks source link

Matrix Math #5

Closed lizlitkowski closed 3 years ago

lizlitkowski commented 3 years ago

Hi,

I was trying to check my matrix math and compare to the output of big_prodMat in the bigsnpr package (about 20 lines of code and output attached). I expected to be able to multiply the beta values by the genotypes, sum them up, and get the same answer as in the function. However, it is always a little off. I expected it to be exact. Could this be due to different levels of rounding?

Thanks. Liz

# These are the beta values for all 30 scenarios of h2 and p
beta_auto <- sapply(multi_auto, function(auto) auto$beta_est)

# Chr22 is the test chromosome.  We will do these one chromosome at a time
bedfile = "TAGC_Chr22.bed"
bed = snp_readBed(bedfile, backingfile = sub_bed(bedfile))
TAGC <- snp_attach(bed)
#str(TAGC) # 7729 x 16419 for Chr22, 7729 x 97458 for Chr2
G <- TAGC$genotypes

# Map just to the chromosome we are calculating
map <- dplyr::transmute(TAGC$map,
                        chr = as.integer(chromosome), pos = physical.pos,
                        a0 = allele1, a1 = allele2)  # reversed somehow..
map_pgs <- df_beta[c(1:5,14)]; map_pgs$beta <- 1
map_pgs$hg19 = map_pgs$pos # Keeping this for comparison later
map_pgs$pos = map_pgs$pos_hg38 # We need hg38 because this is the TopMed Sequence genome assembly
map_pgs2 <- snp_match(map_pgs, map)

# _NUM_ID_ seems to be the map into the chromosome sized items 
# while _NUM_ID_.ss seems to map into the full chromosome matrix)
GFBM = as_FBM(G[,map_pgs2[["_NUM_ID_"]]])
pred_auto <- big_prodMat(GFBM, sweep(beta_auto[map_pgs2[["_NUM_ID_.ss"]],], 1, map_pgs2$beta, '*'), ncores = NCORES)

# Do some matrix math to compare what we would expect....this is on Chr22.
ColID = 7; PersonID = 1013
Onebeta = beta_auto[map_pgs2[["_NUM_ID_.ss"]],ColID]
Oneperson = G[PersonID,map_pgs2[["_NUM_ID_"]]]
PRSPerson = sum(Onebeta*(-1)*Oneperson)
print(PRSPerson)
print("Now pred_auto calculation")
print(pred_auto[PersonID,])

# Output below
933,074 variants to be matched.
0 ambiguous SNPs have been removed.
13,358 variants have been matched; 0 were flipped and 13,175 were reversed.

[1] 0.05068206
[1] "Now pred_auto calculation"
[1]  0.0489877538  0.0466978807  0.0491356973  0.0510569618  0.0511219294
[6]  0.0429244722  0.0461996485  0.0480051599  0.0443773547  0.0466585931
[11]  0.0468342975  0.0448386896  0.0494604417  0.0482822561  0.0428231562
[16]  0.0477076403  0.0448193827  0.0495798391  0.0472493303  0.0158506961
[21] -0.0003978807  0.0002496557  0.0003152112  0.0009851252  0.0021061355
[26] -0.0008542866 -0.0009923572 -0.0008881025 -0.0017234227 -0.0015367257
privefl commented 3 years ago
  1. You should not have to match after getting the results from LDpred2, you should always do it before, otherwise you could lose some predictive accuracy.

  2. Do not use GFBM = as_FBM(G[,map_pgs2[["_NUM_ID_"]]]), use instead the ind.col parameter in big_prodMat().

  3. I am not sure where the issue is but maybe it's because you're multiplying by map_pgs2$beta first, and then only uniformly by (-1)?

Anyway, I'm quite sure there is no issue with big_prodMat(); it must be some mistake in your code.

lizlitkowski commented 3 years ago

Thanks for the quick reply. I didn't think there was an issue with big_prodMat(). I was trying to understand if it was just performing a simple calculation of the beta value*genotype for each SNP for each individual and then summing them up?

privefl commented 3 years ago

Yes, just a matrix product, equivalent to G %*% beta, and yes you need to sum these up if you do it per chromosome.

lizlitkowski commented 3 years ago

Great. Thank you!