privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

indexing beta_auto in per chromosome computation #409

Closed TC-gu closed 1 year ago

TC-gu commented 1 year ago

Hi Florian,

I am trying to adapt one of your tutorials to my data set: https://github.com/privefl/paper-infer/blob/main/code/example-with-provided-LD.R

For now I cannot have my genotypes combined (the .bk gets too big), so I am trying to run it per chromosome and sum the pred_auto scores like you suggested in one of the other issues. I have over 20000 subjects so I want to use the pre-computed LD correlation matrix.

All the computations up to beta_auto <- rowMeans(sapply(multi_auto[keep], function(auto) auto$beta_est)) work fine, since they don't use my genotype data. The problem starts at the step

map_pgs <- df_beta[1:4]; map_pgs$beta <- 1
map_pgs2 <- snp_match(map_pgs, map)

where I actually have to use map_pgs[map_pgs$chr==chrom,] (otherwise the error "not enough variants are matched" shows up), and then, since not all variants are matched, I don't know how to select the right part of beta_auto later on in:

pred_auto <- big_prodVec(G, beta_auto * map_pgs2$beta,
                         ind.col = map_pgs2[["_NUM_ID_"]],
                         ncores = NCORES)

I cannot use the indexing based on df_beta (variants from the correct chromosome, since not all of them end up in beta_auto), and I don't see a way to properly split beta_auto into chromosome chunks.

Do you have any suggestions?

Generally (but that is another issue), I would rather run everything on the whole genotype file, but then the .bk file gets over 1000 GB and some memory issues come up. Is there a way to prune my initial combined .bed file so that only the "correct" variants remain? Alternatively, can I combine my chromosome data at the level of the bigSNP objects (after using snp_attach() for each chromosome)? I couldn't find any function for that in the package.

Looking forward to any suggestions.

Best wishes, Tomasz

TC-gu commented 1 year ago

Additionally, when I try to compute everything at the chromosome level from the start, i.e. using info_snp <- snp_match(sumstats[sumstats$chr == chrom, ], map_ldref[map_ldref$chr == chrom, ]) and then only a single corr instead of the for() loop:

corr_chr <- readRDS(paste0("ldref_hm3_plus/LD_with_blocks_chr", chrom, ".rds"))[ind.chr2, ind.chr2]
corr <- as_SFBM(corr_chr, tmpfull, compact = TRUE)

(I think in that case ind.chr2 are the correct indices, but I am not 100% sure.)
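For clarity, this is the per-chromosome indexing I am using, sketched with the names from above (my assumption being that, since both inputs to snp_match() were already subset to one chromosome, _NUM_ID_ indexes rows of that chromosome's LD matrix directly):

```r
# sumstats and map_ldref were both subset to one chromosome before matching,
# so _NUM_ID_ should index rows of the chromosome-specific LD matrix directly.
ind.chr2 <- info_snp[["_NUM_ID_"]]
corr_chr <- readRDS(paste0("ldref_hm3_plus/LD_with_blocks_chr", chrom, ".rds"))[ind.chr2, ind.chr2]
corr <- as_SFBM(corr_chr, tmpfull, compact = TRUE)
```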

Then, for all the multi_auto chains, path_h2_est is constant and equal to 0.001 (regardless of the h2_init value).

That does not seem correct...?

privefl commented 1 year ago

Your issue is just at the step of computing the polygenic scores for your dataset from the effect sizes you got from LDpred2-auto, right?

If you have

map_pgs <- cbind(df_beta[1:4], beta = beta_auto)
map_pgs2 <- snp_match(map_pgs, map)

Then, you can either:

I think PLINK would use a better strategy to handle the missing values in the bed file (which I should implement at some point, cf. https://github.com/privefl/bigsnpr/issues/298). I usually work with already imputed data.
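To make the full step explicit, here is a sketch in the thread's notation (assuming G and map come from your per-chromosome bigSNP object; snp_match() flips the sign of beta where alleles are reversed, so the matched betas can be used directly):

```r
# Carry the LDpred2-auto effects through the matching step, so the
# matched data frame already holds the correctly signed betas.
map_pgs <- cbind(df_beta[1:4], beta = beta_auto)
map_pgs2 <- snp_match(map_pgs, map)

# _NUM_ID_ indexes the second argument of snp_match(), i.e. columns of G.
pred_auto <- big_prodVec(G, map_pgs2$beta,
                         ind.col = map_pgs2[["_NUM_ID_"]],
                         ncores = NCORES)
```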

TC-gu commented 1 year ago

Hi again,

Yes, that was the issue. Was there a reason in the tutorials why you don't match beta_auto already within map_pgs, and instead use beta_auto * map_pgs2$beta in big_prodVec()? Using (chr 22 as an example):

map_pgs <- cbind(df_beta[1:4], beta = beta_auto)
match1 <- snp_match(map_pgs[map_pgs$chr == 22, ], map)
pred_auto1 <- big_prodVec(G_nonmis, match1$beta,
                          ind.col = match1[["_NUM_ID_"]],
                          ncores = NCORES)

seems to work, and I think the correct indices are used. The other change is that I have to use match1$beta instead of beta_auto * match1$beta (where match1$beta was previously = 1) in big_prodVec() for y.col.

So just to be completely sure - if I run that for each chromosome and then sum up pred_auto for each individual I should get the same results as if I run it on the combined genotype (.bed file with all chromosomes)?

Thanks a lot for a quick response!

privefl commented 1 year ago

There is no difference (for the betas), both should be the same.

Yes, you can just sum the 22 PGS from each chromosome.
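Summing the 22 per-chromosome scores can then be sketched as follows (the file names are hypothetical, and the map column renaming follows the tutorial convention; adjust both to your setup):

```r
map_pgs <- cbind(df_beta[1:4], beta = beta_auto)

pred_total <- 0
for (chrom in 1:22) {
  # Hypothetical per-chromosome bigSNP files; adjust paths to your layout.
  obj <- snp_attach(paste0("data/genotypes_chr", chrom, ".rds"))
  G <- obj$genotypes
  # Rename the map columns as in the LDpred2 tutorial (drop genetic.dist).
  map <- setNames(obj$map[-3], c("chr", "rsid", "pos", "a1", "a0"))

  match_chr <- snp_match(map_pgs[map_pgs$chr == chrom, ], map)
  pred_total <- pred_total + big_prodVec(G, match_chr$beta,
                                         ind.col = match_chr[["_NUM_ID_"]],
                                         ncores = NCORES)
}
```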

TC-gu commented 1 year ago

Thanks a lot. I had multiple issues trying to get results at first, but now everything seems quite clear.

On a general note, I found it quite difficult to get around all the indexing in the various steps (especially since I had to modify the tutorials to run per chromosome). If I may make a suggestion, it would be helpful to point out what the _NUM_ID_ and _NUM_ID_.ss columns generated by snp_match() are. While ".ss" indicates sumstats, it actually depends on the order of the arguments provided to snp_match() (the help file, which I looked up again, states that "sumstats" is the first argument, but says nothing about the _NUM_ID_ columns; I think documenting them would make the indexing easier to understand).
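For example, something like this in the docs would have helped me (this is my understanding from the thread, so please correct it if wrong):

```r
# After matched <- snp_match(sumstats, info_snp):
#   matched[["_NUM_ID_.ss"]] -- row indices into sumstats (the 1st argument)
#   matched[["_NUM_ID_"]]    -- row indices into info_snp (the 2nd argument)
matched <- snp_match(sumstats, info_snp)
sumstats[matched[["_NUM_ID_.ss"]], ]  # the matched summary-statistics rows
info_snp[matched[["_NUM_ID_"]], ]     # the matched reference/genotype rows
```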

privefl commented 1 year ago

You're probably right.

If you know how to create a PR, you can suggest some change to the documentation. Otherwise, I'll try to do it in a future version.

TC-gu commented 1 year ago

I'm quite a noob when it comes to github stuff, I'm not sure how to do that.