Closed TC-gu closed 1 year ago
Additionally, when I try to compute everything at the chromosome level from the start, using
info_snp <- snp_match(sumstats[sumstats$chr==chrom,], map_ldref[map_ldref$chr==chrom])
and then building a single corr instead of using the for() loop:
corr_chr <- readRDS(paste0("ldref_hm3_plus/LD_with_blocks_chr",chrom,".rds"))[ind.chr2, ind.chr2]
corr <- as_SFBM(corr_chr, tmpfull, compact = TRUE)
(I think ind.chr2 contains the correct indices in that case, but I am not 100% sure.)
Then, for all the multi_auto chains, path_h2_est is constant and equal to 0.001 (regardless of the h2_init value).
That does not seem correct...?
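For reference, here is a consolidated sketch of the per-chromosome setup I am describing (a sketch only, assuming sumstats, map_ldref, and tmpfull are defined as in the tutorial, and that each per-chromosome LD file contains exactly the map_ldref variants of that chromosome, in the same order):

```r
library(bigsnpr)  # also provides as_SFBM() from bigsparser

chrom <- 22
# match the summary statistics against the LD reference, per chromosome
info_snp <- snp_match(sumstats[sumstats$chr == chrom, ],
                      map_ldref[map_ldref$chr == chrom, ])

# `_NUM_ID_` indexes the SECOND argument of snp_match(), i.e. the per-chromosome
# LD reference map, so (if my assumption above holds) it can be used directly
# to subset the per-chromosome LD matrix
ind.chr2 <- info_snp[["_NUM_ID_"]]
corr_chr <- readRDS(paste0("ldref_hm3_plus/LD_with_blocks_chr", chrom, ".rds"))[ind.chr2, ind.chr2]
corr <- as_SFBM(corr_chr, tmpfull, compact = TRUE)
```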
Your issue is just at the step of computing the polygenic scores for your dataset from the effect sizes you got from LDpred2-auto, right?
If you have
map_pgs <- cbind(df_beta[1:4], beta = beta_auto)
map_pgs2 <- snp_match(map_pgs, map)
Then, you can either:
- read the data with snp_readBed2() (specifying ind.col) and then impute (if you have missing values), then use big_prodVec()
- directly use bed_prodVec() (again, specifying ind.col)
- export map_pgs2 to a file (e.g. with bigsnpr:::write.table2()) and use PLINK to compute the scores

I think PLINK would use a better strategy to handle the missing values in the bed file (which I should implement at some point, cf. https://github.com/privefl/bigsnpr/issues/298). I usually work with already imputed data.
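A minimal sketch of the second option (bed_prodVec() on the raw .bed file, no .bk backing file needed); here bedfile is a placeholder for your own file, and the map columns are renamed from what bed() returns:

```r
library(bigsnpr)

obj.bed <- bed(bedfile)  # memory-maps the .bed file directly
map <- dplyr::transmute(obj.bed$map, chr = chromosome, pos = physical.pos,
                        a1 = allele1, a0 = allele2)

map_pgs  <- cbind(df_beta[1:4], beta = beta_auto)
map_pgs2 <- snp_match(map_pgs, map)

# `_NUM_ID_` indexes the second argument of snp_match(), i.e. the bed file's map
pred <- bed_prodVec(obj.bed, map_pgs2$beta,
                    ind.col = map_pgs2[["_NUM_ID_"]],
                    ncores = nb_cores())
```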
Hi again,
Yes, that was the issue. Was there a reason in the tutorials why you don't match beta_auto with map_pgs directly, and instead use beta_auto * match$beta in big_prodVec()? Using (for chr 22 as an example):
map_pgs <- cbind(df_beta[1:4],beta=beta_auto)
match1 <- snp_match(map_pgs[map_pgs$chr==22,], map)
pred_auto1 <- big_prodVec(G_nonmis, match1$beta,
ind.col = match1[["_NUM_ID_"]],
ncores = NCORES)
seems to work, and I think the correct indices are used. The other change is that I have to use
match1$beta instead of beta_auto * match1$beta (where match1$beta was previously = 1) in big_prodVec() for y.col.
So just to be completely sure - if I run that for each chromosome and then sum up pred_auto for each individual I should get the same results as if I run it on the combined genotype (.bed file with all chromosomes)?
Thanks a lot for a quick response!
There is no difference (for the betas), both should be the same.
Yes, you can just sum the 22 PGS from each chromosome.
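Putting the confirmed approach together, the per-chromosome loop could be sketched like this (the file naming data/geno_chr%d.rds is a placeholder for your own per-chromosome bigSNP files, and the map columns are renamed from what snp_attach() returns):

```r
library(bigsnpr)

map_pgs <- cbind(df_beta[1:4], beta = beta_auto)

pred_auto <- 0
for (chrom in 1:22) {
  obj <- snp_attach(sprintf("data/geno_chr%d.rds", chrom))  # placeholder paths
  map_chr <- dplyr::transmute(obj$map, chr = chromosome, pos = physical.pos,
                              a1 = allele1, a0 = allele2)
  matched <- snp_match(map_pgs[map_pgs$chr == chrom, ], map_chr)
  # accumulate the per-chromosome scores into one PGS per individual
  pred_auto <- pred_auto + big_prodVec(obj$genotypes, matched$beta,
                                       ind.col = matched[["_NUM_ID_"]],
                                       ncores = nb_cores())
}
```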
Thanks a lot, I had multiple issues trying to get results at first but finally everything seems quite clear.
On a general note, I found it quite difficult to get around all the indexing in the various steps (especially since I had to modify the tutorials to run per chromosome). If I might make a suggestion, it would be helpful to point out what the _NUM_ID_ and _NUM_ID_.ss columns generated by snp_match() are. While ".ss" indicates sumstats, it actually depends on the order of the arguments provided to snp_match() (now that I look up the help file again, it does state that "sumstats" is the first argument, but there is no information in the help file about the _NUM_ID_ columns; I think adding that would help to understand the indexing better).
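To illustrate what I mean, a tiny self-contained example of the two columns (toy data, not from the tutorial):

```r
library(bigsnpr)

sumstats <- data.frame(chr = 1, pos = c(100, 200, 300),
                       a0 = "A", a1 = "C", beta = c(0.1, -0.2, 0.3))
info_snp <- data.frame(chr = 1, pos = c(300, 100), a0 = "A", a1 = "C")

m <- snp_match(sumstats, info_snp)
# `_NUM_ID_.ss` indexes rows of the FIRST argument (sumstats),
# `_NUM_ID_`    indexes rows of the SECOND argument (info_snp),
# whatever the two data frames actually contain
m[["_NUM_ID_.ss"]]
m[["_NUM_ID_"]]
```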
You're probably right.
If you know how to create a PR, you can suggest some change to the documentation. Otherwise, I'll try to do it in a future version.
I'm quite a noob when it comes to github stuff, I'm not sure how to do that.
Hi Florian,
I am trying to adapt one of your tutorials to my data set: https://github.com/privefl/paper-infer/blob/main/code/example-with-provided-LD.R
For now I cannot have my genotypes combined (the .bk file gets too big), so I am trying to run it per chromosome and sum the pred_auto scores, like you suggested in one of the other issues. I have over 20000 subjects, so I want to use the pre-computed LD correlation matrix.
All the computations until getting
beta_auto <- rowMeans(sapply(multi_auto[keep], function(auto) auto$beta_est))
work fine since they don't use my genotype data. The problem starts at the step where I actually have to use map_pgs[map_pgs$chr==chrom,] (otherwise the error "not enough variants are matched" shows up); since not all variants are matched, I don't know how to select the right part of beta_auto later on.
I cannot use the indexing based on df_beta (variants from the correct chromosome, since not all of them are in beta_auto), and I don't see a way to properly split beta_auto into chromosome chunks.
Do you have any suggestions?
Generally (but that is another issue) I would rather run everything on the whole genotype file, but then the .bk file gets over 1000 GB and some memory issues come up. Is there a way to prune my initial combined .bed file so that only the "correct" variants remain? Alternatively, can I combine my per-chromosome data at the level of the bigSNP object (after using snp_attach() for each chromosome)? I couldn't find any such function in the library.
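One option I am wondering about for the pruning would be to read only the matched variants into the backing file via the ind.col argument of snp_readBed2(), something like the following (bedfile being my combined file, and the backing file path a placeholder):

```r
library(bigsnpr)

obj.bed <- bed(bedfile)  # the combined .bed file
map <- dplyr::transmute(obj.bed$map, chr = chromosome, pos = physical.pos,
                        a1 = allele1, a0 = allele2)
map_pgs2 <- snp_match(cbind(df_beta[1:4], beta = beta_auto), map)

# write only the matched variants to a (much smaller) .bk file
rds <- snp_readBed2(bedfile, backingfile = "data/geno_matched",
                    ind.col = map_pgs2[["_NUM_ID_"]],
                    ncores = nb_cores())
obj <- snp_attach(rds)
```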
Looking forward to any suggestions.
Best wishes, Tomasz