privefl / bigstatsr

R package for statistical tools with big matrices stored on disk.
https://privefl.github.io/bigstatsr/

big_prodMat takes a really long time and finally throws an out-of-memory error #184

Closed: shachideshpande closed this issue 3 months ago

shachideshpande commented 3 months ago

Hi Florian,

When I use the big_prodMat() (or big_prodVec()) function on the HM3+ subset of UKBB variants (0.96 million SNPs), it takes a really long time (10+ minutes) on a 24-core/100 GB RAM node and ultimately throws an out-of-memory error.

I call the function in the following manner:

big_prodMat(G_imputed, posterior_beta_samples[, 1:500],
            ind.row = train_indices[1:1000], ind.col = df_beta[['_NUM_ID_']],
            ncores = NCORES)

Here, posterior_beta_samples comes from the following call (with best_param chosen using a validation dataset, or set to the true h2/p):

posterior_beta_samples <- snp_ldpred2_grid(
  corr, df_beta, best_param, ncores = NCORES, burn_in = 100,
  ind.corr = corr_indices, return_sampling_betas = TRUE, num_iter = 500)

I wonder whether the data type is inappropriate for any of the arguments; class(G_imputed) prints FBM.

privefl commented 3 months ago

For large matrix computations, please try setting bigparallelr::set_blas_ncores(NCORES) beforehand, and then running big_prodMat() with ncores = 1, as advised in the documentation.
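
Applied to your call, that would look something like this (a sketch using the same objects as in your snippet):

bigparallelr::set_blas_ncores(NCORES)  # let BLAS use all the cores
pred <- big_prodMat(
  G_imputed, posterior_beta_samples[, 1:500],
  ind.row = train_indices[1:1000],
  ind.col = df_beta[['_NUM_ID_']],
  ncores = 1)  # no package-level parallelization on top of BLAS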

privefl commented 3 months ago

And yes, big_prodMat() will take some time (maybe an hour), as any other function would, since it has to read the data for G from disk. It will be fast only if you have enough memory to cache the entire file (or the subset of columns you're using) and you use it repeatedly (then only the first pass reads from disk).

shachideshpande commented 3 months ago

Thank you! I will try this and get back to you. I have been having some trouble starting jobs on the cluster today, but should be able to try tomorrow.

shachideshpande commented 3 months ago

I tried the above suggestion and it does help avoid the memory error. I have a follow-up question: my code calls snp_ldpred2_auto and big_prodMat for 10 experimental replicates. Once I set bigparallelr::set_blas_ncores(NCORES) before running big_prodMat, do I need to revert or change anything when running snp_ldpred2_auto for the next replicate? Specifically, should ncores be set to 1 for other functions like snp_ldpred2_auto and snp_ldpred2_grid once bigparallelr::set_blas_ncores(NCORES) has been set?

privefl commented 3 months ago

Normally, you'll get an error in my packages when two different kinds of parallelization are used at the same time.

And LDpred2-auto should not really use matrix operations much, so BLAS parallelization should not come into play there.
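
If you want to be explicit about it, you can toggle the BLAS threads around each step of a replicate, something like this (a sketch):

bigparallelr::set_blas_ncores(1)       # single-threaded BLAS for the sampling step
# ... run snp_ldpred2_auto() / snp_ldpred2_grid() with ncores = NCORES ...
bigparallelr::set_blas_ncores(NCORES)  # multi-threaded BLAS for the matrix product
# ... run big_prodMat() with ncores = 1 ...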

But the memory issue is weird. You have tried using ncores = 1 in big_prodMat(), right?

shachideshpande commented 3 months ago

Thank you! I have tried ncores = 1 in big_prodMat(). Setting bigparallelr::set_blas_ncores(NCORES) and then using ncores = 1 in big_prodMat() seems to have fixed the memory issue I was facing. If it occurs again, I will reopen this conversation. Thank you for the help!