jenfb / bkmr

Bayesian kernel machine regression

BKMR does not run with huge data. #36

Open LindaAmoafo opened 9 months ago

LindaAmoafo commented 9 months ago

Hello,

I am running BKMR models for the effect of multiple exposures on suicide. My design is a case-crossover design, so I use the id argument, and since suicide is binary, I set family = "binomial". When I first ran the model I encountered errors, but I found an issue where you recommended setting est.h = TRUE for the binomial family, which fixed them. However, I have a large dataset of ~40,000 rows with 9,000 unique subjects. When I run BKMR on the full data, the only messages on the console are "Fitting probit regression model" and "Validating control.params...". I left it running for a full day; it kept running, but nothing seemed to happen.
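For reference, a minimal sketch of the call I'm using (object names here are placeholders for my actual data):

```r
library(bkmr)

# y: binary suicide indicator; Z: exposure matrix; X: covariate matrix;
# id: subject identifier for the case-crossover (random intercept) structure
set.seed(111)
fit <- kmbayes(y = y, Z = Z, X = X, id = id,
               iter = 10000,
               family = "binomial",
               est.h = TRUE,
               verbose = TRUE)
```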

When I randomly select 1,000 rows, it runs, but when I randomly select 5,000 rows instead, nothing happens.

Can you help me figure out how to address this issue?

Thank you for your time.

jenfb commented 9 months ago

That is quite a large dataset. Did you try using the knots argument to see if it runs more quickly?
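For example, something along the lines of the Gaussian predictive process approach in the package overview, with knot locations chosen via fields::cover.design (100 knots is just an arbitrary starting point; fewer knots should run faster):

```r
library(bkmr)
library(fields)

# Choose knot locations that cover the exposure space Z
knots100 <- fields::cover.design(Z, nd = 100)$design

fit_knots <- kmbayes(y = y, Z = Z, X = X,
                     iter = 10000,
                     family = "binomial",
                     est.h = TRUE,
                     knots = knots100)
```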


LindaAmoafo commented 9 months ago

I am implementing models with a random intercept; however, the knots argument is currently only implemented for models without a random intercept.

LindaAmoafo commented 9 months ago

Hello Jennifer, I'd like to discuss a strategy I've been working on. To address this particular challenge, I am trying to implement a sort of ensemble method. I randomly divided the data into multiple cohorts of no more than 1,400 rows each.

Subsequently, I ran each cohort independently on a high-performance computer, generating 30,000 posterior draws for each cohort. I discarded the initial 15,000 draws as burn-in and combined the remaining 15,000 from each cohort to create new posterior samples. These combined samples were then used to calculate posterior summaries, including means and intervals.
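Roughly, the procedure looks like this (a simplified sketch with placeholder names; in practice each cohort is submitted as a separate HPC job):

```r
library(bkmr)

# `cohort` assigns each row to one of the cohorts of <= ~1,400 rows
# (keeping each subject's rows together); y, Z, X, id are the full data.
fits <- lapply(split(seq_along(y), cohort), function(rows) {
  kmbayes(y = y[rows], Z = Z[rows, ], X = X[rows, ], id = id[rows],
          iter = 30000, family = "binomial", est.h = TRUE)
})

# Drop the first 15,000 draws as burn-in and stack the remaining draws
# across cohorts, e.g. for the kernel parameters r:
keep <- 15001:30000
r_comb <- do.call(rbind, lapply(fits, function(f) f$r[keep, , drop = FALSE]))
colMeans(r_comb)
apply(r_comb, 2, quantile, probs = c(0.025, 0.975))
```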

While this approach aims to bring me closer to the estimates I would get from the full cohort, I've encountered an issue with obtaining h.hat and ystar. h.hat is typically estimated by contrasting an individual's exposure trajectory with those of the other subjects, but here it only involves subjects within the same cohort. The same challenge extends to the calculation of ystar, which relies on h.hat.

To address this, I reviewed your GitHub code and attempted to recalculate h.hat and ystar using the combined posterior samples for r, lambda, and sigsq.eps together with the complete dataset, using the Vcomps, h.update, and ystar.update functions. Unfortunately, the computational demands, especially the Cholesky decomposition in Vcomps and the matrix multiplications in h.update, are causing significant delays. With my large dataset, even a single posterior sample takes an impractical amount of time, making it infeasible to recalculate h and ystar this way.

Currently, I am considering simply cbind-ing the h.hat posterior samples from the different cohorts and doing the same for ystar. I would appreciate your insights on how much the estimates of h and ystar might deviate under this approach.
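Concretely, continuing from the cohort fits above, something like this (assuming h.hat and ystar are stored as iterations-by-observations matrices in each fit):

```r
# Column-bind the post-burn-in h.hat (and ystar) samples across cohorts,
# giving matrices that cover all observations but are based only on
# within-cohort information.
keep <- 15001:30000
h_hat_comb <- do.call(cbind, lapply(fits, function(f) f$h.hat[keep, ]))
ystar_comb <- do.call(cbind, lapply(fits, function(f) f$ystar[keep, ]))
```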

By combining the samples in this manner, I am also aware that the number of posterior samples for h.hat and ystar will be notably smaller than for the other parameters, since those are stacked across cohorts rather than column-bound. If I proceed with this combination and use your summary functions, do you foresee any issues arising from the disparate posterior sample sizes?

Thank you for taking the time to consider and respond to my inquiry.

zyddmn commented 3 months ago

I also encountered the same issue. When I run data with 1,000 subjects and 5 repeated measures each, the package cannot finish. When I reduce the number of subjects, it runs successfully. I first thought it was a memory issue, but when I tried the package on an HPC node with a large memory allocation, the code still did not finish.