kapelner / bartMachine

An R-Java Bayesian Additive Regression Trees implementation
MIT License

Predictions on a large dataset are very slow #34

Closed bakaburg1 closed 2 years ago

bakaburg1 commented 4 years ago

Hello,

First of all, many compliments on bartMachine, a really nice implementation of a wonderful algorithm.

I am using BART for potential-outcomes causal inference, which is based on comparing observation-level predictions of Yhat obtained after assigning specific values to a variable X for all observations while keeping the other covariates Z fixed at their original values. (https://nyuscholars.nyu.edu/en/publications/bayesian-nonparametric-modeling-for-causal-inference)

The problem is that my dataset is very large [27358 x 224] and predictions made with bart_machine_get_posterior simply take forever. Since I'll need to do this for 224 variables, each with multiple evaluated values, the analysis would take days.

Reading through the issues, I saw that the array version of BART would fix memory problems, but would it also help with speed? Is there any setting in bartMachine that would make predictions faster? Considering that my problem is prediction time rather than estimation time (fitting the model takes ~30 mins), is there a way to shift the balance toward the former?

The only alternative solution I could think of is to build the model on 2/3 of the dataset and estimate the variable effects on the other third.

Here are the arguments I use for the model:

bartMachine(X = X, y = Y,
            verbose = TRUE,
            num_trees = 200,
            num_iterations_after_burn_in = 5000,
            run_in_sample = FALSE,
            mem_cache_for_speed = FALSE, # otherwise it crashes
            use_missing_data = TRUE,
            serialize = save)

And this is the code I use to estimate the Individual Treatment Effect (maybe some speedup is possible here too):

# Required packages: dplyr, glue, pbapply, magrittr, tictoc,
# plus bartMachine itself.
library(dplyr)
library(glue)
library(pbapply)
library(magrittr)
library(tictoc)
library(bartMachine)

compute_BART_ITE <- function(bart.mod, data = NULL, vars = NULL, quants = c(.1, .3, .5, .7, .9)) {

    # Default to all training covariates and the training data itself.
    if (is.null(vars)) vars <- bart.mod$X %>% colnames()

    data <- if (is.null(data)) bart.mod$X else data %>% select(any_of(vars))

    lapply(vars, function(V) {
        print(glue("{match(V, vars)}/{length(vars)}: {V}"))

        # Numeric variables with many distinct values are evaluated at quantiles;
        # everything else at each observed value.
        if (n_distinct(data[[V]]) > 5 && is.numeric(data[[V]])) {
            pred.val <- quantile(data[[V]], quants, na.rm = TRUE) %>% sort %>% signif(3)
        } else pred.val <- unique(data[[V]]) %>% sort

        # Reference posterior: all observations set to the lowest evaluated value of V.
        data[[V]] <- pred.val[1]
        tictoc::tic('Computed reference matrix')

        ref.matrix <- bart_machine_get_posterior(bart.mod, new_data = data)$y_hat_posterior_samples

        tictoc::toc()

        # Posterior log-ratio of each counterfactual value against the reference.
        pblapply(pred.val[-1], function(val) {
            data[[V]] <- val

            log(bart_machine_get_posterior(bart.mod, new_data = data)$y_hat_posterior_samples) - log(ref.matrix)
        }) %>% magrittr::set_names(paste(pred.val[-1], 'vs', pred.val[1]))

    }) %>% magrittr::set_names(vars)
}
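For context, I call it roughly like this (the variable names "age" and "dose" below are only hypothetical placeholders, not from my real dataset):

# Posterior log-ratio ITE matrices for two covariates, evaluated at the default quantiles.
ite_list <- compute_BART_ITE(bart.mod, vars = c("age", "dose"))  # hypothetical variable names

# ite_list$age is a named list of posterior-sample matrices,
# one per "value vs reference" contrast.
str(ite_list$age, max.level = 1)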
kapelner commented 4 years ago

Hi Angelo,

Try these two simple things first: (1) Remove any magrittr piping inside the loop (they are slow). (2) Use "set_bart_machine_num_cores" to max out your CPUs. If you have more money than time, BART's prediction should be very close to being embarrassingly parallel, so an investment in CPUs may be a good idea, e.g., AWS's compute-optimized servers with 64 vCPUs could be a good experiment.
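A minimal sketch of point (2), assuming an 8-core machine (adjust the count to your hardware):

library(bartMachine)

# Let the Java backend use (almost) all physical cores; building the model
# and posterior prediction should then both run in parallel.
set_bart_machine_num_cores(8)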

If not, then yes, you should be splitting the data, building a model on each split, and then predicting with all models and averaging. Your accuracy will suffer, especially if the model truly has 224 important variables, but I imagine splitting it down to 3,000 rows or so won't affect accuracy too much. I can finish that bart array function for you if you wish.
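A rough sketch of that splitting idea (the ~3,000-row chunk size and the averaging of predictions are just illustrative, not a built-in feature):

# Split the training data into chunks of roughly 3,000 rows,
# build one bartMachine per chunk, then average the predictions.
n_chunks <- ceiling(nrow(X) / 3000)
chunk_id <- sample(rep(seq_len(n_chunks), length.out = nrow(X)))

models <- lapply(seq_len(n_chunks), function(k) {
  idx <- chunk_id == k
  bartMachine(X = X[idx, , drop = FALSE], y = Y[idx],
              num_trees = 200, run_in_sample = FALSE,
              use_missing_data = TRUE)
})

# Average the posterior mean predictions of the sub-models on new data.
pred_matrix <- sapply(models, function(m) predict(m, new_data))
y_hat_avg <- rowMeans(pred_matrix)

This trades some accuracy for memory and speed, since each sub-model only sees a fraction of the rows.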

bakaburg1 commented 4 years ago

Dear Adam, thank you for your suggestions!

So there isn't any setting in the initial bartMachine call to influence this tradeoff?

I really should look into AWS, but I'm quite short on time, so I cannot investigate how to move my whole project there. At the moment I am using 7 out of 8 of my cores, for fear of freezing the computer if I do anything else (I work on my laptop, I know, not a very good idea), but I'll try with 8. The solution I'm following now is to bootstrap the data and use it as the training sample, and then use the remaining 36.8% out-of-sample observations as the test set, to decrease the number of observations to predict on. Let's see how it works.
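In code, what I have in mind is roughly this (just a sketch of the bootstrap / out-of-bag idea, not tested):

# Train on a bootstrap resample (~63.2% unique rows) and predict only on
# the ~36.8% of observations that were never drawn (the out-of-bag set).
boot_idx <- sample(nrow(X), replace = TRUE)
oob_idx  <- setdiff(seq_len(nrow(X)), unique(boot_idx))

bart.mod <- bartMachine(X = X[boot_idx, , drop = FALSE], y = Y[boot_idx],
                        num_trees = 200, run_in_sample = FALSE,
                        use_missing_data = TRUE)

oob_post <- bart_machine_get_posterior(bart.mod,
                                       new_data = X[oob_idx, , drop = FALSE])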

kapelner commented 4 years ago

Not at the moment... but if you really just need to (a) build the model and (b) predict, then I can code up an array feature really quickly. That would decrease both the memory and CPU load (but you'd pay in accuracy).

You can also just train on a subset and predict on a subset. What does your problem require? How many predictions?

bakaburg1 commented 4 years ago

Hello, sorry for the late reply (I'm really tight on the deadline for this project). The aim is a causal analysis over a couple hundred variables, estimating the Individual Treatment Effect first, and from this the Average Treatment Effect and the Heterogeneous Treatment Effect. In the end I employed a series of tricks to make it feasible on my computer:

kapelner commented 2 years ago

Should be much faster now with v1.3.