kapelner / bartMachine

An R-Java Bayesian Additive Regression Trees implementation
MIT License

Thinning? #1

Closed: dchudz closed this issue 10 years ago

dchudz commented 10 years ago

Have you thought about implementing thinning of the iterations? I'm thinking this would be a parameter that lets you specify that only every nth iteration (e.g. n = 10) is saved after burn-in.
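
To make the interface I have in mind concrete, here's a rough sketch; keep_every is purely a hypothetical argument name (not something bartMachine currently offers), and X and y stand in for whatever data is being used:

library(bartMachine)
bm <- bartMachine(X, y,
                  num_burn_in = 250,
                  num_iterations_after_burn_in = 10000,
                  keep_every = 10)  # hypothetical: retain only every 10th post-burn-in iteration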

I've only played with your implementation a little bit, but in a toy example, 10k post-burn-in iterations got me an effective sample size of around 250. I'd be almost as well off (and my memory would be much better off) keeping every 10th iteration.

(P.S. Thanks for doing this. I'm really excited about bartMachine!)

kapelner commented 10 years ago

Hi,

Thanks for the suggestion. I'm curious what you mean by "effective sample size of around 250."

Of course, you already have the ability to thin: we return the raw posterior samples to R, so they can be thinned via something like:

post_samples[seq(1, n, by = 10), ]
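
For instance, a minimal sketch of thinning the sigma-squared draws after the fact, where bm stands for a fitted bartMachine model:

sigsqs <- get_sigsqs(bm)                                   # post-burn-in sigma^2 draws
sigsqs_thinned <- sigsqs[seq(1, length(sigsqs), by = 10)]  # keep every 10th draw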

Thinning and autocorrelation were never considered to be a problem in BART; this is something we went back and forth about with the original authors, but we would like to hear your thoughts.

Adam

dchudz commented 10 years ago

Hi Adam--

Thanks for the reply. I'm trying to save the memory taken up by the trees, so thinning after getting the posterior samples doesn't help me. To be more explicit: there seems to be a lot of autocorrelation, so I need to run a lot of iterations, and the memory needed to hold the fitted BART models becomes the limiting factor. The kind of thinning I'm imagining would happen in the Java code, keeping the set of model parameters (trees and leaves) only every nth iteration.

When I say "effective sample size", I mean the sample size adjusted for autocorrelation (based on any of the parameters, e.g. sigma or the prediction for one test case). I'm using the effectiveSize function in the coda package (which has a lot of useful functions for MCMC output) for that. I'm also using Andrew Gelman's "Rhat" statistic (which compares between-chain and within-chain variance, as implemented in coda) to assess convergence and mixing.
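
Concretely, the coda calls I have in mind, given an mcmc.list of post-burn-in chains (chains here is a placeholder; sigmaByChain below is an example), are:

library(coda)
effectiveSize(chains)  # autocorrelation-adjusted sample size, summed across chains
gelman.diag(chains)    # numeric Rhat (potential scale reduction factor); gelman.plot shows how it evolves with iteration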

I'm pasting below some code and output from a small example with 30 simulated features (whose true relationship to the target is linear). I believe it shows that the autocorrelation is large enough to explain why I want to run a large number of iterations, and hence why I want to save memory by not keeping all of them in the fitted model.

library(plyr)
library(bartMachine)
library(coda)

set.seed(0)
nTest <- 20
nTrain <- 400
sigmaTrue <- 3
isTrain <- sample(c( rep(TRUE, nTrain), rep(FALSE,nTest)))
nInputs <- 30

simulatedFeatures <- data.frame(llply(1:nInputs, function(i) rnorm(nTrain + nTest)))
mainEffectCoefficients <- rnorm(nInputs) # randomly chosen coefficients for main effect for each input (there will be no interactions)
Ey <- rep(0, nTrain + nTest) # will hold E[y], the expected value of y given the features and coefficients (built up below)
for (i in 1:nInputs) {
  Ey <- Ey + simulatedFeatures[[i]]*mainEffectCoefficients[i]
}
y <- Ey + rnorm(nTrain+nTest, sd=sigmaTrue)

numIterationsAfterBurnIn <- 1500
numBurnIn <- 250
nChains <- 5

bartChains <- lapply(1:nChains,function(i) bartMachine(
  X = simulatedFeatures[isTrain,],
  y = y[isTrain],
  num_iterations_after_burn_in=numIterationsAfterBurnIn,
  num_burn_in = numBurnIn # I'll discard initial iterations myself later
))

Convert the sigma posterior samples to an mcmc.list, allowing me to use coda's functions:

sigmaByChain <- as.mcmc.list(llply(bartChains, function(oneChain) mcmc(get_sigsqs(oneChain))))

Plotting sigma for each chain shows a fair amount of autocorrelation:

plot(sigmaByChain, main = "Sigma")

[screenshot: trace plots of sigma for each chain]

As we add iterations, Gelman's Rhat goes toward 1 (which it should):

gelman.plot(sigmaByChain, main="Gelman Rhat Plot (using last half of iterations)")

[screenshot: Gelman Rhat plot for sigma]

Compute sample size adjusting for autocorrelation:

effectiveSize(sigmaByChain)

(Effectively only about 156 samples out of the 5 × 1500 = 7500 post-burn-in draws, as you'd expect from how much autocorrelation we can see in the plot.)

Now I repeat the above for posterior predictions on one test case:

posteriorSamplesByChain <- llply(bartChains, function(oneChain) bart_machine_get_posterior(oneChain, simulatedFeatures[!isTrain,]))
posteriorSamplesOneTestCase <- as.mcmc.list(llply(posteriorSamplesByChain, 
                                                  function(oneChainPosteriorSamples) mcmc(oneChainPosteriorSamples$y_hat_posterior_samples[1,])))

plot(posteriorSamplesOneTestCase, main="One Test Case")

[screenshot: trace plots of the posterior samples for one test case]

As we add iterations, Gelman's Rhat goes toward 1 (which it should):

gelman.plot(posteriorSamplesOneTestCase, main="Gelman Rhat Plot (using last half of iterations)")

[screenshot: Gelman Rhat plot for one test case]

effectiveSize(posteriorSamplesOneTestCase)

Effectively only about 167 of the 7500 post-burn-in samples.

Thanks!

I apologize if I'm misunderstanding anything.

-David

P.S. I'm not necessarily asking for you to implement thinning, although that'd be great. I mainly want to communicate why I think it could be helpful. It's possible I or someone I work with would be able to contribute that, if you think it would be good.

kapelner commented 10 years ago

David,

Wow. You certainly have me convinced. You're right that this would have to be implemented on the Java side, and it will be a lot of work. I'm unsure whether point predictions would get much better, but credible and predictive intervals probably would. A workaround for now is to use many cores (which builds many chains) and increase your RAM. Thanks for this whole analysis!

Adam

jbleich89 commented 10 years ago

Hi David,

Thanks for the analysis! We really appreciate it.

Just to echo what Adam said, most people use BART for mean estimation only (i.e., treating the model as an algorithmic procedure rather than a Bayesian model), so this has not been much of an issue on that end. Unfortunately, BART is both computationally AND memory intensive, so thinning just trades time for memory. But I completely agree that it would be nice to offer users the option so they can make the choice based on their own resources. Certainly something to think about for the future. For now, try the "num_cores" feature in our package to see if it helps with autocorrelation by internally building separate chains and aggregating them.
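
For example, roughly (the core count and iteration settings here are just for illustration, and X and y stand in for your data):

library(bartMachine)
set_bart_machine_num_cores(4)  # each core builds its own Gibbs chain; the post-burn-in draws are pooled
bm <- bartMachine(X, y,
                  num_burn_in = 250,
                  num_iterations_after_burn_in = 1500)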

I saw that Gelman wrote about your new R package for comparing black-box models (congratulations on that!). We too have an interest in the interpretation and visualization of black-box models, and recently wrote a paper (to appear in JCGS) and an R package centered on extending Friedman's partial dependence plots. Perhaps you would find it useful for your work, or at least interesting in its own right (shameless plug, I know!).

The package is called ICEbox and the arXiv copy of the paper can be found at: http://arxiv.org/pdf/1309.6392v2.pdf.

Thanks again for your ideas and analysis! Have a great holiday!

Best, Justin


rsparapa commented 7 years ago

Thinning is definitely best done on the Java side. However, you can manage the memory by running bartMachine in a loop, thinning each run's posterior, and pulling it all together. It will not be as efficient, but if you are running out of RAM (like me ;o) then this might help. Here is a wrapper that I created for classification, which should give you an idea of what I am talking about.

pbartMachine <- function(x.train, y.train, x.test = matrix(0.0, 0, 0),
                         k = 2.0, power = 2.0, base = .95,
                         ntree = 200, numcut = 100,
                         ndpost = 1000, nskip = 250, keepevery = 1,
                         mc.cores = 1, seed = 99,
                         alpha_0 = 1, do_ard = 1, do_prior = 1)
{
    library(dartMachine)

    set_bart_machine_num_cores(mc.cores)

    n <- length(y.train)

    if(n != nrow(x.train))
        stop('The length of y.train and the number of rows in x.train must be identical')

    if(nrow(x.test) == 0)
        stop('To generate the posterior, you must specify x.test')

    post <- list()
    post$x.test <- x.test

    # indices of the draws to keep from each run: every keepevery-th iteration
    thin <- seq(keepevery, ndpost, by = keepevery)

    # run keepevery separate fits; each contributes ndpost/keepevery thinned draws,
    # so the combined posterior has ndpost draws in total
    for(h in 1:keepevery) {
        res <- bartMachine(
            X = as.data.frame(x.train), y = factor(y.train),
            num_trees = ntree, num_burn_in = nskip,
            num_iterations_after_burn_in = ndpost,
            alpha = base, beta = power, k = k,
            seed = seed + h, # vary the seed per run so the runs are not identical chains
            alpha_0 = alpha_0, do_ard = do_ard, do_prior = do_prior
        )

        # thinned posterior draws for the test set, transposed to (draws x test cases)
        prob.test <- bart_machine_get_posterior(res, as.data.frame(x.test))$y_hat_posterior_samples[ , thin]

        if(h == 1) {
            post$prob.test <- t(prob.test)
        } else {
            post$prob.test <- rbind(post$prob.test, t(prob.test))
        }

        # free the fitted model (and its Java-side trees) before the next run
        rm(res)
        gc()
    }

    post$prob.test.mean <- apply(post$prob.test, 2, mean)

    return(post)
}
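
A hypothetical call would look something like this, where x, y, and xtest are placeholders for your own training data, labels, and test matrix:

# keepevery = 5 means 5 separate bartMachine runs, each thinned to ndpost/5 draws,
# for ndpost thinned posterior samples in total
post <- pbartMachine(x.train = x, y.train = y, x.test = xtest,
                     ndpost = 1000, nskip = 250, keepevery = 5, mc.cores = 2)
dim(post$prob.test)        # ndpost rows, one column per row of xtest
head(post$prob.test.mean)  # posterior mean for each test case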