lmc2179 / bayesian_bootstrap

bayesian bootstrapping in python
MIT License

Running out of memory during creation of weights for bootstrap replications #1

Closed roya0045 closed 7 years ago

roya0045 commented 7 years ago

Greetings,

I'm trying to use the package (installed via pip on Python 3.5).

When executing the fit function, I get a memory error during the weight creation step (line 95 in bootstrap.py).

Here is the error (yes, there is nothing after MemoryError:):

    mtrand.pyx in mtrand.RandomState.dirichlet (numpy\random\mtrand\mtrand.c:36817)()

    MemoryError:

lmc2179 commented 7 years ago

Strange! Would you post the code, and if possible some of the data you're using when you call the fit function?

roya0045 commented 7 years ago

I think I'm running out of memory, but I don't have much else running while doing my tests, and I have 4 GB of RAM on this setup.

The bootstrap implementation is part of a class:

    # Assumed imports: import sklearn.linear_model as slm
    #                  import bayesian_bootstrap.bootstrap as bb
    #                  from random import randint
    def gen_boostr(self, data=None, size=None, resamp=None):
        # Candidate estimators; self.b_method picks one of them
        mel = [slm.LinearRegression,
               slm.Lars, slm.BayesianRidge,
               slm.Lasso, slm.ElasticNet,
               slm.LogisticRegression,
               slm.LassoLars,
               slm.PassiveAggressiveRegressor,
               slm.Perceptron,
               slm.Ridge,
               slm.SGDRegressor]
        methodd = mel[self.b_method % len(mel)]
        if self.bootstrap is None:
            # Default arguments can't reference other arguments, so resolve them here
            if size is None:
                size = self.x.shape[0] if data is None else data.shape[0]
                print(size)
            if resamp is None:
                resamp = randint(1, 10)  # random resample size; avoid 0
            self.bootstrap = bb.BayesianBootstrapBagging(methodd(), size, resamp)
        if data is None:
            self.bootstrap.fit(X=self.x, y=self.y)
        else:
            self.bootstrap.fit(X=data[:, :-1], y=data[:, -1])

Currently, LinearRegression is the method in use.

The dataset is this solar radiation dataset.

I have isolated only the columns Pressure (x) and Speed (y): pandas read_csv -> data = csv[["Pressure", "Speed"]]; in the init function, self.x = data.iloc[:, :-1] and self.y = data.iloc[:, -1].
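Roughly, the loading looks like this (the file name is an assumption; self.x and self.y shown as plain variables):

    import pandas as pd

    csv = pd.read_csv("SolarPrediction.csv")   # file name assumed, may differ
    data = csv[["Pressure", "Speed"]]
    x = data.iloc[:, :-1]                      # Pressure, a one-column DataFrame (self.x)
    y = data.iloc[:, -1]                       # Speed, a Series (self.y)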

I don't understand why I would get an out-of-memory error; I also get one using a Gaussian process when the empty array creator is called.

roya0045 commented 7 years ago

I also have a question that you might be able to answer. The goal of this model is to predict the error of the output of a machine learning algorithm. Considering the error range we can have for each X (predicted value) and the error associated with the prediction, I want to build a model using the Bayesian approach (notably Bayesian bootstrapping) to obtain the prediction lines for each prediction. Then I want to take a cross-section of that plot to obtain an error range for a given prediction (the equivalent of plotting the density function of the errors for a given predicted value).

Do you think Bayesian bootstrapping is flexible enough to give good results?

lmc2179 commented 7 years ago

Hm, it looks like you're running out of memory when you try to run the first part of the sampling process - there may be a workaround, but I should probably add an enhancement that uses constant memory at the expense of speed. I'll get back to you about the best way to do the workaround.

Yes, I think that should work - if I understand you correctly, you're looking for confidence intervals around E[y|X] for a bunch of different X values, similar to the regression modeling example in the README.
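If it helps, here's the basic idea from scratch with plain numpy and scikit-learn (not this library's API, and using weighting rather than resampling): fit one weighted model per Dirichlet draw and take percentiles of the predictions over a grid of X values.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    np.random.seed(0)
    X = np.random.uniform(0, 10, size=(200, 1))          # toy data, just to show the idea
    y = 2.0 * X[:, 0] + np.random.normal(0, 1, size=200)

    x_grid = np.linspace(0, 10, 50).reshape(-1, 1)
    preds = []
    for _ in range(500):
        w = np.random.dirichlet([1] * len(X))            # one Bayesian bootstrap weight vector
        m = LinearRegression().fit(X, y, sample_weight=w)
        preds.append(m.predict(x_grid))

    preds = np.array(preds)
    # 95% interval around E[y|X] at each grid point
    lower, upper = np.percentile(preds, [2.5, 97.5], axis=0)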

roya0045 commented 7 years ago

Exactly! Though the issue is that linearity constraints may not be the best fit, but it's a start.

Thanks for the fast replies. If you need help, I can try to contribute in Python.

lmc2179 commented 7 years ago

Sure - I'm a little busy until Wednesday, but I can give you a place to get started if you want to try a fix yourself. If you end up implementing a fix, feel free to run the unit tests and open a pull request; I'd happily take your contribution on board.

The issue, as you've pointed out, is at weights = np.random.dirichlet([1]*len(X), n_replications). That line creates a big matrix with weights for every data point in every replication. If you don't pre-generate those weights, but instead create a vector for each replication inside the for loop that follows, you will use less memory (though it will be slower, since you make repeated calls to dirichlet). Basically, that means making many calls that look like np.random.dirichlet([1]*len(X)), rather than one big call.
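Roughly, the difference looks like this (a sketch with made-up sizes, not the exact library code):

    import numpy as np

    X = np.random.rand(30000, 2)     # stand-in for your dataset
    n_replications = 100

    # Current approach: one (n_replications x len(X)) float64 matrix up front,
    # which is what exhausts memory for large datasets:
    # weights = np.random.dirichlet([1] * len(X), n_replications)

    # Lower-memory alternative: draw one weight vector per replication inside the loop
    for _ in range(n_replications):
        w = np.random.dirichlet([1] * len(X))   # weights for a single replication
        # ... resample according to w and fit one model here ...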

I'm also eventually going to implement a linear regression method, like the mean, var and covar methods, which will be able to do linear regression without too much painful resampling. The resampling approach is very general, but pretty inefficient, unfortunately.
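For reference, this is the kind of thing I have in mind - solving the weighted least-squares problem in closed form for each Dirichlet draw, so no resampling at all (a rough sketch, not a final API):

    import numpy as np

    np.random.seed(0)
    X = np.random.normal(size=(100, 2))
    y = X.dot(np.array([1.5, -0.5])) + np.random.normal(0, 0.1, size=100)

    X1 = np.hstack([np.ones((len(X), 1)), X])       # add an intercept column
    coef_samples = []
    for _ in range(1000):
        w = np.random.dirichlet([1] * len(X))       # one Bayesian bootstrap weight vector
        # Weighted least squares: solve (X' W X) beta = X' W y
        beta = np.linalg.solve(X1.T.dot(w[:, None] * X1), X1.T.dot(w * y))
        coef_samples.append(beta)
    coef_samples = np.array(coef_samples)           # posterior samples of the coefficients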

roya0045 commented 7 years ago

I've implemented a low-memory alternative in my branch, but I'm not sure how to do the testing; I've executed the modified test_bootstrap.py and got errors. The loop is set up, and I have added a variable to the bootstrap method, the bagging method, and the class. I have also added an if __name__ == '__main__' block to the test unit.
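(Concretely, the addition to test_bootstrap.py is just something like this, so the tests can be run directly:)

    if __name__ == '__main__':
        import unittest
        unittest.main()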

Though when I run the tests, I get the following errors when the low-memory loops are called:

    sample_index = np.random.choice(range(len(X)), p=w, size=resample_size)
    File "mtrand.pyx", line 1400, in mtrand.RandomState.choice (numpy\random\mtrand\mtrand.c:16557)
    TypeError: object of type 'numpy.float64' has no len()

and

    resample_i = np.random.choice(range(len(X_arr)), p=w, size=resample_size)
    File "mtrand.pyx", line 1400, in mtrand.RandomState.choice (numpy\random\mtrand\mtrand.c:16557)
    TypeError: object of type 'numpy.float64' has no len()

I'm not familiar enough with the internals of the program, so I'll await your feedback before altering anything to fix this.

lmc2179 commented 7 years ago

Hm. Okay - is the code in your forked version up to date? I'll take a look at it when I get the chance.

roya0045 commented 7 years ago

Yes, my fork was created one or two days ago, so the code should be up to date.

I'll await further instructions.

lmc2179 commented 7 years ago

Okay, I have a quick fix that should get you past that line, at least. Replace the problematic line with:

    weights = (np.random.dirichlet([1]*len(X)) for _ in range(n_replications))

That generates the weights lazily, instead of generating them all in memory at once. The original line can be a memory hog - try this and we'll see if the problem goes deeper than that.
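To be explicit about how that slots in (sketch only): the loop that consumes the weights shouldn't need to change, because iterating over the generator yields the same per-replication vectors as iterating over the rows of the old pre-generated matrix.

    import numpy as np

    X = np.random.rand(30000, 2)     # stand-in for the real data
    n_replications = 100

    # Lazy: no weight vector is drawn until the loop below asks for it
    weights = (np.random.dirichlet([1] * len(X)) for _ in range(n_replications))

    for w in weights:
        # each w is one replication's weight vector of length len(X)
        assert w.shape == (len(X),)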

roya0045 commented 7 years ago

All tests passed.

lmc2179 commented 7 years ago

Okay, great - and it works with your dataset too? If so, I'll add an option for that in the next release.


roya0045 commented 7 years ago

It seems to be running, though generating the Dirichlet weights is quite slow. I didn't hit any issue; I'll send the pull request your way.

Edit: At first the testing was quite slow, but I had inverted two variables when creating the bagging instance, so the number of replications was 32000+; that's why it was running slowly.

lmc2179 commented 7 years ago

This issue has been addressed with the acceptance of your pull request. Thanks again for your help! If you end up finding something interesting with your analysis and want to provide some code in a markdown file or IPython notebook, I'd be happy to add it as a case study for this library - just put in another pull request.

roya0045 commented 7 years ago

I'm currently fiddling with swarm optimization, but as soon as that is finished I'll try to take a look at the source code and do some testing to see if other things can be optimized, or to provide some examples. I won't guarantee anything, though.

As stated previously, if you already have something in mind that could be improved for better performance or more flexibility, feel free to tell me and I'll try to come up with something!