Closed · roya0045 closed 7 years ago

Greetings,
I'm trying to use the model (downloaded from pip on Python 3.5).
When executing the fit function, I get a memory error during the weight-creation process (line 95 in bootstrap.py).
Here is the error (yes, there is nothing after `MemoryError:`):

```
mtrand.pyx in mtrand.RandomState.dirichlet (numpy\random\mtrand\mtrand.c:36817)()
MemoryError:
```
Strange! Would you post the code, and if possible some of the data you're using when you call the fit function?
I think I'm running out of memory, but I don't have much else running when I'm doing my tests, and I have 4 GB on this setup.
The bootstrap implementation is part of a class:
```python
import sklearn.linear_model as slm
import bayesian_bootstrap.bootstrap as bb
from random import randint

def gen_boostr(self, data=None, size=None, resamp=None):
    # Candidate base learners; self.b_method picks one by index.
    mel = [slm.LinearRegression,
           slm.Lars, slm.BayesianRidge,
           slm.Lasso, slm.ElasticNet,
           slm.LogisticRegression,
           slm.LassoLars,
           slm.PassiveAggressiveRegressor,
           slm.Perceptron,
           slm.Ridge,
           slm.SGDRegressor]
    methodd = mel[self.b_method % len(mel)]
    # Defaults that depend on other arguments (or on randomness) have to be
    # resolved at call time; putting them in the signature raises at def time
    # (size=data.shape[0]) or is evaluated only once (resamp=randint(0, 10)).
    if size is None:
        size = self.x.shape[0] if data is None else data.shape[0]
    if resamp is None:
        resamp = randint(0, 10)
    if self.bootstrap is None:
        print(size)
        self.bootstrap = bb.BayesianBootstrapBagging(methodd(), size, resamp)
    if data is None:
        self.bootstrap.fit(X=self.x, y=self.y)
    else:
        self.bootstrap.fit(X=data[:, :-1], y=data[:, -1])
```
Currently LinearRegression is the method in use.
The dataset is this solar radiation dataset.
I have isolated only the columns Pressure (x) and Speed (y): pandas `read_csv`, then `data = csv[["Pressure", "Speed"]]`; in the init function, `self.x = data.iloc[:, :-1]` and `self.y = data.iloc[:, -1]`.
I don't get why I would run out of memory. I also get a memory error using Gaussian processes, when the empty-array creator is called.
I also have a question that you might be able to answer. The goal of this model is to predict the error of an output of a machine learning algorithm. Considering the error range that we can have for each X (predicted value) and the error associated with the prediction, I want to build a model using a Bayesian approach (notably Bayesian bootstrapping) to obtain the prediction lines for each prediction. Then I want to take a cross-section of that plot to obtain an error range for a given prediction (the equivalent of plotting the density function of the errors for a given predicted value).
Do you think Bayesian bootstrapping is flexible enough to give good results?
Hm, it looks like you're running out of memory when you try to run the first part of the sampling process - there may be a workaround, but I should probably add an enhancement that uses constant memory at the expense of speed. I'll get back to you about the best way to do the workaround.
Yes, I think that should work - if I understand you correctly, you're looking for confidence intervals around E[y|X] for a bunch of different X values, similar to the regression modeling example in the README.
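For concreteness, roughly the shape of that analysis (a sketch with placeholder data, not the README verbatim; `base_models_` is the attribute name I'm assuming for the fitted ensemble):

```python
import numpy as np
import bayesian_bootstrap.bootstrap as bb
from sklearn.linear_model import LinearRegression

# X, y: training data; X_new: points to predict at (placeholders).
m = bb.BayesianBootstrapBagging(LinearRegression(), 100, 1000)
m.fit(X, y)

# One prediction per ensemble member; the 2.5th/97.5th percentiles then give
# a central 95% interval around E[y|X] at each new point.
preds = np.array([model.predict(X_new) for model in m.base_models_])
lower, upper = np.percentile(preds, [2.5, 97.5], axis=0)
```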
Exactly! Though the issue is that linearity constraints may not be the best fit; it's a start, though.
Thanks for the fast replies. If you need help, I can try to help in Python.
Sure - I'm a little busy until Wednesday, but I can give you a place to get started if you want to try a fix yourself. If you end up implementing a fix, feel free to run the unit tests and open a pull request, I'd happily take your contribution on board.
The issue, as you've pointed out, is at `weights = np.random.dirichlet([1]*len(X), n_replications)`. That line creates a big matrix with weights for every datapoint in every replication. If you don't pre-generate those weights, but instead create one vector per replication inside the `for` loop that follows, you should use less memory (though it will be slower, as you make repeated calls to `dirichlet`). Basically, that means making many calls that look like `np.random.dirichlet([1]*len(X))`, rather than one big call.
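In other words, something like this (a rough sketch; `resample_size` and the loop body are assumed from bootstrap.py):

```python
# Before: one big (n_replications, len(X)) matrix of weights, all in memory.
weights = np.random.dirichlet([1] * len(X), n_replications)

# After: one length-len(X) weight vector per replication, so memory use stays
# constant in n_replications, at the cost of repeated dirichlet calls.
for _ in range(n_replications):
    w = np.random.dirichlet([1] * len(X))
    resample_i = np.random.choice(len(X), p=w, size=resample_size)
    # ...fit/evaluate on the resample, as the existing loop body does...
```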
I'm also eventually going to implement a linear regression method, like the `mean`, `var`, and `covar` methods, which will be able to do linear regression without too much painful resampling. The resampling method is very general, but pretty inefficient, unfortunately.
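Rough idea for that, as an untested sketch (not what will ship): reweight a closed-form weighted least squares fit with fresh Dirichlet weights per replication, so no resampled datasets are ever materialized.

```python
import numpy as np

def dirichlet_weighted_linear_regression(X, y, n_replications):
    """Posterior samples of regression coefficients via Dirichlet reweighting."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coefs = []
    for _ in range(n_replications):
        w = np.random.dirichlet([1] * len(X))
        # Weighted least squares, (X'WX) beta = X'Wy, without building the
        # diagonal weight matrix explicitly.
        beta = np.linalg.solve(X1.T @ (w[:, None] * X1), X1.T @ (w * y))
        coefs.append(beta)
    return np.array(coefs)  # shape (n_replications, n_features + 1)
```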
I've implemented a low-memory alternative in my branch, but I'm not sure how to do the testing: I executed the modified test_bootstrap.py and got errors. The loop is set up, and I have added a variable to the bootstrap method, the bagging method, and the class. I have also added an `if __name__ == '__main__'` block to the test unit.
When I run the tests, I get the following errors where the low-memory loops are called:
```
sample_index = np.random.choice(range(len(X)), p=w, size=resample_size)
  File "mtrand.pyx", line 1400, in mtrand.RandomState.choice (numpy\random\mtrand\mtrand.c:16557)
TypeError: object of type 'numpy.float64' has no len()
```

and

```
resample_i = np.random.choice(range(len(X_arr)), p=w, size=resample_size)
  File "mtrand.pyx", line 1400, in mtrand.RandomState.choice (numpy\random\mtrand\mtrand.c:16557)
TypeError: object of type 'numpy.float64' has no len()
```
I'm not familiar enough with the internals of the program. I'll await your feedback before altering anything to fix this.
Hm. Okay - the code in your forked version is up to date? I'll take a look at it when I get the chance.
Yes, my fork was created one or two days ago, so the code should be up to date.
I'll await further instructions.
Okay, I have a quick fix that should get you past that line, at least. Replace the problematic line with:
```python
weights = (np.random.dirichlet([1]*len(X)) for _ in range(n_replications))
```
That should generate the weights lazily, instead of materializing them in memory all at once; that line can be a memory hog. Try it and we'll see if the problem goes deeper than that.
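The downstream loop then pulls one vector at a time; each `w` is a full probability vector of length `len(X)`, whereas the TypeError above suggests `w` had collapsed to a single float64:

```python
weights = (np.random.dirichlet([1] * len(X)) for _ in range(n_replications))
for w in weights:
    # w.shape == (len(X),); a scalar here would reproduce the TypeError above.
    resample_i = np.random.choice(len(X), p=w, size=resample_size)
```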
All tests passed.
Okay, great - and it works with your dataset too? If so, I'll add an option for that in the next release.
It seems to be running, though quite slowly when making the Dirichlet draws. I didn't hit an issue; I'll send the pull request your way.
Edit: at first the testing was quite slow, but I had inverted two variables when creating the bagging instance, so the number of replications was 32,000+; that's why it was running slowly.
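For reference, the argument order I had wrong, as I understand the constructor:

```python
# Assumed signature: BayesianBootstrapBagging(base_learner, n_replications, resample_size).
# I had passed the dataset size (32,000+) where n_replications goes, which is
# why every fit ran tens of thousands of replications.
m = bb.BayesianBootstrapBagging(LinearRegression(), n_replications, resample_size)
```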
This issue has been addressed with the acceptance of your pull request. Thanks again for your help! If you end up finding something interesting with your analysis and want to provide some code in a markdown file or IPython notebook, I'd be happy to add it as a case study for this library; just put in another pull request.
I'm currently fiddling with swarm optimization, but as soon as that is finished I'll try to take a look at the source code and do some testing, to see whether other things can be optimized or to provide some examples. I won't guarantee anything, though.
As stated previously, if you already have something in mind that could be improved for better performance or more flexibility, feel free to tell me and I'll try to come up with something!