pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

Using pymc3 for Kaggle #1167

Closed AdityaGudimella closed 8 years ago

AdityaGudimella commented 8 years ago

Is anyone interested in using pymc3 for a Kaggle competition? It's a really good way to showcase the abilities of the library. The devs of the keras library did the same thing with keras. There's a good competition up on Kaggle right now: Grupo Bimbo Inventory Demand.

twiecki commented 8 years ago

Good idea, maybe some people here would like to collaborate?

twiecki commented 8 years ago

A Bayesian model could work quite well here. I'm sure there are hierarchies that could be exploited. Uncertainty is likely to play a big role too.
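
To make the hierarchy idea concrete, here's a toy partial-pooling calculation in pure NumPy (product names and numbers are made up): a product with few observations gets shrunk toward the global mean, while one with plenty of data mostly keeps its own mean. A hierarchical PyMC3 model would learn this pooling automatically.

```python
import numpy as np

# Toy weekly demand counts per product (hypothetical numbers).
demand = {
    "product_a": np.array([12.0, 15.0, 11.0, 14.0, 13.0, 12.0, 16.0, 15.0]),
    "product_b": np.array([40.0]),  # only one observation
}

global_mean = np.mean(np.concatenate(list(demand.values())))
k = 5.0  # prior strength: pseudo-observations pulling toward the global mean

def partial_pool(values, global_mean, k):
    """Shrink a group's mean toward the global mean, weighted by sample size."""
    n = len(values)
    return (n * values.mean() + k * global_mean) / (n + k)

for name, values in demand.items():
    print(name, values.mean(), partial_pool(values, global_mean, k))
```

The sparsely observed product is pulled strongly toward the global mean, which is exactly the regularization that tends to pay off on long-tailed demand data.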

AdityaGudimella commented 8 years ago

I'm willing to help out too. Let's form a team on Kaggle. I thought so too. A couple of years ago there was apparently a guy on Kaggle who used to do really well with Bayesian methods. We should try something out too. My email address is aditya.gudimella@gmail.com. Please mail me if you're interested in doing this competition together.

PaulSorenson commented 8 years ago

I would be interested. I have been off the grid regarding pymc3 for ages after relocating halfway around the world.

paulhendricks commented 8 years ago

I would be interested as well.

twiecki commented 8 years ago

I started a new repo and invited you all to the Kaggle group with write permissions: https://github.com/pymc-devs/kaggle_grupo

twiecki commented 8 years ago

Chat room: https://gitter.im/pymc-devs/kaggle_grupo

ewharton commented 8 years ago

I've been wanting to work with pymc3 for a while. Can I join also?

twiecki commented 8 years ago

@ewharton added

alexandrudaia commented 8 years ago

Count me in as well. If anyone is interested, drop me a message at alexandru130586@yandex.com.

twiecki commented 8 years ago

@alexandrudaia added.

alexandrudaia commented 8 years ago

Thanks!

beyhangl commented 8 years ago

Please add me: beyhangl@gmail.com

twiecki commented 8 years ago

Please join https://gitter.im/pymc-devs/kaggle_grupo

paulhendricks commented 8 years ago

@twiecki Was this group removed?

twiecki commented 8 years ago

Yes, we've moved to a private repo, check the gitter chatroom.

alexandrudaia commented 8 years ago

Hi all, I managed to see the private repo now. The question is: how can I enter the chat room for it? Thanks!

twiecki commented 8 years ago

Oops, I didn't realize I would delete the chatroom along with it. Can we add a new one for the new repo?

alexandrudaia commented 8 years ago

Yes. In case that's impossible, we can make a Slack channel for this issue.

twiecki commented 8 years ago

:+1:

alexandrudaia commented 8 years ago

Ok, so should I set up a Slack channel tomorrow?

twiecki commented 8 years ago

Yes.


alexandrudaia commented 8 years ago

I made a Slack channel. Please send your email address to alexandru130586@yandex.com so I can send out invitations.

twiecki commented 8 years ago

Seems like the Slack was deleted?

alexandrudaia commented 8 years ago

Guys, sorry for the late reply. It seems I meant to delete my first Bimbo channel and deleted this one accidentally. I will send invites as soon as I reach home. Sorry for the inconvenience.

walterreade commented 8 years ago

Hey Guys,

I worked on a team of 10 for the Kaggle DSBII challenge. It was very difficult to keep coordinated.

With that said, I'm very interested in learning how to use pymc3 for Kaggle. I tried a couple of times unsuccessfully (I never could get things to scale).

If there's a spot on the team, I'd be interested in joining.

The downside:

The upside:

It's a late request, so I'm totally cool if it's a no-go.

Walter (aka inversion)


alexandrudaia commented 8 years ago

Hello Mr. Inversion, I am having the same issue with using PyMC3 for Kaggle. In fact, I have not managed to make any Kaggle submission using PyMC3. I tried some submissions using the usual ML approaches, but with no success: I got stuck on the dimensionality of the data, which kept me from running experiments, and sampling gave me a bad score, 0.51 on local evaluation and 0.58 on the LB. I have ended up averaging some public scripts for now :(

fonnesbeck commented 8 years ago

The Kaggle effort appears to be up and running, so I'm closing this as an issue.

ghost commented 7 years ago

Hi, I have used PyMC3 + Bayesian logistic regression extensively on the NIH seizure prediction contest on Kaggle. Making it work was VERY difficult compared with using sk-learn's plain logistic regression, and Theano has several limitations, which I had reported.

I am willing to upload the notebook and collaborate on/improve the code.

I have started exploring Edward and Stan for Kaggle as well.
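
For context on what Bayesian logistic regression boils down to at the MAP point: with a Gaussian prior on the weights it is exactly L2-regularized logistic regression, which can be sketched in a few lines of NumPy/SciPy. This is only the optimization view on toy data (all names and numbers invented), not the full posterior PyMC3 samples.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Toy data standing in for the real features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (rng.uniform(size=200) < expit(X @ true_w)).astype(float)

def neg_log_posterior(w, X, y, sigma=1.0):
    """Negative log-posterior: Bernoulli likelihood + Gaussian prior (= L2 penalty)."""
    z = X @ w
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))  # y*z - log(1 + e^z)
    log_prior = -0.5 * np.sum(w**2) / sigma**2
    return -(log_lik + log_prior)

w_map = minimize(neg_log_posterior, np.zeros(3), args=(X, y)).x
print(w_map)
```

The sampling-based approaches add posterior uncertainty on top of this point estimate, which is where the extra difficulty (and the extra payoff) comes from.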

fonnesbeck commented 7 years ago

Thanks for the feedback. What in particular was making it difficult? Expressing the model in PyMC, or getting the model to fit? Happy to take a peek at the code.

ghost commented 7 years ago

See https://github.com/QuantScientist/kaggle-seizure-prediction-challenge-2016/blob/master/jupyter/bayesian-logistic-regression/ieegPymc3Version1.ipynb

fonnesbeck commented 7 years ago

Looks like the algorithm could not get started. I'd be interested to see whether it works with the current release and the NUTS sampler, which gets initialized using ADVI. It should be easy with such a simple model.

Were there missing data in either the predictors or the outcomes? That will cause problems if not explicitly dealt with.

ghost commented 7 years ago

Hi, thanks. There are no missing data at all, and everything is numerical; the data set was uploaded to GitHub (HDF format) if you want to inspect it. Other than that, I would love to know how to provide the output of a regular LogisticRegression() as the MAP starting weights. Next week I am presenting PyMC3 and PyStan at PyData Tel-Aviv; it would be great if you could go over the code, amend it, comment, etc. Thanks!
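
On feeding LogisticRegression() output in as starting weights, one approach is to fit sklearn first and hand its coefficients over as a start dict. This is a sketch on toy data: the dict keys are hypothetical and must match the variable names in your pymc3 model, and the pymc3 call itself is shown only in comments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real features (hypothetical shapes and names).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# The keys are hypothetical -- they must match the variable names in your
# pymc3 model (pm.glm.GLM, for instance, names the bias "Intercept").
start = {"Intercept": clf.intercept_[0], "beta": clf.coef_[0]}

# In pymc3 one would then pass this dict as the starting point, e.g.:
#     with model:
#         trace = pm.sample(start=start)
print(start)
```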

fonnesbeck commented 7 years ago

Do you have data you can share with me? I can tinker with it and see what a model can do.

ghost commented 7 years ago

Hi Chris, yes, the data is here: https://github.com/QuantScientist/kaggle-seizure-prediction-challenge-2016/tree/master/data/output

The training set is here: https://github.com/QuantScientist/kaggle-seizure-prediction-challenge-2016/blob/master/data/output/feat_train/train_allX_df_train.hdf It is based on features I generated from the IEEG signals.

For reference, there is also a full corresponding pipeline in XGBOOST here: https://github.com/QuantScientist/kaggle-seizure-prediction-challenge-2016/tree/master/jupyter/xgboost

If you amend anything, I will of course credit you.

ghost commented 7 years ago

@fonnesbeck please also see our previous discussion about this here: https://github.com/pymc-devs/pymc3/issues/840

fonnesbeck commented 7 years ago

@QuantScientist Have a look here. What I've done is:

Let me know if you have questions.

ghost commented 7 years ago

@fonnesbeck thanks a lot for the modifications! I used your revised version and uploaded a fresh notebook here: https://github.com/QuantScientist/kaggle-seizure-prediction-challenge-2016/blob/master/jupyter/bayesian-logistic-regression/ieegPymc3Version1.ipynb

However, I must be doing something wrong, since the score I get is quite bad compared to sk-learn logistic regression. (Screenshot: Melbourne University AES/MathWorks/NIH Seizure Prediction leaderboard score.)

Maybe I am doing something wrong with the intercept, since I was not using one before. Theta includes an Intercept term which was added by Chris, so we have to adjust for it here:

```python
def fastPredict(new_observation, theta):
    # theta[0] is the Intercept term added by Chris; the rest are the weights
    v = np.einsum('j,j->', new_observation, theta[1:])
    return expit(theta[0] + v)
```

fonnesbeck commented 7 years ago

That's interesting. I didn't do any optimization of the model, so perhaps it's not too surprising. I used an L1 norm with a particular (arbitrary) lambda of 0.1. What if you used the same setup as sklearn does -- it looks like ridge regression with a lambda of 0.01?
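
To make the penalty-to-prior mapping concrete: an L1 penalty with strength lambda corresponds to a Laplace prior with scale 1/lambda, and an L2 (ridge) penalty with strength lambda to a Normal prior with 1/(2*sigma^2) = lambda. A quick numerical check (toy weight vector, nothing from the actual model):

```python
import numpy as np
from scipy.stats import laplace, norm

w = np.array([0.3, -1.2, 0.7])  # toy weight vector

# L1 penalty with strength lam  <->  Laplace(0, scale=1/lam) prior
lam = 0.1
log_prior_l1 = laplace.logpdf(w, loc=0, scale=1 / lam).sum()
penalty_l1 = lam * np.abs(w).sum()
const_l1 = len(w) * np.log(lam / 2)  # additive constant, independent of w

# L2 penalty with strength lam2  <->  Normal(0, sigma) prior with 1/(2*sigma**2) = lam2
lam2 = 0.01
sigma = np.sqrt(1 / (2 * lam2))
log_prior_l2 = norm.logpdf(w, loc=0, scale=sigma).sum()
penalty_l2 = lam2 * np.sum(w**2)
const_l2 = -len(w) * 0.5 * np.log(2 * np.pi * sigma**2)

# Each log-prior equals the negative penalty up to its additive constant.
print(log_prior_l1, -penalty_l1 + const_l1)
print(log_prior_l2, -penalty_l2 + const_l2)
```

So matching sklearn's ridge setup would mean swapping the Laplace prior for a fairly wide Normal prior on the coefficients.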

ghost commented 7 years ago

Hi Thomas. I found that the test data set was not normalized, so I normalized it as you did for the training set. However, upon submission, the score remains the same.
The updated notebook is here: https://github.com/QuantScientist/kaggle-seizure-prediction-challenge-2016/blob/master/jupyter/bayesian-logistic-regression/ieegPymc3Version1.ipynb
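
One detail worth double-checking when normalizing: the test features should be standardized with the training set's mean and standard deviation, not with their own statistics, or the model sees the two sets on different scales. A minimal sketch with toy arrays (hypothetical names):

```python
import numpy as np

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 500.0]])

# Statistics come from the training set only, then are reused on the test set.
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0)

X_train_std = (X_train - mu) / sd
X_test_std = (X_test - mu) / sd
print(X_test_std)
```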

Thanks,