joshday / OnlineStats.jl

⚡ Single-pass algorithms for statistics
https://joshday.github.io/OnlineStats.jl/latest/
MIT License
835 stars · 63 forks

High level model builder #22

Closed · tbreloff closed this issue 8 years ago

tbreloff commented 9 years ago

Sometimes it can be a daunting task to select the appropriate model for a given dataset. It would be great to provide a helper framework (possibly a separate package or business) that could help choose or set up the best model given high-level information about the dataset:

I've seen many graphics which essentially create a decision tree given lots of high-level information about a data problem, and point you at the right solution type (linear regression vs logistic regression vs dimensionality reduction vs SVM vs random forests vs ???). I think this is something that could be implemented alongside an ensemble framework which could choose lots of candidate models for you and drop/average/vote on the best predictions. In the online setting, ensembles could be relatively cheap, even for large datasets (especially if the online algorithm allows for parallel fitting)
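The online-ensemble idea above can be sketched in a few lines. Here is a hypothetical illustration (none of these names are OnlineStats API): an ensemble that updates every candidate model on each observation in a single pass and forms a loss-weighted average prediction.

```julia
# Sketch (hypothetical API, not actual OnlineStats code): each candidate
# is updated on every observation; predictions are averaged with weights
# that shrink as a model's running loss grows.

mutable struct OnlineMean          # a trivial candidate "model"
    μ::Float64
    n::Int
end
fit_one!(m::OnlineMean, y) = (m.n += 1; m.μ += (y - m.μ) / m.n)
predict_one(m::OnlineMean) = m.μ

mutable struct Ensemble
    models::Vector{OnlineMean}
    losses::Vector{Float64}        # running squared error per model
end

function fit_one!(e::Ensemble, y)
    for (i, m) in enumerate(e.models)
        e.losses[i] += (predict_one(m) - y)^2  # score before updating
        fit_one!(m, y)                         # single-pass update
    end
end

function predict_one(e::Ensemble)
    w = 1 ./ (1 .+ e.losses)       # lower loss → higher weight
    w ./= sum(w)
    sum(w .* predict_one.(e.models))
end

e = Ensemble([OnlineMean(0.0, 0), OnlineMean(5.0, 10)], zeros(2))
for y in (1.0, 2.0, 3.0)
    fit_one!(e, y)
end
```

Because each candidate's state and loss are O(1) per observation, this is the sense in which ensembles stay cheap in the online setting, and the per-model updates are independent, so they parallelize naturally.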

(It's conceivable that this could be a SaaS business in its own right... high level online data science platform built on top of OnlineStats and OnlineAI)

joshday commented 9 years ago

This is an interesting idea. I don't think I've seen anything that tries to automate model selection like this. It would be an easy and powerful tool, especially since many online algorithms can be designed to be self-tuning. I'm intrigued. Let's talk more.

Evizero commented 9 years ago

Hi! Just here to drop some links

> I've seen many graphics which essentially create a decision tree given lots of high-level information about a data problem, and point you at the right solution type (linear regression vs logistic regression vs dimensionality reduction vs SVM vs random forests vs ???)

The most famous one is probably from scikit-learn

> ... an ensemble framework which could choose lots of candidate models for you and drop/average/vote on the best predictions. In the online setting, ensembles could be relatively cheap, even for large datasets (especially if the online algorithm allows for parallel fitting)

There is an interesting reference python implementation concerning automatic ensemble building

tbreloff commented 9 years ago

Thanks for the links. I'm starting to work on ensembles in my package OnlineAI.jl, which extends OnlineStats. I'll certainly use this as a reference.

Evizero commented 9 years ago

What is your position on callback functions? (This question goes to both of you, for OnlineAI and OnlineStats.) You two seem to be doing a very good job and also seem to be very active, so I would really love to use your work where it makes sense. I do have the design restriction that I require callback functions that ideally support early stopping. As far as I can tell, OnlineStats offers this if I use the low-level API with the update! methods.
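For what it's worth, the callback-with-early-stopping pattern being discussed can be reduced to a minimal sketch (all names below are hypothetical placeholders, not the actual update! API): a streaming fit that invokes a user callback periodically, and stops early when the callback returns false.

```julia
# Hypothetical sketch: a streaming fit that calls a user callback every
# `interval` observations; returning `false` requests early stopping
# (e.g. after checking performance on a validation set).
mutable struct RunningMean; μ::Float64; n::Int; end
update_one!(m::RunningMean, y) = (m.n += 1; m.μ += (y - m.μ) / m.n)

function fit_stream!(model, ys; callback = (m, i) -> true, interval = 10)
    for (i, y) in enumerate(ys)
        update_one!(model, y)
        if i % interval == 0 && !callback(model, i)
            return i                 # observations seen before stopping
        end
    end
    return length(ys)
end

m = RunningMean(0.0, 0)
ys = fill(3.0, 100)
# Stop as soon as the running estimate is within 0.01 of the target.
seen = fit_stream!(m, ys; callback = (m, i) -> abs(m.μ - 3.0) > 0.01)
```

With constant data the estimate converges immediately, so the loop stops at the first callback check (seen == 10) rather than consuming all 100 observations.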

Background: I am working on a supervised learning front end (somewhat inspired by scikit-learn and caret, among others) where I also work on data abstractions for file streaming / in-memory datasets in various forms. I am currently investigating what libraries to use as a back-end for specific things. Deterministic optimization seems pretty much settled (pending some PRs / issues here and there) on Optim.jl for low-level access, and Regression.jl. Where I am unsure is stochastic optimization. There is SGDOptim.jl, but it's not really being worked on actively (at least not visibly). I'm also considering Mocha.jl, but it does come with a lot of baggage. Your two projects seem very promising in that regard.

What are your thoughts on this?

tbreloff commented 9 years ago

You should look through the source in https://github.com/tbreloff/OnlineAI.jl/tree/master/src/nnet. I'm working on a bunch of things that you might be interested in, including various ways to split and sample static datasets, various stochastic gradient algorithms, and lots of cool (and easy to use!) neural net stuff... Dropout, regularization, flexible cost functions and activations, and even a normalization technique that I haven't seen anywhere else which I converted into an online algorithm (google "Batch Normalization"). In my opinion, it's much easier to use than something like Mocha.jl, and opens up streaming or parallel algorithms for big data sets. Not to mention you can combine and leverage all of OnlineStats, including the cool "stream" macro I made.

As for your questions on callbacks... My thought is that the functionality of nnet/solver.jl will end up embedded in the update function, and things like early stopping could be accomplished by setting certain flags and occasionally triggering callbacks to check against a validation set. I'm still actively thinking through the design, and my goal is something that should cover your needs.

Evizero commented 9 years ago

I am absolutely interested in the neural net stuff. I will look into the code in close detail.

Concerning callbacks: I do have some time before I get to include stochastic optimization, so don't feel rushed.

Something that troubles me at first glance: do I see correctly that you use matrix rows to denote observations? I know this is the usual convention in textbooks, but as far as I know, in Julia using columns to denote observations is better for performance because of the column-major array memory layout.
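For readers following along, the performance point here is that Julia arrays are column-major: a column slice is contiguous in memory, while a row slice is strided. A small self-contained illustration (not OnlineStats code):

```julia
# Julia arrays are column-major: X[:, j] is contiguous in memory,
# X[i, :] is strided. Storing observations as columns means each
# per-observation pass reads contiguous memory.

X = randn(10, 1_000)                 # 1_000 observations as columns

# Contiguous access: one observation per column, no copy thanks to views
function colsum(X)
    s = 0.0
    for j in axes(X, 2)
        @views s += sum(X[:, j])     # contiguous slice
    end
    s
end

# Strided access: one "observation" per row
function rowsum(X)
    s = 0.0
    for i in axes(X, 1)
        @views s += sum(X[i, :])     # strided slice, cache-unfriendly
    end
    s
end
```

Both functions compute the same total, but on large matrices the column version runs through memory in order while the row version jumps by the column stride on every element, which is where the performance gap comes from.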

tbreloff commented 9 years ago

Yes I think Josh and I were both more concerned with getting the code correct... I made the decision early on that I could live with the performance implications of row-based matrices. I'm holding out hope that we'll have performant row-based array storage in Julia at some point (even if I have to implement it myself), because no matter how hard I try I find column-based storage annoying to use.

tbreloff commented 9 years ago

Also remember that you can update one point at a time by looping over the columns of a column-based matrix... You just lose the short helper function which does the loop for you.
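Concretely, that column loop might look like the following (update_obs! and RunningSum are hypothetical stand-ins for a per-observation update, not actual OnlineStats methods):

```julia
# Sketch of the column-looping workaround: apply a per-observation
# update over eachcol of a column-major data matrix, replacing the
# row-based helper that would otherwise do the loop for you.
mutable struct RunningSum; s::Vector{Float64}; end
update_obs!(m::RunningSum, x) = (m.s .+= x)

X = [1.0 3.0; 2.0 4.0]          # two observations stored as columns
m = RunningSum(zeros(2))
for x in eachcol(X)             # each x is a non-copying column view
    update_obs!(m, x)
end
# m.s == [4.0, 6.0]
```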

Evizero commented 9 years ago

> because no matter how hard I try I find column-based storage annoying to use

I absolutely agree on that.

However, it does make it somewhat harder to interface with the library when using the column-based format (which I do), though looping through the columns should probably do the trick for me, as you just described.

I have seen the TransposeView{T}, which seems like a good way to internally pretend the data is row-indexed. Maybe that could be a way to get the column-based performance without sacrificing code clarity. Or what is this type for?

tbreloff commented 9 years ago

TransposeView may work for this (or at least be the beginning of an implementation). I made it so that I could create "tied matrices" in stacked autoencoders... Essentially the weight matrix from one layer is the transpose of the weight matrix from a previous layer. This was straightforward since the layers now share the same underlying matrix.
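For illustration, the tied-weights idea can be reduced to a tiny sketch (this is the concept only, not the actual TransposeView code): a view type whose indices are swapped relative to a shared parent matrix, so writes through either alias hit the same storage.

```julia
# Sketch: a transpose view sharing storage with its parent, so an
# encoder weight matrix W and a decoder's "tied" transpose stay in sync.
struct TransposeViewSketch{T} <: AbstractMatrix{T}
    parent::Matrix{T}
end
Base.size(v::TransposeViewSketch) = reverse(size(v.parent))
Base.getindex(v::TransposeViewSketch, i, j) = v.parent[j, i]
Base.setindex!(v::TransposeViewSketch, x, i, j) = (v.parent[j, i] = x)

W  = zeros(2, 3)                      # encoder weights
Wt = TransposeViewSketch(W)           # decoder's tied weights
W[1, 2] = 7.0
# Wt[2, 1] == 7.0 -- same underlying storage
```

Modern Julia exposes the same idea as `transpose(W)`, a non-copying lazy wrapper, so today the tied layers could share storage without a custom type.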

joshday commented 9 years ago

I've been traveling...Tom seems to have your questions well covered, but I'll chime in here. I'd love to stay updated with what you're working on and what you'd like to see in OnlineStats. My next OnlineStats project is variance components models, but I'm happy to work on things people are actually using.

joshday commented 8 years ago

This is definitely JuliaML material.

tbreloff commented 8 years ago

Oh man... I can't believe this was a year ago!

joshday commented 8 years ago

Is this essentially the birthplace for @tbreloff's vision of JuliaML? It's a part of history, now.