neurodata / SPORF

This is the implementation of Sparse Projection Oblique Randomer Forest
https://neurodata.io/forests/
97 stars 46 forks source link

use R-format for classifier specification and streamline IO for prediction #16

Open ebridge2 opened 6 years ago

ebridge2 commented 6 years ago

wrap the forest, X when rank transforming, etc. anything that predict would need for determining prediction to simplify your function calls as a single class object. Then, use the S3 predict function instead of Predict that you guys wrote. That is considered "standard practice" for ML techniques in R; big packages like caret, lmer, etc. use it. Ie:

https://github.com/neurodata/lol/blob/master/R/random_classifier.R

This classifier an be used as:

model <- lol.classify.randomGuess(X[train,], Y[train])  # model has everything needed for prediction already
prediction <- predict(model, X[-train,])  # use S3 predict method for consistency with other classifiers in R language; calls predict.randomGuess from the above file

For example, if you have rank.transform = TRUE as specified in your docs, there's no need to require users to have to know to specify their X when predicting; you can output a "RerF" class object that either contains or doesn't contain X depending on when rank.transform = TRUE, and then just wrap this in whatever is output by RerF and pass that whole thing to predict.

I'll be doing some R-RerF stuff in the future, so I've assigned this to myself for now as it seems like a decent initial goal

jovo commented 6 years ago

@ebridge2 they wrote the code to be like the randomForest package, i'm not sure if that is relevant to your comment.

ebridge2 commented 6 years ago

randomForest (I use that package and based my code for writing/training classifiers around their interface) uses predict.* S3 interface. see randomForest predict code. In randomForest, you can see there's a function called "predict.randomForest()" that takes an object as its first parameter. this is how you extend stats::predict, since predict() will know to use the predict.name associated with classifier name so long as it exists. When you predict with randomForest, you would just call predict(trained_model, X.new), NOT predict.randomForest(trained_model, X.new) because that's how S3 works in R. In R-Rerf, we write our own non-S3 Predict() function that does not extend the S3 method predict() in stats like theirs does.

As for comment about passing a variable for rank transform, that is just stylistic; that parameter is passed when training the forst so there's no reason to have to pass it twice when you can just return it as part of the trained model object (ie, in the trained output, set trained$rank=TRUE and then do a simple check in the predict() method on the object) and then pass the whole model object to predict().

ebridge2 commented 6 years ago

the advantage is that let's say I train a bunch of models using a bunch of packages (randomForest, glm, R-RerF) I can use the same predict(trained_model, X.new) for all of them if you extend the S3 method properly.

jbrowne6 commented 6 years ago

This is a good direction to take and can be implemented in a way that does not affect backwards compatibility.

tyler-tomita commented 6 years ago

Agreed. I would assume this is a requirement for integration with the caret package, which is what I ultimately would like to see happen at some point.

On Mon, Jun 4, 2018 at 10:02 AM James D. Browne Jr. < notifications@github.com> wrote:

This is a good direction to take and can be implemented in a way that does not affect backwards compatibility.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/neurodata/R-RerF/issues/16#issuecomment-394365692, or mute the thread https://github.com/notifications/unsubscribe-auth/AJVzkekYcrqA5pwP7BODOT4aqx4CpXaDks5t5T4GgaJpZM4TgXAM .

jovo commented 6 years ago

Why?

On Tue, Jun 5, 2018, 4:21 PM Tyler Tomita notifications@github.com wrote:

Agreed. I would assume this is a requirement for integration with the caret package, which is what I ultimately would like to see happen at some point.

On Mon, Jun 4, 2018 at 10:02 AM James D. Browne Jr. < notifications@github.com> wrote:

This is a good direction to take and can be implemented in a way that does not affect backwards compatibility.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/neurodata/R-RerF/issues/16#issuecomment-394365692, or mute the thread < https://github.com/notifications/unsubscribe-auth/AJVzkekYcrqA5pwP7BODOT4aqx4CpXaDks5t5T4GgaJpZM4TgXAM

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/neurodata/R-RerF/issues/16#issuecomment-394846691, or mute the thread https://github.com/notifications/unsubscribe-auth/AACjcjGn8Hqs6czChU0o5zDsKj_gvclFks5t5ugugaJpZM4TgXAM .

tyler-tomita commented 6 years ago

Why integrate with caret? It makes model tuning much easier. It provides a uniform interface for comparing many different algorithms. Most of the well-known algorithms, including XGBoost and Rborist have been integrated. Therefore, it will increase both visibility and accessibility to users.

Looking at caret's repo, it actually doesn't seem like there is a requirement to conform to standard s3 methods.

On Tue, Jun 5, 2018 at 6:54 PM joshua vogelstein notifications@github.com wrote:

Why?

On Tue, Jun 5, 2018, 4:21 PM Tyler Tomita notifications@github.com wrote:

Agreed. I would assume this is a requirement for integration with the caret package, which is what I ultimately would like to see happen at some point.

On Mon, Jun 4, 2018 at 10:02 AM James D. Browne Jr. < notifications@github.com> wrote:

This is a good direction to take and can be implemented in a way that does not affect backwards compatibility.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <https://github.com/neurodata/R-RerF/issues/16#issuecomment-394365692 , or mute the thread <

https://github.com/notifications/unsubscribe-auth/AJVzkekYcrqA5pwP7BODOT4aqx4CpXaDks5t5T4GgaJpZM4TgXAM

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/neurodata/R-RerF/issues/16#issuecomment-394846691, or mute the thread < https://github.com/notifications/unsubscribe-auth/AACjcjGn8Hqs6czChU0o5zDsKj_gvclFks5t5ugugaJpZM4TgXAM

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/neurodata/R-RerF/issues/16#issuecomment-394885863, or mute the thread https://github.com/notifications/unsubscribe-auth/AJVzkUDgkDi2WXaZHDUcbOzp5I9rw-Hyks5t5wwTgaJpZM4TgXAM .

ebridge2 commented 6 years ago

definitely no requirement, but it will make peoples' lives easier the more accessible the algorithm is :) S3 methods are super trivial to implement; it would probably be sub 4 lines to change from your current Predict() as far as I can tell. Just add @importFrom stats predict in the roxygen, rename your Predict() to predict.RandomerForest(object, X, additional_hyperparams, ...) (or whatever R-RerF training is called; here I've assumed RandomerForest), wrap the entire trained classifier (and all required hyperparameters related to training) in a single list (passed into predict() as the object argument and make it a structure(list(training_param1=#, training_param2=#, etc), class="RandomerForest"), and you are done (and prediction can be done as predict(trained_classifier, new_data, additional_args).

On Tue, Jun 5, 2018 at 9:27 PM, Tyler Tomita notifications@github.com wrote:

Why integrate with caret? It makes model tuning much easier. It provides a uniform interface for comparing many different algorithms. Most of the well-known algorithms, including XGBoost and Rborist have been integrated. Therefore, it will increase both visibility and accessibility to users.

Looking at caret's repo, it actually doesn't seem like there is a requirement to conform to standard s3 methods.

On Tue, Jun 5, 2018 at 6:54 PM joshua vogelstein <notifications@github.com

wrote:

Why?

On Tue, Jun 5, 2018, 4:21 PM Tyler Tomita notifications@github.com wrote:

Agreed. I would assume this is a requirement for integration with the caret package, which is what I ultimately would like to see happen at some point.

On Mon, Jun 4, 2018 at 10:02 AM James D. Browne Jr. < notifications@github.com> wrote:

This is a good direction to take and can be implemented in a way that does not affect backwards compatibility.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <https://github.com/neurodata/R-RerF/issues/16#issuecomment- 394365692 , or mute the thread <

https://github.com/notifications/unsubscribe-auth/ AJVzkekYcrqA5pwP7BODOT4aqx4CpXaDks5t5T4GgaJpZM4TgXAM

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <https://github.com/neurodata/R-RerF/issues/16#issuecomment-394846691 , or mute the thread < https://github.com/notifications/unsubscribe-auth/ AACjcjGn8Hqs6czChU0o5zDsKj_gvclFks5t5ugugaJpZM4TgXAM

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/neurodata/R-RerF/issues/16#issuecomment-394885863, or mute the thread https://github.com/notifications/unsubscribe-auth/ AJVzkUDgkDi2WXaZHDUcbOzp5I9rw-Hyks5t5wwTgaJpZM4TgXAM

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/neurodata/R-RerF/issues/16#issuecomment-394910574, or mute the thread https://github.com/notifications/unsubscribe-auth/AIeNW8UGi1qZ9b857VfF4_9TaqU0qDhVks5t5zALgaJpZM4TgXAM .

-- Eric Bridgeford ericwb.me

jovo commented 6 years ago

i see. being caret compliant seems good, as does the S3 stuff. seems like something @ttomita or @ebridge can easily do at some point.

On Tue, Jun 5, 2018 at 9:56 PM, Eric Bridgeford notifications@github.com wrote:

definitely no requirement, but it will make peoples' lives easier the more accessible the algorithm is :) S3 methods are super trivial to implement; it would probably be sub 4 lines to change from your current Predict() as far as I can tell. Just add @importFrom stats predict in the roxygen, rename your Predict() to predict.RandomerForest(object, X, additional_hyperparams, ...) (or whatever R-RerF training is called; here I've assumed RandomerForest), wrap the entire trained classifier (and all required hyperparameters related to training) in a single list (passed into predict() as the object argument and make it a structure(list(training_param1=#, training_param2=#, etc), class="RandomerForest"), and you are done (and prediction can be done as predict(trained_classifier, new_data, additional_args).

On Tue, Jun 5, 2018 at 9:27 PM, Tyler Tomita notifications@github.com wrote:

Why integrate with caret? It makes model tuning much easier. It provides a uniform interface for comparing many different algorithms. Most of the well-known algorithms, including XGBoost and Rborist have been integrated. Therefore, it will increase both visibility and accessibility to users.

Looking at caret's repo, it actually doesn't seem like there is a requirement to conform to standard s3 methods.

On Tue, Jun 5, 2018 at 6:54 PM joshua vogelstein < notifications@github.com

wrote:

Why?

On Tue, Jun 5, 2018, 4:21 PM Tyler Tomita notifications@github.com wrote:

Agreed. I would assume this is a requirement for integration with the caret package, which is what I ultimately would like to see happen at some point.

On Mon, Jun 4, 2018 at 10:02 AM James D. Browne Jr. < notifications@github.com> wrote:

This is a good direction to take and can be implemented in a way that does not affect backwards compatibility.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <https://github.com/neurodata/R-RerF/issues/16#issuecomment- 394365692 , or mute the thread <

https://github.com/notifications/unsubscribe-auth/ AJVzkekYcrqA5pwP7BODOT4aqx4CpXaDks5t5T4GgaJpZM4TgXAM

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <https://github.com/neurodata/R-RerF/issues/16#issuecomment- 394846691 , or mute the thread < https://github.com/notifications/unsubscribe-auth/ AACjcjGn8Hqs6czChU0o5zDsKj_gvclFks5t5ugugaJpZM4TgXAM

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <https://github.com/neurodata/R-RerF/issues/16#issuecomment-394885863 , or mute the thread https://github.com/notifications/unsubscribe-auth/ AJVzkUDgkDi2WXaZHDUcbOzp5I9rw-Hyks5t5wwTgaJpZM4TgXAM

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/neurodata/R-RerF/issues/16#issuecomment-394910574, or mute the thread https://github.com/notifications/unsubscribe- auth/AIeNW8UGi1qZ9b857VfF4_9TaqU0qDhVks5t5zALgaJpZM4TgXAM .

-- Eric Bridgeford ericwb.me

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/neurodata/R-RerF/issues/16#issuecomment-394914883, or mute the thread https://github.com/notifications/unsubscribe-auth/AACjcjxjVhABJgNeKpLALADSLyQk9Lyoks5t5zbPgaJpZM4TgXAM .

-- the glass is all full: half water, half air. neurodata.io ps - i am committed to responding to my emails, it often takes about a week. thank you for understanding.