move monitor and logdir arguments to init

tensorflow / skflow

Simplified interface for TensorFlow (mimicking Scikit Learn) for Deep Learning

Apache License 2.0

3.18k stars 439 forks source link

move monitor and logdir arguments to init #125

Closed mheilman closed 8 years ago

mheilman commented 8 years ago

This moves the monitor and logdir arguments for the fit method to the base estimator __init__ method so that fit better follows the sklearn API guidelines: "fit parameters should be restricted to directly data dependent variables". This should make it easier to use, e.g., GridSearchCV with skflow models.

googlebot commented 8 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.

If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
If you signed the CLA as a corporation, please let us know the company's name.

mheilman commented 8 years ago

I'm probably missing changes in the docs or examples here. If this looks OK in general, I'll make a pass at documentation updates (and sign the CLA, etc.).

(also, oops, fixing the tests now...)

ilblackdragon commented 8 years ago

Hi, @mheilman thanks for PR, but current interface of monitor was organized this way according to monitors in current estimators: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier.fit

And there are issues with passing monitor and logdir to init due to the constructor params been used as hyperparams - they are saved / restored with a model. The user may want to build a model with one monitor / logdir and then restore with another.

I agree though that Pipeline / GridSearchCV don't support monitors - I actually think this should be fixed in sklearn.

mheilman commented 8 years ago

Ah, right, I forgot that gradient boosting models have a monitor argument.

In fact, GridSearchCV.__init__ has a fit_params keyword argument, so logdir and monitor can just be passed in there. Sorry I didn't notice that earlier. I had been looking at GridSearchCV.fit (which only takes X and y).

I'm not sure how the fit params would work with pipelines, but they might be fine there, too, or will eventually.

ilblackdragon commented 8 years ago

Didn't know about fit_params, thanks for pointing out! If you are interested, you can try adding an example that uses GridSearchCV with monitor and logdir - I think would be pretty useful.

mheilman commented 8 years ago

Hmm... what about keyword arguments to predict methods (e.g., batch_size and axis here)? I don't see a way to access those via GridSearchCV.

The default for axis seems fine for my purposes, but I ran into an issue with the batch_size default, where GridSearchCV was taking a ton of memory (> 8GB) because it was trying to predict on the entire dataset at once rather than in batches. Do you think it would make sense to make the default batch_size for predict be self.batch_size (but still let the user override is when needed)?

I'd be happy to make a PR if you think that makes sense. Sorry, it's a bit of a separate issue than this PR, but I figured it was related enough to discuss here...

ilblackdragon commented 8 years ago

Yeah, it make sense to have some default value and self.batch_size is closer then anything else. PR would be welcome!

On Thu, Mar 3, 2016 at 1:21 PM, Michael Heilman notifications@github.com wrote:

Hmm... what about keyword arguments to predict methods (e.g., batch_size and axis here https://github.com/tensorflow/skflow/blob/67d3fb03efdfa37e2b7013e2802d1b737f8fbe85/skflow/estimators/base.py#L270)? I don't see a way to access those via GridSearchCV.

The default for axis seems fine for my purposes, but I ran into an issue with the batch_size default, where GridSearchCV was taking a ton of memory (> 8GB) because it was trying to predict on the entire dataset at once rather than in batches. Do you think it would make sense to make the default batch_size for predict be self.batch_size (but still let the user override is when needed)?

I'd be happy to make a PR if you think that makes sense.

— Reply to this email directly or view it on GitHub https://github.com/tensorflow/skflow/pull/125#issuecomment-191969550.

Best regards, Illia Polosukhin

mheilman commented 8 years ago

OK, I made that minor change (and added an example of grid search) in #126.