tensorflow / skflow

Simplified interface for TensorFlow (mimicking Scikit Learn) for Deep Learning
Apache License 2.0

another ValidationMonitor with validation (+ early stopping) per epoch #133

Closed. alanyuchenhou closed this issue 8 years ago

alanyuchenhou commented 8 years ago

From what I understand, the existing ValidationMonitor performs validation every [print_steps] steps, and checks for the stop condition every [early_stopping_rounds] steps. I'd like to add another ValidationMonitor that performs validation and checks the stopping condition once every epoch. Is this the recommended practice in machine learning regarding validation and early stopping? What I mean is that I'd like to add a fit process along these lines:

def fit(self, x_train, y_train, x_validate, y_validate):
    # train one epoch at a time until the validation loss stops improving
    previous_validation_loss = float('inf')
    current_validation_loss = some_error(y_validate, estimator.predict(x_validate))
    while current_validation_loss < previous_validation_loss:
        estimator.train_one_more_epoch(x_train, y_train)
        previous_validation_loss = current_validation_loss
        current_validation_loss = some_error(y_validate, estimator.predict(x_validate))
alanyuchenhou commented 8 years ago

@dansbecker I also noticed the inefficiency mentioned in #102 by @mheilman. I think the problem is in this loop: https://github.com/tensorflow/skflow/blob/master/skflow/trainer.py#L113 Calling monitor.update() inside that loop is too expensive and too fine-grained for most practical applications.

Can we consider moving monitor.update() to https://github.com/tensorflow/skflow/blob/master/skflow/estimators/base.py#L236 and having something like this:

def fit(self, X, y, monitor=None, logdir=None):
    ...
    for epoch in range(monitor.n_epochs_max_tolerable):
        self._trainer.train()
        monitor.update()
        if monitor.monitor_inducing_stop():
            break

In this way, the monitor is invoked once every epoch to check for over-fitting (is it called over-training or over-fitting?) and to stop the fit process when it occurs.

ilblackdragon commented 8 years ago

Actually, maybe a better option is to run the monitor in a separate thread and just push some information into it from the main thread from time to time.
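
A minimal sketch of that idea (class and method names here are hypothetical, not skflow API): the training thread pushes validation losses into a queue, and a background thread decides when to request an early stop.

import queue
import threading

class ThreadedValidationMonitor(object):
    """Hypothetical sketch (Python 3): the training thread pushes validation
    losses via update(); a daemon thread tracks them and sets a stop flag."""

    def __init__(self, patience=5):
        self._losses = queue.Queue()
        self._stop = threading.Event()
        self._patience = patience
        self._best = float('inf')
        self._bad_rounds = 0
        threading.Thread(target=self._run, daemon=True).start()

    def update(self, validation_loss):
        # called from the main training thread; returns immediately
        self._losses.put(validation_loss)

    def should_stop(self):
        return self._stop.is_set()

    def _run(self):
        while not self._stop.is_set():
            loss = self._losses.get()
            if loss < self._best:
                self._best, self._bad_rounds = loss, 0
            else:
                self._bad_rounds += 1
                if self._bad_rounds >= self._patience:
                    self._stop.set()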

waleedka commented 8 years ago

I've struggled with the inefficiency mentioned here as well. My validation set is 25,000 records (30% of my data), and my mini-batch size is 20. When I use the ValidationMonitor, I end up training on 20 records and then calculating the validation error on 25,000 records, which slows my training down by 100x or more.

Putting the monitor in a separate thread, as @ilblackdragon suggested, is interesting but won't solve the problem in every case. For example, if training a mini-batch takes 1 second and calculating the validation error takes 100 seconds, then the monitor thread will fall behind and won't be able to stop the training in time.

I solved this locally by modifying ValidationMonitor._set_last_loss_seen() in monitors.py to run only once every print_steps steps. It's a simple fix that doesn't require passing additional parameters, and it's intuitive to have the validation test done at the same frequency as the printing of its values.

To address the original issue of this thread (validation every epoch), the value of print_steps could be set to a number large enough that the printing and the validation test both happen once per epoch.
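
Roughly, the idea looks like this (a standalone sketch with made-up names, not the actual skflow monitor code):

class ThrottledValidationMonitor(object):
    """Hypothetical sketch: only run the expensive validation pass once every
    print_steps training steps, and track early stopping on those checks."""

    def __init__(self, x_val, y_val, loss_fn,
                 print_steps=100, early_stopping_rounds=5):
        self.x_val, self.y_val = x_val, y_val
        self.loss_fn = loss_fn
        self.print_steps = print_steps
        self.early_stopping_rounds = early_stopping_rounds
        self.best_loss = float('inf')
        self.bad_rounds = 0

    def update(self, step, predict_fn):
        """Returns True when training should stop early."""
        if step % self.print_steps != 0:
            return False  # skip the expensive validation pass on most steps
        loss = self.loss_fn(self.y_val, predict_fn(self.x_val))
        print("step %d, validation loss %.6f" % (step, loss))
        if loss < self.best_loss:
            self.best_loss, self.bad_rounds = loss, 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.early_stopping_rounds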

If I get a thumbs up on this approach, I can create a PR for it.

ilblackdragon commented 8 years ago

I think the problem you observe can be fixed by validating on batches instead of the full set every time, combined with a moving average of the score.
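
For example, something along these lines (a sketch with hypothetical names, using NumPy only to draw the random batch): score one random validation mini-batch per check and smooth the noisy estimates with an exponential moving average.

import numpy as np

class BatchValidationScore(object):
    """Hypothetical sketch: estimate the validation loss from one random
    mini-batch per check and keep an exponential moving average of it."""

    def __init__(self, x_val, y_val, loss_fn, batch_size=64, decay=0.9):
        self.x_val, self.y_val = np.asarray(x_val), np.asarray(y_val)
        self.loss_fn = loss_fn
        self.batch_size = batch_size
        self.decay = decay
        self.ema = None

    def update(self, predict_fn):
        idx = np.random.choice(len(self.x_val), self.batch_size, replace=False)
        loss = self.loss_fn(self.y_val[idx], predict_fn(self.x_val[idx]))
        # the moving average smooths the batch-to-batch noise in the estimate
        self.ema = loss if self.ema is None else (
            self.decay * self.ema + (1.0 - self.decay) * loss)
        return self.ema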

waleedka commented 8 years ago

@ilblackdragon That's a good solution. I remember seeing a discussion about supporting more early stopping options, and what you mentioned seems like it belongs as part of that.

In the meantime, if someone needs an urgent fix, here are the two lines I changed to fix the performance issue for me. It simply calculates the validation error once every print_steps steps rather than on every step.

https://github.com/waleedka/tensorflow/commit/2ef359c3f1aede71ae3a6013cb6dbdf0e74189fe

ilblackdragon commented 8 years ago

Let me actually add this to master - I think it's an important fix.

terrytangyuan commented 8 years ago

I feel like this is addressed in the latest version. Please submit an issue/PR to TensorFlow if it's not there. Thanks!
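
For anyone landing here later, a rough usage sketch with the tf.contrib.learn monitors that superseded skflow (parameter names are from that era's API and the data arrays are placeholders; double-check against the docs for your TensorFlow version):

import tensorflow as tf

# x_train, y_train, x_val, y_val are placeholders for your own numpy arrays
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    x_val, y_val,
    every_n_steps=50,           # run validation every 50 training steps
    early_stopping_rounds=200)  # stop after 200 steps without improvement

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(x_train),
    hidden_units=[10, 20, 10],
    n_classes=3)

classifier.fit(x_train, y_train, steps=2000, monitors=[validation_monitor])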