keras-team / tf-keras

The TensorFlow-specific implementation of the Keras API, which was the default Keras from 2019 to 2023.
Apache License 2.0

Model stops training with variable-size dataset #131

Open intervolga-school opened 2 years ago

intervolga-school commented 2 years ago

System information.

Describe the problem.

In a real use case I train a model with a tf.data.Dataset instance (based on tensorflow_datasets). One big difference from the default keras.Model.fit + Dataset examples is an unknown (variable) dataset length. In my case the dataset length varies (±20%) because I apply random augmentations and filter some samples out. See the provided Colab link, and the sketch just below, for what I mean.
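
A minimal sketch of how such a pipeline ends up with a variable per-epoch length (the element values, survival probability, and batch size below are illustrative, not taken from the Colab):

import tensorflow as tf

def make_dataset():
    base = tf.data.Dataset.range(1000).map(lambda x: tf.cast(x, tf.float32))
    # Random filtering: each element survives with ~80% probability, so
    # every fresh iteration over the dataset (i.e. every epoch) yields a
    # slightly different number of elements, hence a variable step count.
    survivors = base.filter(lambda x: tf.random.uniform([]) > 0.2)
    return survivors.batch(8)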

As a result, when the first epoch finishes (the dataset raises OutOfRangeError), Keras remembers the current step count, and if the same dataset yields fewer steps on a later epoch, all model training is stopped.

Describe the current behavior. The model stops training if the second/third/etc. dataset iterator is shorter than the first one.

Describe the expected behavior. The model should not stop training. It may print a warning, but it should not stop.

Standalone code to reproduce the issue. https://colab.research.google.com/drive/1fY4v9WBRxfsywDyKKidu-lmFpaPdAn9D?usp=sharing

Source code / logs.


model.fit(dataset, epochs=15)

# Epoch 1/15
# 819/819 [==============================] - 2s 1ms/step - loss: 1.3987
# Epoch 2/15
# 819/819 [==============================] - 1s 1ms/step - loss: 1.0563
# Epoch 3/15
# 819/819 [==============================] - 1s 1ms/step - loss: 1.0262
# Epoch 4/15
# 819/819 [==============================] - 1s 1ms/step - loss: 1.0156
# Epoch 5/15
# 782/819 [===========================>..] - ETA: 0s - loss: 1.0146WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 12285 batches). You may need to use the repeat() function when building your dataset.
# 819/819 [==============================] - 1s 1ms/step - loss: 1.0161
chunduriv commented 2 years ago

@sachinprasadhs, I was able to reproduce the issue on Colab using TF 2.7 and tf-nightly (2.8.0-dev20211201). Please find the gist here for reference. Thanks!

sachinprasadhs commented 2 years ago

You can use the solution mentioned here to avoid the warning and continue training.
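
The linked solution is not quoted in this thread; one common workaround of this shape (an assumption on my part, not necessarily the linked one) is to repeat the dataset and pin steps_per_epoch so the iterator never runs dry mid-epoch:

# Assumed workaround sketch (not quoted from the linked comment):
# a repeated dataset never raises OutOfRangeError, at the cost of
# hard-coding how many steps count as one "epoch".
steps = 819  # e.g. the step count observed during the first epoch
model.fit(dataset.repeat(), epochs=15, steps_per_epoch=steps)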

shkarupa-alex commented 2 years ago

I know a workaround, but that is very ugly:

This is a bad solution. In my opinion, the model should continue training until it reaches num_epochs even if some epoch has fewer batches than the first one. Displaying the number of steps remaining within an epoch is not as important as completing all epochs.

haifeng-jin commented 2 years ago

Waiting for triage. Summary: when the dataset yields a different number of samples from epoch to epoch (the batch size is the same, but the number of steps differs), training stops at the first epoch that has fewer steps than the first epoch.

rchao commented 2 years ago

Thanks for reporting the issue - one solution is to use a steps_per_epoch that's large enough to cover the amount of data in every epoch, and have the termination of an epoch rely on exhaustion of the data (OutOfRangeError). Can you check if this works? A sketch follows below.
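
A minimal sketch of this suggestion (the bound is illustrative; it assumes step counts stay within the ±20% variation reported above):

# Sketch of the suggested call (values illustrative): pick an upper
# bound comfortably above the longest expected epoch and let each
# epoch end when the iterator is exhausted rather than at the bound.
upper_bound = 1000  # > 819 * 1.2, above any plausible epoch length
model.fit(dataset, epochs=15, steps_per_epoch=upper_bound)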

shkarupa-alex commented 2 years ago

Got the same issue when implementing a word2vec model. The dataset size changes from epoch to epoch due to:

A single estimation of the number of batches takes around 4 hours (very large dataset), and the size can change by ±20% from epoch to epoch.

So setting steps_per_epoch is not a good option. It would be great if keras.Model always relied on OutOfRangeError itself, as sketched below.
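
For illustration, a hedged sketch of the requested behavior written as a custom training loop (it assumes dataset yields (x, y) batches and model is already compiled):

num_epochs = 15
for epoch in range(num_epochs):
    steps = 0
    # Python iteration over a tf.data.Dataset gets a fresh iterator each
    # epoch and ends cleanly at exhaustion, so a shorter epoch is harmless.
    for x, y in dataset:
        model.train_on_batch(x, y)
        steps += 1
    print(f"epoch {epoch + 1}: finished after {steps} steps")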