tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Validate TFRecords after data generation for corrupted/truncated records #3330

whatwilliam commented 3 years ago

Is your feature request related to a problem? Please describe.
Sometimes when generating a dataset using tfds.load, the data stored as TFRecords gets truncated, or image data gets corrupted in the process, but dataset generation continues without raising any flags.

It is not until a user begins training a model on the dataset that TensorFlow raises a DataLossError and notifies the user that the data has been truncated or corrupted.

Especially for large datasets, it is frustrating to spend a long time generating data, building a model, and starting training, only to be stopped by this error at the first step.

A user can validate the TFRecords individually by creating a tf.data.TFRecordDataset for each record file, but they would not think to do that unless they had already encountered this problem in training.

Describe the solution you'd like
I would like a way for the user to find out at the data generation step that their TFRecords have been corrupted or truncated.

Describe alternatives you've considered
A simple way to validate each TFRecord file:

import os
import tensorflow as tf

records_dir = 'path/to/records/1.0.0'
i = 0  # running count of records read across all files
for fname in os.listdir(records_dir):
    print('validating', fname)
    records = tf.data.TFRecordDataset(os.path.join(records_dir, fname))
    try:
        # Reading every record forces TensorFlow to parse it, which
        # surfaces a DataLossError for truncated/corrupted data.
        for _ in records:
            i += 1
    except tf.errors.DataLossError as e:
        print('error in {} at record {}'.format(fname, i))
        print(e)

I would like a similar check implemented in tfds.load, run after the records have been written, that at least warns users about corrupted/truncated data.
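
As a stopgap, something like the wrapper below could surface the error right after generation. This is only a rough sketch; load_and_validate is a hypothetical helper name, not an existing TFDS API:

import tensorflow as tf
import tensorflow_datasets as tfds

def load_and_validate(name, split, **kwargs):
    # Hypothetical helper: load a dataset, then scan it once so any
    # DataLossError from truncated/corrupted records surfaces right
    # away instead of partway through training.
    ds = tfds.load(name, split=split, **kwargs)
    count = 0
    try:
        for _ in ds:
            count += 1
    except tf.errors.DataLossError as e:
        raise RuntimeError(
            'corrupted/truncated record after example {}'.format(count)) from e
    return ds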

Additional context
These errors can be bypassed during training with tf.data.experimental.ignore_errors, but that can be problematic because it also ignores other errors that could be important, so it should only be used as a workaround.
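
For reference, a minimal sketch of that workaround; 'my_dataset' is a placeholder dataset name:

import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load('my_dataset', split='train')  # 'my_dataset' is a placeholder
# Silently drops any element whose processing raises an error -- this
# skips the DataLossError from truncated records, but it also hides
# every other error, which is why it is only a workaround.
ds = ds.apply(tf.data.experimental.ignore_errors())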

Conchylicultor commented 3 years ago

Our dataset generation process is deterministic, so datasets should not have corrupted images. If this happens, we should fix the individual datasets that have corrupted images. Which datasets are impacted?

whatwilliam commented 3 years ago

@Conchylicultor This is not a problem I encounter with tensorflow_datasets' catalog; rather, it occurs when I am creating my own datasets. I will generate my dataset and only find out that some files have been corrupted or truncated once training is well underway. The fix for me is simply to regenerate the dataset, and the corrupted records take care of themselves; I just wish I had known sooner. Perhaps a record validator could tell the user to regenerate the dataset, or at least warn that the generated dataset contains corrupted records.

Conchylicultor commented 3 years ago

Then the solution is likely for you to check for corrupted examples in your _generate_examples function. This issue seems specific to your dataset and the data you're using, so I'm not sure we should add this check to TFDS, as it would slow down generation for all datasets.
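
For example, a minimal sketch of that approach, assuming an image dataset; the feature key and the use of PIL's Image.verify() are illustrative, not part of TFDS:

import os
from PIL import Image

def _generate_examples(images_dir):
    # Sketch of a _generate_examples body that fails fast on bad
    # inputs at generation time instead of at training time.
    for fname in sorted(os.listdir(images_dir)):
        path = os.path.join(images_dir, fname)
        try:
            with Image.open(path) as img:
                img.verify()  # raises on truncated/corrupted image data
        except Exception as e:
            raise ValueError('corrupted image {}: {}'.format(path, e))
        yield fname, {'image': path}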