tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Better release notes #2463

Closed Conchylicultor closed 4 years ago

Conchylicultor commented 4 years ago

Currently, dataset updates are poorly documented. The API isn't great, and it requires versions to still be supported in order to document them:

class MyDataset(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version('2.0.1', release_note='Fix typo')
  SUPPORTED_VERSIONS = [
      tfds.core.Version('2.0.0', release_note='Add new field'),
      tfds.core.Version('1.0.0', release_note='First release'),
  ]

Instead, it would improve usability and discoverability to have an entirely separate field, independent of the supported versions, like:

class MyDataset(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version('2.0.1')
  RELEASE_NOTES = {
      '2.0.1': 'Fix typo',
      '2.0.0': 'Add new field',
      '1.0.0': 'First release',
  }

Notes:

NikhilBartwal commented 4 years ago

@Conchylicultor I would like to work on this. If I got it right, we need to:

Could you point me to how the tfds new <> template should be updated, and whether we need to do this for all datasets? How many datasets do you think I should update in a single PR?

Conchylicultor commented 4 years ago

Thank you for looking into this.

Could you point me to how the tfds new <> template

By updating the template, I'm only talking about updating the template at https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/scripts/cli/new.py#L150, like:

  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {
      '1.0.0': 'Initial release',
  }

Current datasets shouldn't be modified.
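For reference, a rough sketch of where those lines would sit in the generated builder skeleton (the class name, docstring, and method bodies below are placeholders for illustration, not the literal contents of new.py):

import tensorflow_datasets as tfds

class MyDataset(tfds.core.GeneratorBasedBuilder):
  """DatasetBuilder for my_dataset."""

  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {
      '1.0.0': 'Initial release.',
  }

  def _info(self) -> tfds.core.DatasetInfo:
    # Dataset metadata (features, supervised keys, ...).
    raise NotImplementedError

  def _split_generators(self, dl_manager):
    # Download the data and define the splits.
    raise NotImplementedError

  def _generate_examples(self, path):
    # Yield (key, example) tuples from the raw data.
    raise NotImplementedError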

Remove the release notes from the VERSION.

I think at first we should keep both fields, for backward compatibility (but raise a logging.warning). Then once all users/datasets have been migrated, we could delete the kwarg in version.py.
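A minimal sketch of such a backward-compatible path, assuming the deprecated kwarg is the existing description= argument (the real Version class in version.py has more logic than shown here):

import logging

class Version:

  def __init__(self, version_str: str, description: str = None):
    self._version_str = version_str
    # Deprecated: release notes should be declared on the builder
    # (RELEASE_NOTES dict or BuilderConfig.release_notes) instead.
    if description is not None:
      logging.warning(
          'Version(description=...) is deprecated and will be removed. '
          'Declare release notes in the RELEASE_NOTES class attribute.')
    self.description = description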

Keshav15 commented 4 years ago

Oh, I hadn't seen that you already created it.

Conchylicultor commented 4 years ago

Thank you for your contributions. I'll merge this soon.

In a side PR, we should remove the description='release notes' usage from Version(description=) and replace it with BuilderConfig(release_notes=) / RELEASE_NOTES:
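For a dataset without configs, the migration would look roughly like this (the dataset name and release note are made up for illustration):

import tensorflow_datasets as tfds

# Before: the release note is squeezed into the version description.
class SomeDataset(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version('3.0.1', description='Fixed corrupted labels.')

# After: the version carries no note; notes live in a dedicated dict.
class SomeDataset(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version('3.0.1')
  RELEASE_NOTES = {
      '3.0.1': 'Fixed corrupted labels.',
  }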

Here are the tests which seem impacted by the removal of the Version(description=) kwarg:

tensorflow_datasets/audio/groove_test
tensorflow_datasets/audio/vctk_test
tensorflow_datasets/core/utils/gcs_utils_test
tensorflow_datasets/core/visualization/show_examples_test
tensorflow_datasets/core/as_dataframe_test
tensorflow_datasets/core/dataset_builder_notfdv_test
tensorflow_datasets/core/dataset_info_test
tensorflow_datasets/core/dataset_registered_test
tensorflow_datasets/core/lazy_imports_lib_test
tensorflow_datasets/core/tfrecords_reader_test
tensorflow_datasets/image/abstract_reasoning_test
tensorflow_datasets/image/binarized_mnist_test
tensorflow_datasets/image/celeba_test
tensorflow_datasets/image/celebahq_test
tensorflow_datasets/image/clevr_test
tensorflow_datasets/image/downsampled_imagenet_test
tensorflow_datasets/image/dsprites_test
tensorflow_datasets/image/lsun_test
tensorflow_datasets/image/shapes3d_test
tensorflow_datasets/image_classification/bigearthnet_test
tensorflow_datasets/image_classification/caltech_test
tensorflow_datasets/image_classification/cats_vs_dogs_test
tensorflow_datasets/image_classification/cbis_ddsm_test
tensorflow_datasets/image_classification/cifar10_corrupted_test
tensorflow_datasets/image_classification/colorectal_histology_test
tensorflow_datasets/image_classification/cycle_gan_test
tensorflow_datasets/image_classification/deep_weeds_test
tensorflow_datasets/image_classification/diabetic_retinopathy_detection_test
tensorflow_datasets/image_classification/horses_or_humans_test
tensorflow_datasets/image_classification/imagenet2012_corrupted_test
tensorflow_datasets/image_classification/imagenet2012_real_test
tensorflow_datasets/image_classification/imagenet2012_subset_test
tensorflow_datasets/image_classification/imagenet_test
tensorflow_datasets/image_classification/mnist_corrupted_test
tensorflow_datasets/image_classification/mnist_test
tensorflow_datasets/image_classification/patch_camelyon_test
tensorflow_datasets/image_classification/quickdraw_test
tensorflow_datasets/image_classification/rock_paper_scissors_test
tensorflow_datasets/image_classification/smallnorb_test
tensorflow_datasets/image_classification/so2sat_test
tensorflow_datasets/image_classification/svhn_test
tensorflow_datasets/image_classification/uc_merced_test
tensorflow_datasets/object_detection/kitti_test
tensorflow_datasets/structured/higgs_test
tensorflow_datasets/structured/iris_test
tensorflow_datasets/structured/titanic_test
tensorflow_datasets/summarization/big_patent_test
tensorflow_datasets/summarization/cnn_dailymail_test
tensorflow_datasets/summarization/xsum_test
tensorflow_datasets/testing/fake_data_generation/smallnorb_test
tensorflow_datasets/text/c4_test
tensorflow_datasets/text/glue_test
tensorflow_datasets/text/imdb_test
tensorflow_datasets/text/lm1b_test
tensorflow_datasets/text/multi_nli_test
tensorflow_datasets/text/wikipedia_test
tensorflow_datasets/text/wikipedia_toxicity_subtypes_test
tensorflow_datasets/translate/ted_hrlr_test
tensorflow_datasets/video/bair_robot_pushing_test
tensorflow_datasets/video/starcraft_test
tensorflow_datasets/video/ucf101_test

Rather than updating all files at once, it's best to split the work into multiple smaller PRs (of 5-10 datasets each).

Conchylicultor commented 4 years ago

Then in parallel, we should update scripts/documentation/ to use RELEASE_NOTES / builder_config.release_notes instead of the current builder.version.description.

For the display, if builder_config.release_notes exists, it should be used, otherwise, RELEASE_NOTES should be used.
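A minimal sketch of that fallback (the helper name and attribute checks are illustrative assumptions, not the actual scripts/documentation code):

def get_release_notes(builder):
  """Prefer per-config release notes, fall back to the builder-level dict."""
  config = getattr(builder, 'builder_config', None)
  if config is not None and getattr(config, 'release_notes', None):
    return config.release_notes
  return getattr(builder, 'RELEASE_NOTES', None) or {}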

NikhilBartwal commented 4 years ago

@Conchylicultor I'll start working on the changes and will send the PRs soon :)

NikhilBartwal commented 4 years ago

@Conchylicultor I ran groove_test, vctk_test and gcs_utils_test, and they seem to fail with pytest even without any change to Version(description=). Here is, for example, the complete log for gcs_utils_test with the original Version(description=) left untouched:

F:\tfds\datasets>pytest tensorflow_datasets/core/utils/gcs_utils_test.py
================================================= test session starts =================================================
platform win32 -- Python 3.7.4, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: F:\tfds\datasets, configfile: pytest.ini
collected 6 items

tensorflow_datasets\core\utils\gcs_utils_test.py FFFsFs                                                          [100%]

====================================================== FAILURES =======================================================
_________________________________________ GcsUtilsTest.test_download_dataset __________________________________________

self = <tensorflow_datasets.core.utils.gcs_utils_test.GcsUtilsTest testMethod=test_download_dataset>

    def test_download_dataset(self):
      files = [
          'dataset_info/mnist/2.0.0/dataset_info.json',
          'dataset_info/mnist/2.0.0/image.image.json',
      ]
      with self.gcs_access():
        self.assertCountEqual(
            gcs_utils.gcs_dataset_info_files('mnist/2.0.0'),
>           files,
        )
E       TypeError: 'NoneType' object is not iterable

tensorflow_datasets\core\utils\gcs_utils_test.py:42: TypeError
_______________________________________ GcsUtilsTest.test_is_dataset_accessible _______________________________________

self = <tensorflow_datasets.core.utils.gcs_utils_test.GcsUtilsTest testMethod=test_is_dataset_accessible>

    def test_is_dataset_accessible(self):
      # Re-enable GCS access. TestCase disables it.
      with self.gcs_access():
>       self.assertTrue(gcs_utils.is_dataset_on_gcs('mnist/1.0.0'))
E       AssertionError: False is not true

tensorflow_datasets\core\utils\gcs_utils_test.py:31: AssertionError
_______________________________________________ GcsUtilsTest.test_mnist _______________________________________________

self = <tensorflow_datasets.core.utils.gcs_utils_test.GcsUtilsTest testMethod=test_mnist>

    def test_mnist(self):
      with self.gcs_access():
        mnist = tfds.image_classification.MNIST(
>           data_dir=gcs_utils.gcs_path('datasets'))

tensorflow_datasets\core\utils\gcs_utils_test.py:56:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tensorflow_datasets\core\dataset_builder.py:186: in __init__
    self._data_dir_root, self._data_dir = self._build_data_dir(data_dir)
tensorflow_datasets\core\dataset_builder.py:677: in _build_data_dir
    _list_all_version_dirs(os.path.join(data_dir_root, builder_dir)))
tensorflow_datasets\core\dataset_builder.py:853: in _list_all_version_dirs
    if not tf.io.gfile.exists(root_dir):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

path = 'gs://tfds-data/datasets\\mnist'

    @tf_export("io.gfile.exists")
    def file_exists_v2(path):
      """Determines whether a path exists or not.

      Args:
        path: string, a path

      Returns:
        True if the path exists, whether it's a file or a directory.
        False if the path does not exist and there are no filesystem errors.

      Raises:
        errors.OpError: Propagates any errors reported by the FileSystem API.
      """
      try:
>       _pywrap_file_io.FileExists(compat.path_to_bytes(path))
E       tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme 'gs' not implemented (file: 'gs://tfds-data/datasets\mnist')

f:\python\lib\site-packages\tensorflow\python\lib\io\file_io.py:268: UnimplementedError
___________________________________ GcsUtilsDisabledTest.test_is_dataset_accessible ___________________________________

self = <tensorflow_datasets.core.utils.gcs_utils_test.GcsUtilsDisabledTest testMethod=test_is_dataset_accessible>

    def test_is_dataset_accessible(self):
      # Re-enable GCS access. TestCase disables it.
      with self.gcs_access():
        is_ds_on_gcs = gcs_utils.is_dataset_on_gcs('mnist/1.0.0')
>       self.assertTrue(is_ds_on_gcs)
E       AssertionError: False is not true

tensorflow_datasets\core\utils\gcs_utils_test.py:68: AssertionError
=============================================== short test summary info ===============================================
FAILED tensorflow_datasets/core/utils/gcs_utils_test.py::GcsUtilsTest::test_download_dataset - TypeError: 'NoneType' ...
FAILED tensorflow_datasets/core/utils/gcs_utils_test.py::GcsUtilsTest::test_is_dataset_accessible - AssertionError: F...
FAILED tensorflow_datasets/core/utils/gcs_utils_test.py::GcsUtilsTest::test_mnist - tensorflow.python.framework.error...
FAILED tensorflow_datasets/core/utils/gcs_utils_test.py::GcsUtilsDisabledTest::test_is_dataset_accessible - Assertion...
============================================ 4 failed, 2 skipped in 0.82s =============================================

Could you have a look at the PR to check whether I have understood the change correctly? I'm not completely sure what needs to be changed here. It would be great if you could help me out. Thanks!

vijayphoenix commented 4 years ago

By looking at the logs like

path = 'gs://tfds-data/datasets\mnist'

@NikhilBartwal I guess some of those errors are caused by the Windows OS, so they are not related to RELEASE_NOTES.

Related issue https://github.com/tensorflow/datasets/issues/1911#issuecomment-616224484 and pending part of PR https://github.com/tensorflow/datasets/pull/1916

NikhilBartwal commented 4 years ago

@Conchylicultor All files mentioned here have been updated as per the new version and release-note format. There are, however, many other datasets not mentioned here (probably because they currently do not have any tests associated with them) that still use the old Version(description=). I will update those and send PRs soon!

NikhilBartwal commented 4 years ago

@Conchylicultor I would like to work on the updated documentation. As far as I understand, we need to add a ReleaseNotesSection to dataset_markdown_builder.py and update the display() method. Could you point out what changes need to be incorporated? Thanks!

Conchylicultor commented 4 years ago

Thank you.

Rather than creating a new section, I think we should update the current version field (which already exists):

https://github.com/tensorflow/datasets/blob/4662c2cc0be04addb4f9cfd52f656bd60e367265/tensorflow_datasets/scripts/documentation/dataset_markdown_builder.py#L183

  1. We should add a _get_release_notes(builder) -> Dict[Version, str] helper: it should return either builder.builder_config.release_notes or builder.RELEASE_NOTES, and apply utils.dedent() to the description string.

  2. Rather than iterating over builder.versions, we should iterate over the union: sorted(set(builder.versions) | set(release_notes_versions)) (see the sketch below).
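A rough, self-contained sketch of both steps, keyed by version string for simplicity (the actual dataset_markdown_builder.py section classes are organized differently, and utils.dedent() is omitted here):

from typing import Dict

def _get_release_notes(builder) -> Dict[str, str]:
  """Item 1: per-config notes if present, otherwise the builder-level dict."""
  config = getattr(builder, 'builder_config', None)
  if config is not None and getattr(config, 'release_notes', None):
    return dict(config.release_notes)
  return dict(getattr(builder, 'RELEASE_NOTES', None) or {})

def _version_lines(builder):
  """Item 2: iterate over the union of supported and documented versions."""
  release_notes = _get_release_notes(builder)
  as_tuple = lambda v: tuple(int(x) for x in str(v).split('.'))
  all_versions = sorted(
      {str(v) for v in builder.versions} | set(release_notes), key=as_tuple)
  lines = []
  for version in all_versions:
    default = ' (default)' if version == str(builder.version) else ''
    note = release_notes.get(version, 'No release notes.')
    lines.append(f'* **`{version}`**{default}: {note}')
  return lines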