tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

downsampled_imagenet broken #4662

Open marikgoldstein opened 1 year ago

marikgoldstein commented 1 year ago

Hi TFDS,

downsampled_imagenet (32x32) gives a 404 (stack trace at the end of this issue). This is because the ImageNet link stored by TFDS (https://image-net.org/small/download.php) is broken. The same broken link also appears in some papers, such as Pixel Recurrent Neural Networks.
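The broken link can be confirmed directly. A minimal stdlib-only sketch, using the train archive URL that TFDS tries to fetch (per the stack trace at the end of this issue):

```python
import urllib.error
import urllib.request

# The archive URL TFDS attempts to download (see the DownloadError below).
OLD_URL = "https://image-net.org/small/train_32x32.tar"

def url_is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if `url` responds without an HTTP or connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.URLError:  # HTTPError is a subclass of URLError
        return False

# As of this issue, url_is_reachable(OLD_URL) returns False (the server answers 404).
```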

There is a different, currently working NEW link for 32x32 ImageNet (https://image-net.org/download-images.php; if you log in, you can see a 32x32 option).

Let us refer to them as OLD (what TFDS used to download) and NEW (currently on the ImageNet website).

An anonymous ICLR reviewer (see "Weaknesses" under reviewer AKwV) mentioned that NEW is "too easy" and cannot be used to compare against old results that used OLD. The reviewer also mentioned that OLD circulates in the community via a torrent.

TFDS's link to OLD likely broke within the last 9 months, since another Google repo shared code that uses TFDS to get downsampled_imagenet (I left an issue there: https://github.com/google-research/vdm/issues/8), and their datasets.py file was pushed then.

Neither OLD nor NEW is the same dataset as imagenet_resized.

Purpose:

Possible solution:

Examples of research using OLD

Some ICLR publications from this year already use NEW.

Thanks! Mark

Environment information

Yes

Reproduction instructions

import tensorflow_datasets as tfds
# Fails with a DownloadError (HTTP 404) while downloading, see logs below.
ds = tfds.load('downsampled_imagenet', split='validation', as_supervised=True, batch_size=128)

Link to logs

2023-01-18 12:03:50.178320: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-18 12:03:51.793197: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /home/marik/tensorflow_datasets/downsampled_imagenet/32x32/2.0.0...
Dl Size...: 0 MiB [00:00, ? MiB/s]                                                                                                                                       | 0/2 [00:00<?, ? url/s]
Dl Completed...:   0%|                                                                                                                                                   | 0/2 [00:00<?, ? url/s]
Traceback (most recent call last):
  File "/home/marik/imnet2.py", line 2, in <module>
    ds = tfds.load('downsampled_imagenet', split='validation', as_supervised=True, batch_size=128)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/logging/__init__.py", line 250, in decorator
    return function(*args, **kwargs)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/load.py", line 575, in load
    dbuilder.download_and_prepare(**download_and_prepare_kwargs)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/dataset_builder.py", line 523, in download_and_prepare
    self._download_and_prepare(
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1244, in _download_and_prepare
    split_generators = self._split_generators(  # pylint: disable=unexpected-keyword-arg
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/image/downsampled_imagenet.py", line 102, in _split_generators
    train_path, valid_path = dl_manager.download([
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 552, in download
    return _map_promise(self._download, url_or_urls)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 770, in _map_promise
    res = tf.nest.map_structure(lambda p: p.get(), all_promises)  # Wait promises
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 917, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 917, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 770, in <lambda>
    res = tf.nest.map_structure(lambda p: p.get(), all_promises)  # Wait promises
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/promise/promise.py", line 512, in get
    return self._target_settled_value(_raise=True)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/promise/promise.py", line 516, in _target_settled_value
    return self._target()._settled_value(_raise)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/promise/promise.py", line 226, in _settled_value
    reraise(type(raise_val), raise_val, self._traceback)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/promise/promise.py", line 844, in handle_future_result
    resolve(future.result())
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/downloader.py", line 217, in _sync_download
    with _open_url(url, verify=verify) as (response, iter_content):
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/downloader.py", line 279, in _open_with_requests
    _assert_status(response)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/downloader.py", line 306, in _assert_status
    raise DownloadError('Failed to get url {}. HTTP code: {}.'.format(
tensorflow_datasets.core.download.downloader.DownloadError: Failed to get url https://image-net.org/small/train_32x32.tar. HTTP code: 404.
marikgoldstein commented 1 year ago

I also reached out to the ImageNet moderators for their input and will post any response here.

marikgoldstein commented 1 year ago

@Kim-Dongjun provided a good explanation and shared the location of the torrent that people use for the original data from Pixel RNN. Here is Dongjun's explanation of the discrepancy (which also matches what I've heard from some authors at talks and conferences):
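For anyone who obtains the OLD archives from the torrent (e.g. train_32x32.tar), here is a hedged, stdlib-only sketch of iterating over the archive without TFDS. It assumes the OLD data is a tar of PNG files, as in the original Pixel RNN release; decoding the raw bytes into arrays is left to your image library of choice (PIL, tf.io.decode_png, etc.):

```python
import tarfile
from typing import Iterator, Tuple

def iter_tar_images(tar_path: str, suffix: str = ".png") -> Iterator[Tuple[str, bytes]]:
    """Yield (member_name, raw_bytes) for every file in the tar matching `suffix`.

    Streams the archive member by member, so the full dataset is never
    held in memory at once.
    """
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                fobj = tar.extractfile(member)
                if fobj is not None:
                    yield member.name, fobj.read()

# Hypothetical usage, assuming a local copy of the OLD archive:
# for name, data in iter_tar_images("train_32x32.tar"):
#     img = decode_png_somehow(data)  # e.g. PIL.Image.open(io.BytesIO(data))
```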

marikgoldstein commented 1 year ago

Here is a summary.

For ImageNet 32x32, some papers use an "old" version and some use a "new" version. My understanding is:

My proposals are

Thanks; I'm curious to hear others' take on this issue and would appreciate confirmation.