Any extra dataset prep needed?

sayakpaul commented 2 years ago

I have followed the instructions from README. I have set up a TPU v3-8 machine which can be confirmed below:

I have hosted ImageNet-1k (imagenet2012) in a separate bucket, structured as shown below (following the instructions from here):

[Screenshot, 2022-05-10 at 4:03:31 PM: bucket layout]

While launching training, I am using the following command:

gcloud alpha compute tpus tpu-vm ssh $NAME --zone=$ZONE --worker=all --command "TFDS_DATA_DIR=gs://imagenet-1k/tensorflow_datasets bash big_vision/run_tpu.sh big_vision.train --config big_vision/configs/vit_s16_i1k.py  --workdir gs://$GS_BUCKET_NAME/big_vision/workdir/`date '+%m-%d_%H%M'`"

It results in the following:

SSH key found in project metadata; not updating instance.
SSH: Attempting to connect to worker 0...
2022-05-10 10:30:25.858388: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-05-10 10:30:27.319919: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-05-10 10:30:27.319952: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
I0510 10:30:27.335715 140289404775488 xla_bridge.py:263] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0510 10:30:27.336199 140289404775488 xla_bridge.py:263] Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: "cuda". Available platform names are: Interpreter TPU Host
I0510 10:30:30.058175 140289404775488 train.py:65] Hello from process 0 holding 8/8 devices and writing to workdir gs://big_vision_exp/big_vision/workdir/05-10_1030.
I0510 10:30:30.568850 140289404775488 train.py:95] NOTE: Global batch size 1024 on 1 hosts results in 1024 local batch size. With 8 dev per host (8 dev total), that's a 128 per-device batch size.
I0510 10:30:30.570343 140289404775488 train.py:95] NOTE: Initializing train dataset...
I0510 10:30:31.039579 140289404775488 dataset_info.py:522] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.1.0
I0510 10:30:31.303886 140289404775488 dataset_info.py:439] Load dataset info from /tmp/tmpggpl8znitfds
I0510 10:30:31.308489 140289404775488 dataset_info.py:492] Field info.description from disk and from code do not match. Keeping the one from code.
I0510 10:30:31.308714 140289404775488 dataset_info.py:492] Field info.release_notes from disk and from code do not match. Keeping the one from code.
I0510 10:30:31.308900 140289404775488 dataset_info.py:492] Field info.supervised_keys from disk and from code do not match. Keeping the one from code.
I0510 10:30:31.308959 140289404775488 dataset_info.py:492] Field info.module_name from disk and from code do not match. Keeping the one from code.
I0510 10:30:31.309248 140289404775488 logging_logger.py:44] Constructing tf.data.Dataset imagenet2012 for split _EvenSplit(split='train[:99%]', index=0, count=1, drop_remainder=False), from gs://imagenet-1k/tensorflow_datasets/imagenet2012/5.1.0
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/spsayakpaul/big_vision/train.py", line 372, in <module>
    app.run(main)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/spsayakpaul/big_vision/train.py", line 122, in main
    train_ds = input_pipeline.make_for_train(
  File "/home/spsayakpaul/big_vision/input_pipeline.py", line 69, in make_for_train
    data, _ = get_dataset_tfds(dataset=dataset, split=split,
  File "/home/spsayakpaul/big_vision/input_pipeline.py", line 53, in get_dataset_tfds
    return builder.as_dataset(
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/logging/__init__.py", line 81, in decorator
    return function(*args, **kwargs)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 565, in as_dataset
    raise AssertionError(
AssertionError: Dataset imagenet2012: could not find data in gs://imagenet-1k/tensorflow_datasets. Please make sure to call dataset_builder.download_and_prepare(), or pass download=True to tfds.load() before trying to access the tf.data.Dataset object.

Is there anything I'm missing here?
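For reference, the TFDS log lines in this thread show prepared data being looked up under `<data_dir>/<name>[/<config>]/<version>`, for example `gs://imagenet-1k/tensorflow_datasets/imagenet2012/5.1.0`. A minimal sketch for checking whether that directory exists before launching training is given here. The helper names `prepared_dir` and `is_prepared` are hypothetical and not part of big_vision or TFDS. The existence check assumes TensorFlow is installed and the bucket is readable from the machine.

```python
from typing import Callable, Optional


def prepared_dir(data_dir: str, name: str, version: str,
                 config: Optional[str] = None) -> str:
    """Build <data_dir>/<name>[/<config>]/<version>, as seen in the TFDS logs."""
    parts = [data_dir.rstrip("/"), name]
    if config:
        parts.append(config)
    parts.append(version)
    return "/".join(parts)


def is_prepared(path: str,
                exists_fn: Optional[Callable[[str], bool]] = None) -> bool:
    """Return True if the prepared-dataset directory exists.

    exists_fn is injectable for testing; by default tf.io.gfile.exists is used,
    which also understands gs:// paths.
    """
    if exists_fn is None:
        # Lazy import so the path helper works without TensorFlow installed.
        import tensorflow as tf
        exists_fn = tf.io.gfile.exists
    return bool(exists_fn(path))


if __name__ == "__main__":
    path = prepared_dir("gs://imagenet-1k/tensorflow_datasets",
                        "imagenet2012", "5.1.0")
    print(path)
```

If the directory is missing, the raw archives in the bucket have most likely not been converted by TFDS yet, which would match the `AssertionError` above.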

akolesnikoff commented 2 years ago

You are on the right track, but one step is still missing.

After manually downloading the dataset, you need to run tfds once to reformat the data. We provide a script for this: big_vision/tools/download_tfds_datasets.py.

As indicated in the README, to launch data formatting on a TPU machine you could run

gcloud alpha compute tpus tpu-vm ssh $NAME --zone=$ZONE --worker=0 --command "TFDS_DATA_DIR=gs://imagenet-1k/tensorflow_datasets bash big_vision/run_tpu.sh big_vision.tools.download_tfds_datasets imagenet2012"

Alternatively, you can even do it on your local machine by directly running the util, assuming the local machine has access to the cloud bucket.

Let us know whether it works for you. Leaving the issue open for now.

sayakpaul commented 2 years ago

Thank you! Giving it a try right now.

sayakpaul commented 2 years ago

gcloud alpha compute tpus tpu-vm ssh $NAME --zone=$ZONE --worker=0 --command "TFDS_DATA_DIR=gs://imagenet-1k/tensorflow_datasets bash big_vision/run_tpu.sh big_vision.tools.download_tfds_datasets imagenet2012"

leads to:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/spsayakpaul/big_vision/tools/download_tfds_datasets.py", line 43, in <module>
    app.run(main)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/spsayakpaul/big_vision/tools/download_tfds_datasets.py", line 39, in main
    tfds.load(name=d, data_dir="~/tensorflow_datasets/", download=True)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/load.py", line 325, in load
    dbuilder.download_and_prepare(**download_and_prepare_kwargs)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 462, in download_and_prepare
    self._download_and_prepare(
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1157, in _download_and_prepare
    split_generators = self._split_generators(  # pylint: disable=unexpected-keyword-arg
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/image_classification/imagenet.py", line 223, in _split_generators
    train_path = os.path.join(dl_manager.manual_dir, 'ILSVRC2012_img_train.tar')
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 152, in __get__
    cached = self.fget(obj)  # pytype: disable=attribute-error
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/download/download_manager.py", line 649, in manual_dir
    raise AssertionError(
AssertionError: Manual directory /home/spsayakpaul/tensorflow_datasets/downloads/manual does not exist or is empty. Create it and download/extract dataset artifacts in there using instructions:
manual_dir should contain two files: ILSVRC2012_img_train.tar and
ILSVRC2012_img_val.tar.
You need to register on http://www.image-net.org/download-images in order
to get the link to download the dataset.

sayakpaul commented 2 years ago

Will running the following help?

import tensorflow_datasets as tfds

data_dir = "gs://imagenet-1k/tensorflow_datasets"
builder = tfds.builder("imagenet2012", data_dir=data_dir)
builder.download_and_prepare()
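The earlier error suggests TFDS looks for the raw archives in `<data_dir>/downloads/manual` by default. A hedged variant of the snippet above, for the case where `ILSVRC2012_img_train.tar` and `ILSVRC2012_img_val.tar` live elsewhere, passes an explicit `manual_dir` through `tfds.download.DownloadConfig`. The helper names `default_manual_dir` and `prepare_imagenet` are hypothetical, and this sketch has not been run against the bucket in this thread.

```python
from typing import Optional


def default_manual_dir(data_dir: str) -> str:
    """Location where TFDS appears to expect manually downloaded archives."""
    return data_dir.rstrip("/") + "/downloads/manual"


def prepare_imagenet(data_dir: str, manual_dir: Optional[str] = None) -> None:
    """Convert the raw ImageNet tar files into the TFDS on-disk format."""
    # Imported lazily so the path helper can be used without TFDS installed.
    import tensorflow_datasets as tfds

    config = tfds.download.DownloadConfig(
        manual_dir=manual_dir or default_manual_dir(data_dir))
    builder = tfds.builder("imagenet2012", data_dir=data_dir)
    builder.download_and_prepare(download_config=config)


if __name__ == "__main__":
    print(default_manual_dir("gs://imagenet-1k/tensorflow_datasets"))
```

Calling `prepare_imagenet("gs://imagenet-1k/tensorflow_datasets")` would then write the prepared files next to the raw ones in the same bucket.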

sayakpaul commented 2 years ago

The error in https://github.com/google-research/big_vision/issues/2#issuecomment-1122519532 is expected I think since

https://github.com/google-research/big_vision/blob/8ca9d84a82d40f3245b5ab2daac5c2405b223351/big_vision/tools/download_tfds_datasets.py#L39

data_dir is already set here.
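One possible local patch, sketched under the assumption that the script should honor the `TFDS_DATA_DIR` variable set in the launch command: resolve the directory from the environment and fall back to the hard-coded default only when the variable is unset. The function name `resolve_data_dir` is hypothetical, and this is not the change the maintainers made.

```python
import os
from typing import Mapping, Optional

# The value currently hard-coded in download_tfds_datasets.py (see traceback).
DEFAULT_DATA_DIR = "~/tensorflow_datasets/"


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer TFDS_DATA_DIR from the environment, else use the default."""
    env = os.environ if env is None else env
    # An empty value is treated the same as an unset variable.
    return env.get("TFDS_DATA_DIR") or DEFAULT_DATA_DIR


if __name__ == "__main__":
    print(resolve_data_dir())
```

The result could then be passed as `data_dir=resolve_data_dir()` in the `tfds.load` call on the referenced line.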

akolesnikoff commented 2 years ago

Yeah, sorry, you likely need to manually override that variable as you suggested.

Let me know if you eventually succeed. In any case, once I have time, I will update the readme with well-tested instructions to get imagenet data to work.

sayakpaul commented 2 years ago

Sure!

I am currently running this:

https://github.com/google-research/big_vision/issues/2#issuecomment-1122533163

sayakpaul commented 2 years ago

Update.

This is the current error (I faced one regarding imagenet2012_real but was able to quickly resolve it):

11 01:50:04.274455 139742227639360 logging_logger.py:44] Constructing tf.data.Dataset imagenet_v2 for split _EvenSplit(split='test', index=0, count=1, drop_remainder=False), from gs://imagenet-1k/tensorflow_datasets/imagenet_v2/matched-frequency/3.0.0
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/spsayakpaul/big_vision/train.py", line 372, in <module>
    app.run(main)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/spsayakpaul/big_vision/train.py", line 270, in main
    evaluators = eval_common.from_config(
  File "/home/spsayakpaul/big_vision/evaluators/common.py", line 37, in from_config
    evaluator = module.Evaluator(model, **cfg)
  File "/home/spsayakpaul/big_vision/evaluators/classification.py", line 34, in __init__
    self.ds, self.steps = input_pipeline.make_for_inference(
  File "/home/spsayakpaul/big_vision/input_pipeline.py", line 97, in make_for_inference
    data, _ = get_dataset_tfds(dataset=dataset, split=split,
  File "/home/spsayakpaul/big_vision/input_pipeline.py", line 53, in get_dataset_tfds
    return builder.as_dataset(
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/logging/__init__.py", line 81, in decorator
    return function(*args, **kwargs)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 565, in as_dataset
    raise AssertionError(
AssertionError: Dataset imagenet_v2: could not find data in gs://imagenet-1k/tensorflow_datasets. Please make sure to call dataset_builder.download_and_prepare(), or pass download=True to tfds.load() before trying to access the tf.data.Dataset object.

Currently doing (after installing tfds-nightly):

import tensorflow_datasets as tfds

data_dir = "gs://imagenet-1k/tensorflow_datasets"
ds = tfds.load("imagenet_v2", data_dir=data_dir, download=True)

It seems to be taking longer than expected, but I will keep posting updates. I am maintaining a log here:

https://gist.github.com/sayakpaul/9544d3ba935805bd47d71fd8596e7bc0 (not yet complete).
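Since the failures in this thread surfaced one dataset at a time (imagenet2012, then imagenet2012_real, then imagenet_v2), it may be convenient to prepare all of them in one pass before launching training. The sketch below is a hypothetical helper, `prepare_all`, wrapped around the same `tfds.load(..., download=True)` call used above. The dataset list is taken from this thread and may not match every config.

```python
from typing import Callable, Dict, Iterable, Optional


def prepare_all(names: Iterable[str], data_dir: str,
                load_fn: Optional[Callable[..., object]] = None) -> Dict[str, str]:
    """Try to download and prepare each dataset, recording the outcome per name."""
    if load_fn is None:
        # Lazy import; load_fn is injectable so the loop can be tested offline.
        import tensorflow_datasets as tfds
        load_fn = tfds.load

    status = {}
    for name in names:
        try:
            load_fn(name, data_dir=data_dir, download=True)
            status[name] = "ok"
        except AssertionError as err:
            # TFDS raised AssertionError in the tracebacks shown in this thread.
            status[name] = f"failed: {err}"
    return status


if __name__ == "__main__":
    datasets = ["imagenet2012", "imagenet2012_real", "imagenet_v2"]
    for name, result in prepare_all(
            datasets, "gs://imagenet-1k/tensorflow_datasets").items():
        print(name, result)
```

Collecting the outcomes in a dictionary keeps one failing dataset from hiding the state of the others.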

sayakpaul commented 2 years ago

Looks like I was able to get things up and running.

I have also updated the gist I mentioned in https://github.com/google-research/big_vision/issues/2#issuecomment-1123162532.

Keeping it open until the training completes.

sayakpaul commented 2 years ago

Was able to reproduce everything (76.23% on the ImageNet-1k validation set) within 90 epochs of pre-training on a TPU v3-8 (7 hours 22 minutes in total).

The following repository contains everything including the updated instructions, training logs, and the checkpoints:

https://github.com/sayakpaul/big_vision_experiments
