tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Help needed using TFDS from config file in a Docker container #3574

Open · esparig opened this issue 2 years ago

esparig commented 2 years ago

Coming from the issue on how to train a ResNet50 using ImageNet from scratch.

What I need help with / What I was wondering: I'm trying to train a ResNet50 from the TF Model Garden on ImageNet, from scratch. I need to prepare the dataset, and I'm trying to use TFDS (loaded from a YAML config file, as suggested in the previously opened issue).
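
For reference, here is a minimal sketch of the TFDS-related overrides such a config carries (the exact contents of my custom_tfds.yaml are not reproduced here; the field names and nesting match the parameter dump further down):

task:
  train_data:
    tfds_name: 'imagenet2012'
    tfds_split: 'train'
    tfds_data_dir: ''   # empty string falls back to the TFDS default, ~/tensorflow_datasets
    input_path: ''
  validation_data:
    tfds_name: 'imagenet2012'
    tfds_split: 'validation'
    tfds_data_dir: ''
    input_path: ''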

I got an error saying "Not enough disk space", but I do have more than 200 GB available. Any further suggestions?

Note that I need to execute everything from a Docker container, because this container will be used to test several infrastructures.

What I've tried so far. Here you can see:
1) The ImageNet data is downloaded.
2) I mount that volume in a Docker container.
3) Inside that container I run train.py from Model Garden, as indicated in the previous issue; a configuration is generated as shown.
4) I get the error: OSError: Not enough disk space. Needed: 155.84 GiB.
5) However, df shows that there is more than 200 GB available.

ubuntu@gpu-estibaliz:/hdd500/data/imagenet_tars/imagenet$ ll

total 151020536
drwxr-xr-x 6 root root         4096 Nov 17 11:53 ./
drwxr-xr-x 3 root root         4096 Nov 17 11:53 ../
drwxr-xr-x 4 2016 2016         4096 Jun 14  2012 ILSVRC2012_devkit_t12/
-rw-r--r-- 1 root root      2568145 Jun 15  2012 ILSVRC2012_devkit_t12.tar.gz
-rw-r--r-- 1 root root 147897477120 Jun 14  2012 ILSVRC2012_img_train.tar
-rw-r--r-- 1 root root   6744924160 Jun 14  2012 ILSVRC2012_img_val.tar
drwxr-xr-x 2 root root         4096 Sep 21 16:28 __pycache__/
drwxr-xr-x 2 root root         4096 Nov 17 11:28 experiments/
-rw-r--r-- 1 root root        11629 Sep 21 16:22 imagenet.py
drwxr-xr-x 2 root root         4096 Nov 17 11:32 model_checkpoints/

ubuntu@gpu-estibaliz:/hdd500/data/imagenet_tars/imagenet$ ls experiments/

custom_tfds.yaml  gpu.yaml  imagenet_resnet50_gpu.yaml  imagenet_resnet50_gpu_custom.yaml

ubuntu@gpu-estibaliz:/hdd500/data/imagenet_tars/imagenet$ sudo docker run --net=host  -it --gpus all -v /hdd500/data/imagenet_tars/imagenet:/root/tensorflow_datasets/downloads/manual manualresnet50 /bin/bash

________                               _______________                
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ / 
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/

WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

root@gpu-estibaliz:/# ls /root/tensorflow_datasets/downloads/manual/

ILSVRC2012_devkit_t12  ILSVRC2012_devkit_t12.tar.gz  ILSVRC2012_img_train.tar  ILSVRC2012_img_val.tar  __pycache__  experiments  imagenet.py  model_checkpoints

root@gpu-estibaliz:/# python3 /usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py --experiment=resnet_imagenet --config_file=/root/tensorflow_datasets/downloads/manual/experiments/custom_tfds.yaml --mode=train_and_eval --model_dir=/root/tensorflow_datasets/downloads/manual/model_checkpoints

2021-11-17 11:55:13.435231: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.449271: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.449582: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I1117 11:55:13.472259 140716787398464 train_utils.py:292] Final experiment parameters:
{'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': False,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': 'dynamic',
             'mixed_precision_dtype': 'float16',
             'num_cores_per_replica': 1,
             'num_gpus': 1,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': False,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'task': {'evaluation': {'top_k': 5},
          'init_checkpoint': None,
          'init_checkpoint_modules': 'all',
          'losses': {'l2_weight_decay': 0.0001,
                     'label_smoothing': 0.1,
                     'one_hot': True},
          'model': {'add_head_batch_norm': False,
                    'backbone': {'resnet': {'depth_multiplier': 1.0,
                                            'model_id': 50,
                                            'replace_stem_max_pool': False,
                                            'resnetd_shortcut': False,
                                            'se_ratio': 0.0,
                                            'stem_type': 'v0',
                                            'stochastic_depth_drop_rate': 0.0},
                                 'type': 'resnet'},
                    'dropout_rate': 0.0,
                    'input_size': [224, 224, 3],
                    'norm_activation': {'activation': 'relu',
                                        'norm_epsilon': 1e-05,
                                        'norm_momentum': 0.9,
                                        'use_sync_bn': False},
                    'num_classes': 1001},
          'model_output_keys': [],
          'train_data': {'aug_policy': None,
                         'aug_rand_hflip': True,
                         'aug_type': None,
                         'block_length': 1,
                         'cache': False,
                         'cycle_length': 10,
                         'decode_jpeg_only': True,
                         'deterministic': None,
                         'drop_remainder': True,
                         'dtype': 'float16',
                         'enable_tf_data_service': False,
                         'file_type': 'tfrecord',
                         'global_batch_size': 256,
                         'image_field_key': 'image/encoded',
                         'input_path': '',
                         'is_multilabel': False,
                         'is_training': True,
                         'label_field_key': 'image/class/label',
                         'randaug_magnitude': 10,
                         'seed': None,
                         'sharding': True,
                         'shuffle_buffer_size': 10000,
                         'tf_data_service_address': None,
                         'tf_data_service_job_name': None,
                         'tfds_as_supervised': False,
                         'tfds_data_dir': '',
                         'tfds_name': 'imagenet2012',
                         'tfds_skip_decoding_feature': '',
                         'tfds_split': 'train'},
          'validation_data': {'aug_policy': None,
                              'aug_rand_hflip': True,
                              'aug_type': None,
                              'block_length': 1,
                              'cache': False,
                              'cycle_length': 10,
                              'decode_jpeg_only': True,
                              'deterministic': None,
                              'drop_remainder': False,
                              'dtype': 'float16',
                              'enable_tf_data_service': False,
                              'file_type': 'tfrecord',
                              'global_batch_size': 256,
                              'image_field_key': 'image/encoded',
                              'input_path': '',
                              'is_multilabel': False,
                              'is_training': False,
                              'label_field_key': 'image/class/label',
                              'randaug_magnitude': 10,
                              'seed': None,
                              'sharding': True,
                              'shuffle_buffer_size': 10000,
                              'tf_data_service_address': None,
                              'tf_data_service_job_name': None,
                              'tfds_as_supervised': False,
                              'tfds_data_dir': '',
                              'tfds_name': 'imagenet2012',
                              'tfds_skip_decoding_feature': '',
                              'tfds_split': 'validation'}},
 'trainer': {'allow_tpu_summary': False,
             'best_checkpoint_eval_metric': '',
             'best_checkpoint_export_subdir': '',
             'best_checkpoint_metric_comp': 'higher',
             'checkpoint_interval': 625,
             'continuous_eval_timeout': 3600,
             'eval_tf_function': True,
             'eval_tf_while_loop': False,
             'loss_upper_bound': 1000000.0,
             'max_to_keep': 5,
             'optimizer_config': {'ema': None,
                                  'learning_rate': {'stepwise': {'boundaries': [18750,
                                                                                37500,
                                                                                50000],
                                                                 'name': 'PiecewiseConstantDecay',
                                                                 'offset': 0,
                                                                 'values': [0.8,
                                                                            0.08,
                                                                            0.008,
                                                                            0.0008]},
                                                    'type': 'stepwise'},
                                  'optimizer': {'sgd': {'clipnorm': None,
                                                        'clipvalue': None,
                                                        'decay': 0.0,
                                                        'global_clipnorm': None,
                                                        'momentum': 0.9,
                                                        'name': 'SGD',
                                                        'nesterov': False},
                                                'type': 'sgd'},
                                  'warmup': {'linear': {'name': 'linear',
                                                        'warmup_learning_rate': 0,
                                                        'warmup_steps': 3125},
                                             'type': 'linear'}},
             'recovery_begin_steps': 0,
             'recovery_max_trials': 0,
             'steps_per_loop': 625,
             'summary_interval': 625,
             'train_steps': 56250,
             'train_tf_function': True,
             'train_tf_while_loop': True,
             'validation_interval': 625,
             'validation_steps': 25,
             'validation_summary_subdir': 'validation'}}
I1117 11:55:13.474529 140716787398464 train_utils.py:303] Saving experiment configuration to /root/tensorflow_datasets/downloads/manual/model_checkpoints/params.yaml
2021-11-17 11:55:13.493205: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: Tesla V100-PCIE-32GB, compute capability 7.0
I1117 11:55:13.493584 140716787398464 device_compatibility_check.py:121] Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: Tesla V100-PCIE-32GB, compute capability 7.0
2021-11-17 11:55:13.494813: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-17 11:55:13.495692: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.496071: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.496370: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.416536: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.416872: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.417060: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.417290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30997 MB memory:  -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1117 11:55:15.080295 140716787398464 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1117 11:55:15.082595 140716787398464 train_utils.py:214] Running default trainer.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.146004 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.149153 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.152038 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.152982 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.159561 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.162897 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.428227 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.430712 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.434357 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.435827 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2021-11-17 11:55:17.963453: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Failed precondition: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
I1117 11:55:18.894689 140716787398464 dataset_info.py:443] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.1.0
I1117 11:55:19.755020 140716787398464 dataset_info.py:358] Load dataset info from /tmp/tmpnvfqzrkttfds
I1117 11:55:19.761893 140716787398464 dataset_info.py:413] Field info.description from disk and from code do not match. Keeping the one from code.
I1117 11:55:19.762345 140716787398464 dataset_info.py:413] Field info.module_name from disk and from code do not match. Keeping the one from code.
I1117 11:55:19.762815 140716787398464 dataset_builder.py:400] Generating dataset imagenet2012 (/root/tensorflow_datasets/imagenet2012/5.1.0)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 70, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 63, in main
    model_dir=model_dir)
  File "/usr/local/lib/python3.6/dist-packages/official/core/train_lib.py", line 78, in run_experiment
    params, model_dir))
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.6/dist-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/official/core/train_utils.py", line 225, in create_trainer
    checkpoint_exporter=checkpoint_exporter)
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.6/dist-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/official/core/base_trainer.py", line 259, in __init__
    self.task.build_inputs, self.config.task.train_data)
  File "/usr/local/lib/python3.6/dist-packages/official/core/base_trainer.py", line 159, in distribute_dataset
    *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/orbit/utils/common.py", line 85, in make_distributed_dataset
    return strategy.distribute_datasets_from_function(dataset_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1161, in distribute_datasets_from_function
    dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 589, in _distribute_datasets_from_function
    options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 169, in get_distributed_datasets_from_function
    input_contexts, dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1579, in __init__
    input_contexts, self._input_workers, dataset_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 2327, in _create_datasets_from_function_with_input_context
    dataset = dataset_fn(ctx)
  File "/usr/local/lib/python3.6/dist-packages/orbit/utils/common.py", line 83, in dataset_fn
    return dataset_or_fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/tasks/image_classification.py", line 119, in build_inputs
    dataset = reader.read(input_context=input_context)
  File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 415, in read
    self._tfds_builder)
  File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 335, in _read_decode_and_parse_dataset
    dataset = self._read_tfds(input_context)
  File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 268, in _read_tfds
    self._tfds_builder.download_and_prepare()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 409, in download_and_prepare
    self.info.dataset_size,
OSError: Not enough disk space. Needed: 155.84 GiB (download: Unknown size, generated: 155.84 GiB)
  In call to configurable 'Trainer' (<class 'official.core.base_trainer.Trainer'>)
  In call to configurable 'create_trainer' (<function create_trainer at 0x7ffaa809a1e0>)

root@gpu-estibaliz:/# df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay          79G   12G   63G  16% /
tmpfs            64M     0   64M   0% /dev
tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
/dev/vdb         79G   12G   63G  16% /etc/hosts
/dev/vdd1       492G  266G  201G  57% /root/tensorflow_datasets/downloads/manual
tmpfs           7.7G   12K  7.7G   1% /proc/driver/nvidia
/dev/vda1        14G  3.6G  9.9G  27% /usr/bin/nvidia-smi
tmpfs           1.6G  996K  1.6G   1% /run/nvidia-persistenced/socket
udev            7.7G     0  7.7G   0% /dev/nvidia0
tmpfs           7.7G     0  7.7G   0% /proc/acpi
tmpfs           7.7G     0  7.7G   0% /proc/scsi
tmpfs           7.7G     0  7.7G   0% /sys/firmware

It would be nice if... anyone had a suggestion about what I've missed.

Environment information. Here is the Dockerfile:

# https://hub.docker.com/r/tensorflow/tensorflow
FROM tensorflow/tensorflow:2.6.0-gpu

RUN python3 -m pip install --upgrade pip

# https://github.com/tensorflow/models/tree/master/official
RUN pip install tf-models-official==2.6.0

# mount here the volume with imagenet downloaded data
RUN mkdir -p /root/tensorflow_datasets/downloads/manual
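
For completeness, the image is built with something like the following (the tag name is taken from the docker run command above; a sketch, not the exact command used):

sudo docker build -t manualresnet50 .

and then run with the docker run command shown earlier, mounting the ImageNet volume at /root/tensorflow_datasets/downloads/manual.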
Conchylicultor commented 2 years ago

Rather than generating ImageNet inside the Docker image, could you pre-generate the ImageNet .tfrecord files (e.g. with the tfds build imagenet2012 --manual_dir=... CLI), then only package ~/tensorflow_datasets/imagenet2012/... rather than the original ILSVRC2012_devkit_t12.tar.gz?
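
Roughly along these lines (paths illustrative):

# On the host: generate the TFRecords once, outside the container.
tfds build imagenet2012 \
    --manual_dir=/hdd500/data/imagenet_tars/imagenet \
    --data_dir=/hdd500/data/tensorflow_datasets
# Then mount the prepared dataset instead of the raw tars, e.g.:
#   -v /hdd500/data/tensorflow_datasets:/root/tensorflow_datasets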

esparig commented 2 years ago

Are you saying that it is not possible to use TFDS from a TensorFlow Docker container? Why? Is it a problem with TF or with Docker? I already did the conversion to .tfrecord files using another script, but I wanted to try TFDS... The thing is, I need to use containers on that host.

Conchylicultor commented 2 years ago

I was just proposing a workaround. Could you try to comment out the following line: https://github.com/tensorflow/datasets/blob/30024eefca3aa0783e2374af32766717267335d0/tensorflow_datasets/core/dataset_builder.py#L404

We're using shutil.disk_usage to estimate the available space; this might conflict with Docker for some reason. Commenting the line out will skip the check, so we can be sure this is the cause.
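
For reference, the check is roughly equivalent to this (a simplified sketch of what dataset_builder.py does; the path and size are taken from the traceback above):

import shutil

# TFDS compares the free space at the target data_dir against the
# expected download + generated size before building the dataset.
needed = int(155.84 * 2**30)  # "Needed: 155.84 GiB" from the error above
free = shutil.disk_usage('/root/tensorflow_datasets').free
if free < needed:
    raise OSError(f'Not enough disk space. Needed: {needed / 2**30:.2f} GiB')

Worth noting: the check runs against the data_dir where the dataset is generated (/root/tensorflow_datasets here), not against downloads/manual; in the df output above, only the latter is mounted on the large volume.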

esparig commented 2 years ago

This is strange: I did what you suggested, and I get the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 23, in <module>
    from official.common import registry_imports
  File "/usr/local/lib/python3.6/dist-packages/official/common/registry_imports.py", line 18, in <module>
    from official.nlp.configs import experiment_configs
  File "/usr/local/lib/python3.6/dist-packages/official/nlp/configs/experiment_configs.py", line 17, in <module>
    from official.nlp.configs import finetuning_experiments
  File "/usr/local/lib/python3.6/dist-packages/official/nlp/configs/finetuning_experiments.py", line 20, in <module>
    from official.nlp.data import question_answering_dataloader
  File "/usr/local/lib/python3.6/dist-packages/official/nlp/data/question_answering_dataloader.py", line 21, in <module>
    from official.core import input_reader
  File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 21, in <module>
    import tensorflow_datasets as tfds
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/__init__.py", line 43, in <module>
    from tensorflow_datasets.core import tf_compat
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/__init__.py", line 26, in <module>
    from tensorflow_datasets.core import community  # pylint: disable=g-bad-import-order
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/community/__init__.py", line 18, in <module>
    from tensorflow_datasets.core.community.huggingface_wrapper import mock_builtin_to_use_gfile
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/community/huggingface_wrapper.py", line 28, in <module>
    from tensorflow_datasets.core import dataset_builder
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 44, in <module>
    from tensorflow_datasets.core.utils import file_utils
ImportError: cannot import name 'file_utils'

Checking the contents of /usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/utils/, I realized that there is no file_utils.py. I tried to solve the issue by installing the latest version of tensorflow-datasets with pip install tensorflow-datasets, but it says that the requirement is already satisfied:

 ---> Running in 2efd9d71d192
Requirement already satisfied: tensorflow-datasets in /usr/local/lib/python3.6/dist-packages (4.4.0)
Requirement already satisfied: termcolor in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (1.1.0)
Requirement already satisfied: dataclasses in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (0.8)
Requirement already satisfied: protobuf>=3.12.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (3.17.3)
Requirement already satisfied: promise in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (2.3)
Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (2.26.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (1.15.0)
Requirement already satisfied: tensorflow-metadata in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (1.2.0)
Requirement already satisfied: importlib-resources in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (5.4.0)
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (0.18.2)
Requirement already satisfied: dill in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (0.3.4)
Requirement already satisfied: attrs>=18.1.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (21.2.0)
Requirement already satisfied: absl-py in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (0.12.0)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (3.7.4.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (4.62.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets) (1.19.5)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->tensorflow-datasets) (2021.5.30)
Requirement already satisfied: idna<4,>=2.5 in /usr/lib/python3/dist-packages (from requests>=2.19.0->tensorflow-datasets) (2.6)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->tensorflow-datasets) (2.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->tensorflow-datasets) (1.26.6)
Requirement already satisfied: zipp>=3.1.0 in /usr/local/lib/python3.6/dist-packages (from importlib-resources->tensorflow-datasets) (3.5.0)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-metadata->tensorflow-datasets) (1.53.0)
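
(As far as I can tell, a plain pip install leaves an already-satisfied requirement untouched; forcing an update would need something like:

pip install --upgrade tensorflow-datasets

but 4.4.0 already seems to be the installed version.)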

Any ideas?

Conchylicultor commented 2 years ago

I think file_utils was added in tfds-nightly and is not yet present in tensorflow_datasets. You could try pip install tfds-nightly.
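
For example (uninstalling the stable package first, since both distributions install the same tensorflow_datasets module):

pip uninstall -y tensorflow-datasets
pip install tfds-nightly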

husenzhang commented 2 years ago

I was just proposing a workaround. Could you try to comment out the following line:

https://github.com/tensorflow/datasets/blob/30024eefca3aa0783e2374af32766717267335d0/tensorflow_datasets/core/dataset_builder.py#L404

We're using shutil.disk_usage to estimate the available space; this might conflict with Docker for some reason. Commenting the line out will skip the check, so we can be sure this is the cause.

Any new updates on this issue? I have a similar problem:

"OSError: Not enough disk space. Needed: 155.84 GiB (download: Unknown size, generated: 155.84 GiB)"
Filesystem      Size  Used Avail Use% Mounted on
overlay         582G   45G  538G   8% /
tmpfs            64M     0   64M   0% /dev
tmpfs            30G     0   30G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
/dev/xvda1      582G   45G  538G   8% /home
/dev/xvdf       400G  291G  110G  73% /home/tensorflow_datasets