tensorflow / models

Models and examples built with TensorFlow

Trying to train ResNet50 from scratch, documentation is not clear #10258

Closed esparig closed 1 year ago

esparig commented 3 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the documentation with the issue

https://github.com/tensorflow/models/tree/master/official/vision/image_classification

2. Describe the issue

I can't reproduce the examples provided in the documentation. These are the steps I'm following:

a) sudo docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu /bin/bash
b) python3 -m pip install --upgrade pip
c) pip install tf-models-official
d) Download the config files using curl (configs/examples/resnet/imagenet/gpu.yaml and configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml); see the curl sketch after the command below.
e) Execute the code provided:

python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/gpu.yaml \
  --params_override='runtime.num_gpus=$NUM_GPUS'
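For step (d), a minimal sketch of fetching the two config files. The raw GitHub paths below are an assumption (master branch at the time of writing) and may have moved since:

curl --create-dirs -o configs/examples/resnet/imagenet/gpu.yaml \
  https://raw.githubusercontent.com/tensorflow/models/master/official/vision/image_classification/configs/examples/resnet/imagenet/gpu.yaml
curl --create-dirs -o configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml \
  https://raw.githubusercontent.com/tensorflow/models/master/official/vision/image_classification/configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml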

As a result I'm getting:

2021-09-14 19:14:03.666015: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-14 19:14:03.675435: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-14 19:14:03.676603: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0914 19:14:03.682561 140080684554048 classifier_trainer.py:181] Base params: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
 'export': {'checkpoint': None, 'destination': None},
 'mode': None,
 'model': {'learning_rate': {'boundaries': [30, 60, 80],
                             'decay_epochs': None,
                             'decay_rate': None,
                             'examples_per_epoch': 1281167,
                             'initial_lr': 0.1,
                             'multipliers': [0.000390625,
                                             3.90625e-05,
                                             3.90625e-06,
                                             3.90625e-07],
                             'name': 'stepwise',
                             'scale_by_batch_size': 0.00390625,
                             'staircase': None,
                             'warmup_epochs': 5},
           'loss': {'label_smoothing': None,
                    'name': 'sparse_categorical_crossentropy'},
           'model_params': {'batch_size': None,
                            'num_classes': 1000,
                            'rescale_inputs': False,
                            'use_l2_regularizer': True},
           'name': 'ResNet',
           'num_classes': 1000,
           'optimizer': {'beta_1': None,
                         'beta_2': None,
                         'decay': 0.9,
                         'epsilon': 0.001,
                         'lookahead': None,
                         'momentum': 0.9,
                         'moving_average_decay': None,
                         'name': 'momentum',
                         'nesterov': None}},
 'model_dir': None,
 'model_name': None,
 'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': False,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': None,
             'mixed_precision_dtype': None,
             'num_cores_per_replica': 1,
             'num_gpus': 0,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': False,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'train': {'callbacks': {'enable_backup_and_restore': False,
                         'enable_checkpoint_and_export': True,
                         'enable_tensorboard': True,
                         'enable_time_history': True},
           'epochs': 90,
           'metrics': ['accuracy', 'top_5'],
           'resume_checkpoint': True,
           'set_epoch_loop': False,
           'steps': None,
           'tensorboard': {'track_lr': True, 'write_model_weights': False},
           'time_history': {'log_steps': 100}},
 'train_dataset': {'augmenter': {'name': None, 'params': None},
                   'batch_size': 128,
                   'builder': 'records',
                   'cache': False,
                   'data_dir': None,
                   'download': False,
                   'dtype': 'float32',
                   'file_shuffle_buffer_size': 1024,
                   'filenames': None,
                   'image_size': 224,
                   'mean_subtract': True,
                   'name': 'imagenet2012',
                   'num_channels': 3,
                   'num_classes': 1000,
                   'num_devices': 1,
                   'num_examples': 1281167,
                   'one_hot': False,
                   'shuffle_buffer_size': 10000,
                   'skip_decoding': True,
                   'split': 'train',
                   'standardize': True,
                   'tf_data_service': None,
                   'use_per_replica_batch_size': True},
 'validation_dataset': {'augmenter': {'name': None, 'params': None},
                        'batch_size': 128,
                        'builder': 'records',
                        'cache': False,
                        'data_dir': None,
                        'download': False,
                        'dtype': 'float32',
                        'file_shuffle_buffer_size': 1024,
                        'filenames': None,
                        'image_size': 224,
                        'mean_subtract': True,
                        'name': 'imagenet2012',
                        'num_channels': 3,
                        'num_classes': 1000,
                        'num_devices': 1,
                        'num_examples': 1281167,
                        'one_hot': False,
                        'shuffle_buffer_size': 10000,
                        'skip_decoding': True,
                        'split': 'validation',
                        'standardize': True,
                        'tf_data_service': None,
                        'use_per_replica_batch_size': True}}
I0914 19:14:03.683624 140080684554048 classifier_trainer.py:184] Overriding params: configs/examples/resnet/imagenet/gpu.yaml
I0914 19:14:03.690618 140080684554048 classifier_trainer.py:184] Overriding params: runtime.num_gpus=$NUM_GPUS
I0914 19:14:03.691445 140080684554048 classifier_trainer.py:184] Overriding params: {'model_dir': '', 'mode': 'train_and_eval', 'model': {'name': 'resnet'}, 'runtime': {'run_eagerly': None, 'tpu': None}, 'train_dataset': {'data_dir': ''}, 'validation_dataset': {'data_dir': ''}, 'train': {'time_history': {'log_steps': 100}}}
I0914 19:14:03.693601 140080684554048 classifier_trainer.py:190] Final model parameters: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
 'export': {'checkpoint': None, 'destination': None},
 'mode': 'train_and_eval',
 'model': {'learning_rate': {'boundaries': [30, 60, 80],
                             'decay_epochs': None,
                             'decay_rate': None,
                             'examples_per_epoch': 1281167,
                             'initial_lr': 0.1,
                             'multipliers': [0.000390625,
                                             3.90625e-05,
                                             3.90625e-06,
                                             3.90625e-07],
                             'name': 'stepwise',
                             'scale_by_batch_size': 0.00390625,
                             'staircase': None,
                             'warmup_epochs': 5},
           'loss': {'label_smoothing': 0.1,
                    'name': 'sparse_categorical_crossentropy'},
           'model_params': {'batch_size': None,
                            'num_classes': 1000,
                            'rescale_inputs': False,
                            'use_l2_regularizer': True},
           'name': 'resnet',
           'num_classes': 1000,
           'optimizer': {'beta_1': None,
                         'beta_2': None,
                         'decay': 0.9,
                         'epsilon': 0.001,
                         'lookahead': None,
                         'momentum': 0.9,
                         'moving_average_decay': None,
                         'name': 'momentum',
                         'nesterov': None}},
 'model_dir': '',
 'model_name': None,
 'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': True,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': None,
             'mixed_precision_dtype': None,
             'num_cores_per_replica': 1,
             'num_gpus': '$NUM_GPUS',
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': None,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'train': {'callbacks': {'enable_backup_and_restore': False,
                         'enable_checkpoint_and_export': True,
                         'enable_tensorboard': True,
                         'enable_time_history': True},
           'epochs': 90,
           'metrics': ['accuracy', 'top_5'],
           'resume_checkpoint': True,
           'set_epoch_loop': False,
           'steps': None,
           'tensorboard': {'track_lr': True, 'write_model_weights': False},
           'time_history': {'log_steps': 100}},
 'train_dataset': {'augmenter': {'name': None, 'params': None},
                   'batch_size': 256,
                   'builder': 'tfds',
                   'cache': False,
                   'data_dir': '',
                   'download': False,
                   'dtype': 'float16',
                   'file_shuffle_buffer_size': 1024,
                   'filenames': None,
                   'image_size': 224,
                   'mean_subtract': True,
                   'name': 'imagenet2012',
                   'num_channels': 3,
                   'num_classes': 1000,
                   'num_devices': 1,
                   'num_examples': 1281167,
                   'one_hot': False,
                   'shuffle_buffer_size': 10000,
                   'skip_decoding': True,
                   'split': 'train',
                   'standardize': True,
                   'tf_data_service': None,
                   'use_per_replica_batch_size': True},
 'validation_dataset': {'augmenter': {'name': None, 'params': None},
                        'batch_size': 256,
                        'builder': 'tfds',
                        'cache': False,
                        'data_dir': '',
                        'download': False,
                        'dtype': 'float16',
                        'file_shuffle_buffer_size': 1024,
                        'filenames': None,
                        'image_size': 224,
                        'mean_subtract': True,
                        'name': 'imagenet2012',
                        'num_channels': 3,
                        'num_classes': 1000,
                        'num_devices': 1,
                        'num_examples': 50000,
                        'one_hot': False,
                        'shuffle_buffer_size': 10000,
                        'skip_decoding': True,
                        'split': 'validation',
                        'standardize': True,
                        'tf_data_service': None,
                        'use_per_replica_batch_size': True}}
I0914 19:14:03.694338 140080684554048 classifier_trainer.py:290] Running train and eval.
Traceback (most recent call last):
  File "classifier_trainer.py", line 456, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "classifier_trainer.py", line 443, in main
    stats = run(flags.FLAGS)
  File "classifier_trainer.py", line 435, in run
    return train_and_eval(params, strategy_override)
  File "classifier_trainer.py", line 300, in train_and_eval
    tpu_address=params.runtime.tpu)
  File "/usr/local/lib/python3.6/dist-packages/official/common/distribute_utils.py", line 129, in get_distribution_strategy
    if num_gpus < 0:
TypeError: '<' not supported between instances of 'str' and 'int'
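The TypeError is a shell-quoting problem rather than a trainer bug: with single quotes, $NUM_GPUS is not expanded, so the literal string '$NUM_GPUS' ends up in runtime.num_gpus (visible in the final parameters above) and the integer comparison in get_distribution_strategy fails. A minimal sketch of the intended invocation, assuming NUM_GPUS is set in the environment:

export NUM_GPUS=1
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/gpu.yaml \
  --params_override="runtime.num_gpus=$NUM_GPUS"  # double quotes (or no quotes) so the shell expands the variable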

Changing the command to:

python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/gpu.yaml \
  --params_override='runtime.num_gpus=1'

I'm getting:

2021-09-14 19:16:45.876311: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-14 19:16:45.877754: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0914 19:16:45.886160 139841922975552 classifier_trainer.py:181] Base params: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
 'export': {'checkpoint': None, 'destination': None},
 'mode': None,
 'model': {'learning_rate': {'boundaries': [30, 60, 80],
                             'decay_epochs': None,
                             'decay_rate': None,
                             'examples_per_epoch': 1281167,
                             'initial_lr': 0.1,
                             'multipliers': [0.000390625,
                                             3.90625e-05,
                                             3.90625e-06,
                                             3.90625e-07],
                             'name': 'stepwise',
                             'scale_by_batch_size': 0.00390625,
                             'staircase': None,
                             'warmup_epochs': 5},
           'loss': {'label_smoothing': None,
                    'name': 'sparse_categorical_crossentropy'},
           'model_params': {'batch_size': None,
                            'num_classes': 1000,
                            'rescale_inputs': False,
                            'use_l2_regularizer': True},
           'name': 'ResNet',
           'num_classes': 1000,
           'optimizer': {'beta_1': None,
                         'beta_2': None,
                         'decay': 0.9,
                         'epsilon': 0.001,
                         'lookahead': None,
                         'momentum': 0.9,
                         'moving_average_decay': None,
                         'name': 'momentum',
                         'nesterov': None}},
 'model_dir': None,
 'model_name': None,
 'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': False,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': None,
             'mixed_precision_dtype': None,
             'num_cores_per_replica': 1,
             'num_gpus': 0,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': False,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'train': {'callbacks': {'enable_backup_and_restore': False,
                         'enable_checkpoint_and_export': True,
                         'enable_tensorboard': True,
                         'enable_time_history': True},
           'epochs': 90,
           'metrics': ['accuracy', 'top_5'],
           'resume_checkpoint': True,
           'set_epoch_loop': False,
           'steps': None,
           'tensorboard': {'track_lr': True, 'write_model_weights': False},
           'time_history': {'log_steps': 100}},
 'train_dataset': {'augmenter': {'name': None, 'params': None},
                   'batch_size': 128,
                   'builder': 'records',
                   'cache': False,
                   'data_dir': None,
                   'download': False,
                   'dtype': 'float32',
                   'file_shuffle_buffer_size': 1024,
                   'filenames': None,
                   'image_size': 224,
                   'mean_subtract': True,
                   'name': 'imagenet2012',
                   'num_channels': 3,
                   'num_classes': 1000,
                   'num_devices': 1,
                   'num_examples': 1281167,
                   'one_hot': False,
                   'shuffle_buffer_size': 10000,
                   'skip_decoding': True,
                   'split': 'train',
                   'standardize': True,
                   'tf_data_service': None,
                   'use_per_replica_batch_size': True},
 'validation_dataset': {'augmenter': {'name': None, 'params': None},
                        'batch_size': 128,
                        'builder': 'records',
                        'cache': False,
                        'data_dir': None,
                        'download': False,
                        'dtype': 'float32',
                        'file_shuffle_buffer_size': 1024,
                        'filenames': None,
                        'image_size': 224,
                        'mean_subtract': True,
                        'name': 'imagenet2012',
                        'num_channels': 3,
                        'num_classes': 1000,
                        'num_devices': 1,
                        'num_examples': 1281167,
                        'one_hot': False,
                        'shuffle_buffer_size': 10000,
                        'skip_decoding': True,
                        'split': 'validation',
                        'standardize': True,
                        'tf_data_service': None,
                        'use_per_replica_batch_size': True}}
I0914 19:16:45.887840 139841922975552 classifier_trainer.py:184] Overriding params: configs/examples/resnet/imagenet/gpu.yaml
I0914 19:16:45.897558 139841922975552 classifier_trainer.py:184] Overriding params: runtime.num_gpus=1
I0914 19:16:45.898580 139841922975552 classifier_trainer.py:184] Overriding params: {'model_dir': '', 'mode': 'train_and_eval', 'model': {'name': 'resnet'}, 'runtime': {'run_eagerly': None, 'tpu': None}, 'train_dataset': {'data_dir': ''}, 'validation_dataset': {'data_dir': ''}, 'train': {'time_history': {'log_steps': 100}}}
I0914 19:16:45.901514 139841922975552 classifier_trainer.py:190] Final model parameters: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
 'export': {'checkpoint': None, 'destination': None},
 'mode': 'train_and_eval',
 'model': {'learning_rate': {'boundaries': [30, 60, 80],
                             'decay_epochs': None,
                             'decay_rate': None,
                             'examples_per_epoch': 1281167,
                             'initial_lr': 0.1,
                             'multipliers': [0.000390625,
                                             3.90625e-05,
                                             3.90625e-06,
                                             3.90625e-07],
                             'name': 'stepwise',
                             'scale_by_batch_size': 0.00390625,
                             'staircase': None,
                             'warmup_epochs': 5},
           'loss': {'label_smoothing': 0.1,
                    'name': 'sparse_categorical_crossentropy'},
           'model_params': {'batch_size': None,
                            'num_classes': 1000,
                            'rescale_inputs': False,
                            'use_l2_regularizer': True},
           'name': 'resnet',
           'num_classes': 1000,
           'optimizer': {'beta_1': None,
                         'beta_2': None,
                         'decay': 0.9,
                         'epsilon': 0.001,
                         'lookahead': None,
                         'momentum': 0.9,
                         'moving_average_decay': None,
                         'name': 'momentum',
                         'nesterov': None}},
 'model_dir': '',
 'model_name': None,
 'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': True,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': None,
             'mixed_precision_dtype': None,
             'num_cores_per_replica': 1,
             'num_gpus': 1,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': None,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'train': {'callbacks': {'enable_backup_and_restore': False,
                         'enable_checkpoint_and_export': True,
                         'enable_tensorboard': True,
                         'enable_time_history': True},
           'epochs': 90,
           'metrics': ['accuracy', 'top_5'],
           'resume_checkpoint': True,
           'set_epoch_loop': False,
           'steps': None,
           'tensorboard': {'track_lr': True, 'write_model_weights': False},
           'time_history': {'log_steps': 100}},
 'train_dataset': {'augmenter': {'name': None, 'params': None},
                   'batch_size': 256,
                   'builder': 'tfds',
                   'cache': False,
                   'data_dir': '',
                   'download': False,
                   'dtype': 'float16',
                   'file_shuffle_buffer_size': 1024,
                   'filenames': None,
                   'image_size': 224,
                   'mean_subtract': True,
                   'name': 'imagenet2012',
                   'num_channels': 3,
                   'num_classes': 1000,
                   'num_devices': 1,
                   'num_examples': 1281167,
                   'one_hot': False,
                   'shuffle_buffer_size': 10000,
                   'skip_decoding': True,
                   'split': 'train',
                   'standardize': True,
                   'tf_data_service': None,
                   'use_per_replica_batch_size': True},
 'validation_dataset': {'augmenter': {'name': None, 'params': None},
                        'batch_size': 256,
                        'builder': 'tfds',
                        'cache': False,
                        'data_dir': '',
                        'download': False,
                        'dtype': 'float16',
                        'file_shuffle_buffer_size': 1024,
                        'filenames': None,
                        'image_size': 224,
                        'mean_subtract': True,
                        'name': 'imagenet2012',
                        'num_channels': 3,
                        'num_classes': 1000,
                        'num_devices': 1,
                        'num_examples': 50000,
                        'one_hot': False,
                        'shuffle_buffer_size': 10000,
                        'skip_decoding': True,
                        'split': 'validation',
                        'standardize': True,
                        'tf_data_service': None,
                        'use_per_replica_batch_size': True}}
I0914 19:16:45.901775 139841922975552 classifier_trainer.py:290] Running train and eval.
2021-09-14 19:16:45.903073: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-14 19:16:45.903737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-14 19:16:45.905226: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-14 19:16:45.906560: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-14 19:16:47.033764: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-14 19:16:47.034708: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-14 19:16:47.035731: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-14 19:16:47.036855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30995 MB memory:  -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0914 19:16:47.995822 139841922975552 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0914 19:16:47.997327 139841922975552 classifier_trainer.py:305] Detected 1 devices.
W0914 19:16:47.997431 139841922975552 classifier_trainer.py:105] label_smoothing > 0, so datasets will be one hot encoded.
I0914 19:16:47.997728 139841922975552 dataset_factory.py:176] Using augmentation: None
I0914 19:16:47.998001 139841922975552 dataset_factory.py:176] Using augmentation: None
I0914 19:16:47.998326 139841922975552 dataset_factory.py:341] Using TFDS to load data.
2021-09-14 19:16:48.004060: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Failed precondition: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
I0914 19:16:48.669551 139841922975552 dataset_info.py:443] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.1.0
I0914 19:16:49.586393 139841922975552 dataset_info.py:358] Load dataset info from /tmp/tmp1h_9dllttfds
I0914 19:16:49.595921 139841922975552 dataset_info.py:413] Field info.description from disk and from code do not match. Keeping the one from code.
I0914 19:16:49.596311 139841922975552 dataset_info.py:413] Field info.module_name from disk and from code do not match. Keeping the one from code.
I0914 19:16:49.596735 139841922975552 logging_logger.py:36] Constructing tf.data.Dataset imagenet2012 for split train, from /root/tensorflow_datasets/imagenet2012/5.1.0
Traceback (most recent call last):
  File "classifier_trainer.py", line 456, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "classifier_trainer.py", line 443, in main
    stats = run(flags.FLAGS)
  File "classifier_trainer.py", line 435, in run
    return train_and_eval(params, strategy_override)
  File "classifier_trainer.py", line 312, in train_and_eval
    builder.build(strategy) if builder else None for builder in builders
  File "classifier_trainer.py", line 312, in <listcomp>
    builder.build(strategy) if builder else None for builder in builders
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 302, in build
    dataset = strategy.distribute_datasets_from_function(self._build)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1161, in distribute_datasets_from_function
    dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 589, in _distribute_datasets_from_function
    options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 169, in get_distributed_datasets_from_function
    input_contexts, dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1579, in __init__
    input_contexts, self._input_workers, dataset_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 2327, in _create_datasets_from_function_with_input_context
    dataset = dataset_fn(ctx)
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 333, in _build
    dataset = builder()
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 363, in load_tfds
    read_config=read_config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/logging/__init__.py", line 81, in decorator
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 546, in as_dataset
    (self.name, self._data_dir_root))
AssertionError: Dataset imagenet2012: could not find data in /root/tensorflow_datasets. Please make sure to call dataset_builder.download_and_prepare(), or pass download=True to tfds.load() before trying to access the tf.data.Dataset object.
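This assertion comes from TFDS rather than the trainer: gpu.yaml switches the dataset builder to 'tfds', and with an empty data_dir the loader falls back to the default TFDS location inside the container, where no prepared imagenet2012 exists. A rough sketch of where it looks (shard naming is an assumption based on the usual TFDS layout):

# default TFDS data dir inside the container (or $TFDS_DATA_DIR, if that is set)
ls /root/tensorflow_datasets/imagenet2012/5.1.0/
# a prepared dataset would contain dataset_info.json plus TFDS-named shards such as
#   imagenet2012-train.tfrecord-00000-of-NNNNN and imagenet2012-validation.tfrecord-...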

I don't know how to fix this; I think the documentation is not very clear. Please help. I wouldn't mind using the new code base in beta, but there is even less documentation for it.

esparig commented 3 years ago

Similarly for EfficientNet:

root@745616268d5f:/usr/local/lib/python3.6/dist-packages/official/vision/image_classification# python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=efficientnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml \
  --params_override='runtime.num_gpus=1'
2021-09-15 06:42:04.619103: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:04.629605: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:04.631699: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0915 06:42:04.638748 140551238154048 classifier_trainer.py:181] Base params: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
 'export': {'checkpoint': None, 'destination': None},
 'mode': None,
 'model': {'learning_rate': {'boundaries': None,
                             'decay_epochs': 2.4,
                             'decay_rate': 0.97,
                             'examples_per_epoch': None,
                             'initial_lr': 0.008,
                             'multipliers': None,
                             'name': 'exponential',
                             'scale_by_batch_size': 0.0078125,
                             'staircase': True,
                             'warmup_epochs': 5},
           'loss': {'label_smoothing': 0.1, 'name': 'categorical_crossentropy'},
           'model_params': {'model_name': 'efficientnet-b0',
                            'model_weights_path': '',
                            'overrides': {'activation': 'swish',
                                          'batch_norm': 'default',
                                          'dtype': 'float32',
                                          'num_classes': 1000,
                                          'rescale_input': True},
                            'weights_format': 'saved_model'},
           'name': 'EfficientNet',
           'num_classes': 1000,
           'optimizer': {'beta_1': None,
                         'beta_2': None,
                         'decay': 0.9,
                         'epsilon': 0.001,
                         'lookahead': None,
                         'momentum': 0.9,
                         'moving_average_decay': None,
                         'name': 'rmsprop',
                         'nesterov': None}},
 'model_dir': None,
 'model_name': None,
 'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': False,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': None,
             'mixed_precision_dtype': None,
             'num_cores_per_replica': 1,
             'num_gpus': 0,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': False,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'train': {'callbacks': {'enable_backup_and_restore': False,
                         'enable_checkpoint_and_export': True,
                         'enable_tensorboard': True,
                         'enable_time_history': True},
           'epochs': 500,
           'metrics': ['accuracy', 'top_5'],
           'resume_checkpoint': True,
           'set_epoch_loop': False,
           'steps': None,
           'tensorboard': {'track_lr': True, 'write_model_weights': False},
           'time_history': {'log_steps': 100}},
 'train_dataset': {'augmenter': {'name': None, 'params': None},
                   'batch_size': 128,
                   'builder': 'records',
                   'cache': False,
                   'data_dir': None,
                   'download': False,
                   'dtype': 'float32',
                   'file_shuffle_buffer_size': 1024,
                   'filenames': None,
                   'image_size': 224,
                   'mean_subtract': False,
                   'name': 'imagenet2012',
                   'num_channels': 3,
                   'num_classes': 1000,
                   'num_devices': 1,
                   'num_examples': 1281167,
                   'one_hot': True,
                   'shuffle_buffer_size': 10000,
                   'skip_decoding': True,
                   'split': 'train',
                   'standardize': False,
                   'tf_data_service': None,
                   'use_per_replica_batch_size': True},
 'validation_dataset': {'augmenter': {'name': None, 'params': None},
                        'batch_size': 128,
                        'builder': 'records',
                        'cache': False,
                        'data_dir': None,
                        'download': False,
                        'dtype': 'float32',
                        'file_shuffle_buffer_size': 1024,
                        'filenames': None,
                        'image_size': 224,
                        'mean_subtract': False,
                        'name': 'imagenet2012',
                        'num_channels': 3,
                        'num_classes': 1000,
                        'num_devices': 1,
                        'num_examples': 1281167,
                        'one_hot': True,
                        'shuffle_buffer_size': 10000,
                        'skip_decoding': True,
                        'split': 'validation',
                        'standardize': False,
                        'tf_data_service': None,
                        'use_per_replica_batch_size': True}}
I0915 06:42:04.642026 140551238154048 classifier_trainer.py:184] Overriding params: configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml
I0915 06:42:04.653718 140551238154048 classifier_trainer.py:184] Overriding params: runtime.num_gpus=1
I0915 06:42:04.654549 140551238154048 classifier_trainer.py:184] Overriding params: {'model_dir': '', 'mode': 'train_and_eval', 'model': {'name': 'efficientnet'}, 'runtime': {'run_eagerly': None, 'tpu': None}, 'train_dataset': {'data_dir': ''}, 'validation_dataset': {'data_dir': ''}, 'train': {'time_history': {'log_steps': 100}}}
I0915 06:42:04.656747 140551238154048 classifier_trainer.py:190] Final model parameters: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
 'export': {'checkpoint': None, 'destination': None},
 'mode': 'train_and_eval',
 'model': {'learning_rate': {'boundaries': None,
                             'decay_epochs': 2.4,
                             'decay_rate': 0.97,
                             'examples_per_epoch': None,
                             'initial_lr': 0.008,
                             'multipliers': None,
                             'name': 'exponential',
                             'scale_by_batch_size': 0.0078125,
                             'staircase': True,
                             'warmup_epochs': 5},
           'loss': {'label_smoothing': 0.1, 'name': 'categorical_crossentropy'},
           'model_params': {'model_name': 'efficientnet-b0',
                            'model_weights_path': '',
                            'overrides': {'activation': 'swish',
                                          'batch_norm': 'default',
                                          'dtype': 'float32',
                                          'num_classes': 1000,
                                          'rescale_input': True},
                            'weights_format': 'saved_model'},
           'name': 'efficientnet',
           'num_classes': 1000,
           'optimizer': {'beta_1': None,
                         'beta_2': None,
                         'decay': 0.9,
                         'epsilon': 0.001,
                         'lookahead': False,
                         'momentum': 0.9,
                         'moving_average_decay': 0.0,
                         'name': 'rmsprop',
                         'nesterov': None}},
 'model_dir': '',
 'model_name': None,
 'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': False,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': None,
             'mixed_precision_dtype': None,
             'num_cores_per_replica': 1,
             'num_gpus': 1,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': None,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'train': {'callbacks': {'enable_backup_and_restore': False,
                         'enable_checkpoint_and_export': True,
                         'enable_tensorboard': True,
                         'enable_time_history': True},
           'epochs': 500,
           'metrics': ['accuracy', 'top_5'],
           'resume_checkpoint': True,
           'set_epoch_loop': False,
           'steps': None,
           'tensorboard': {'track_lr': True, 'write_model_weights': False},
           'time_history': {'log_steps': 100}},
 'train_dataset': {'augmenter': {'name': 'autoaugment', 'params': None},
                   'batch_size': 32,
                   'builder': 'records',
                   'cache': False,
                   'data_dir': '',
                   'download': False,
                   'dtype': 'float32',
                   'file_shuffle_buffer_size': 1024,
                   'filenames': None,
                   'image_size': 224,
                   'mean_subtract': False,
                   'name': 'imagenet2012',
                   'num_channels': 3,
                   'num_classes': 1000,
                   'num_devices': 1,
                   'num_examples': 1281167,
                   'one_hot': True,
                   'shuffle_buffer_size': 10000,
                   'skip_decoding': True,
                   'split': 'train',
                   'standardize': False,
                   'tf_data_service': None,
                   'use_per_replica_batch_size': True},
 'validation_dataset': {'augmenter': {'name': None, 'params': None},
                        'batch_size': 32,
                        'builder': 'records',
                        'cache': False,
                        'data_dir': '',
                        'download': False,
                        'dtype': 'float32',
                        'file_shuffle_buffer_size': 1024,
                        'filenames': None,
                        'image_size': 224,
                        'mean_subtract': False,
                        'name': 'imagenet2012',
                        'num_channels': 3,
                        'num_classes': 1000,
                        'num_devices': 1,
                        'num_examples': 50000,
                        'one_hot': True,
                        'shuffle_buffer_size': 10000,
                        'skip_decoding': True,
                        'split': 'validation',
                        'standardize': False,
                        'tf_data_service': None,
                        'use_per_replica_batch_size': True}}
I0915 06:42:04.657643 140551238154048 classifier_trainer.py:290] Running train and eval.
2021-09-15 06:42:04.658841: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-15 06:42:04.659902: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:04.661132: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:04.662254: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:05.669043: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:05.670517: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:05.671733: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:05.672929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30995 MB memory:  -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0915 06:42:06.720428 140551238154048 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0915 06:42:06.722068 140551238154048 classifier_trainer.py:305] Detected 1 devices.
W0915 06:42:06.722257 140551238154048 classifier_trainer.py:105] label_smoothing > 0, so datasets will be one hot encoded.
I0915 06:42:06.722772 140551238154048 dataset_factory.py:176] Using augmentation: autoaugment
I0915 06:42:06.723030 140551238154048 dataset_factory.py:176] Using augmentation: None
I0915 06:42:06.723595 140551238154048 dataset_factory.py:369] Using TFRecords to load data.
Traceback (most recent call last):
  File "classifier_trainer.py", line 456, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "classifier_trainer.py", line 443, in main
    stats = run(flags.FLAGS)
  File "classifier_trainer.py", line 435, in run
    return train_and_eval(params, strategy_override)
  File "classifier_trainer.py", line 312, in train_and_eval
    builder.build(strategy) if builder else None for builder in builders
  File "classifier_trainer.py", line 312, in <listcomp>
    builder.build(strategy) if builder else None for builder in builders
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 302, in build
    dataset = strategy.distribute_datasets_from_function(self._build)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1161, in distribute_datasets_from_function
    dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 589, in _distribute_datasets_from_function
    options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 169, in get_distributed_datasets_from_function
    input_contexts, dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1579, in __init__
    input_contexts, self._input_workers, dataset_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 2327, in _create_datasets_from_function_with_input_context
    dataset = dataset_fn(ctx)
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 333, in _build
    dataset = builder()
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 376, in load_records
    dataset = tf.data.Dataset.list_files(file_pattern, shuffle=False)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 1230, in list_files
    condition, [message], summarize=1, name="assert_not_empty")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/tf_should_use.py", line 247, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 164, in Assert
    (condition, "\n".join(data_str)))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected 'tf.Tensor(False, shape=(), dtype=bool)' to be true. Summarized data: b'No files matched pattern: train*'
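Here the failure mode is different: efficientnet-b0-gpu.yaml keeps builder: 'records', which globs data_dir for TFRecord files whose names start with the split name, and since data_dir is still empty nothing matches 'train*'. A sketch of the layout it expects (an assumption based on the usual ImageNet TFRecord naming):

# $DATA_DIR should directly contain sharded TFRecords named after the splits, e.g.
#   $DATA_DIR/train-00000-of-01024
#   $DATA_DIR/train-00001-of-01024
#   ...
#   $DATA_DIR/validation-00000-of-00128
ls $DATA_DIR/train-* | head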
saberkun commented 3 years ago

Do you have the ImageNet dataset processed and ready? The tutorials don't include pre-processed ImageNet TFRecords because of the licensing issue.

esparig commented 3 years ago

Yes, I figured out that I needed the TFRecords. This is what I did:

Dataset preparation

Download ImageNet 2012: https://image-net.org/

export IMAGENET_HOME=<my_imagenet_folder>

Set up folders

mkdir -p $IMAGENET_HOME/validation
mkdir -p $IMAGENET_HOME/train

Extract validation and training

tar xf ILSVRC2012_img_val.tar -C $IMAGENET_HOME/validation
tar xf ILSVRC2012_img_train.tar -C $IMAGENET_HOME/train

cd $IMAGENET_HOME/train

for f in *.tar; do
  d=`basename $f .tar`
  mkdir $d
  tar xf $f -C $d
  rm $f #removes previously extracted tar
done
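
As a quick sanity check (not part of the original steps), the train directory should now contain one folder per class:

ls $IMAGENET_HOME/train | wc -l   # expect 1000 synset directories for the full ILSVRC2012 train set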

Download labels file

wget -O $IMAGENET_HOME/synset_labels.txt \
  https://raw.githubusercontent.com/tensorflow/models/c7df5a3dde886509fbd1c7b317f76fb876f23506/research/inception/inception/data/imagenet_2012_validation_synset_labels.txt

Download imagenet_to_gcs script

wget -O $IMAGENET_HOME/imagenet_to_gcs.py \
  https://github.com/tensorflow/tpu/blob/8cca0ff35e1d8c6fcd1dfac98978495ff2cadb84/tools/datasets/imagenet_to_gcs.py

Run the TensorFlow docker container

sudo docker run -v $IMAGENET_HOME:/data/imagenet --gpus all -it --rm tensorflow/tensorflow:latest-gpu /bin/bash

Run inside the container

python3 -m pip install --upgrade pip && \
  pip install tf-models-official && \
  pip install gcloud google-cloud-storage

Set the ImageNet folder

export IMAGENET_HOME=<my_imagenet_folder>

Process the files

Remember to get the script from GitHub first. The TFRecords will end up in the --local_scratch_dir. To upload to GCS with this method, leave off --nogcs_upload and provide the GCS flags for project and output_path.

python3 imagenet_to_gcs.py \
  --raw_data_dir=$IMAGENET_HOME \
  --local_scratch_dir=$IMAGENET_HOME/tfrecord \
  --nogcs_upload

So far I have train and validation folders containing the TFRecords in /data/imagenet/imagenet2012/5.1.0/.
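
One caveat worth flagging (an observation, not something the docs spell out): gpu.yaml sets the dataset builder to 'tfds', which expects a dataset prepared by TFDS itself (dataset_info.json plus TFDS-named shards), while imagenet_to_gcs.py produces plain train-*/validation-* TFRecords meant for the 'records' builder. A possible workaround, untested and assuming the override parser accepts comma-separated key=value pairs, is to switch both builders back to 'records':

# DATA_DIR should point at the directory that directly contains the train-*/validation-* shards
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/gpu.yaml \
  --params_override='runtime.num_gpus=1,train_dataset.builder=records,validation_dataset.builder=records'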

Trying to run ResNet50

Using classifier_trainer.py I get the following error now:

python3 /usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=/data/imagenet \
  --config_file=/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/configs/examples/resnet/imagenet/gpu.yaml \
  --params_override='runtime.num_gpus=1'
2021-09-28 11:25:20.797826: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:20.810368: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:20.811266: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0928 11:25:20.815533 140110323226432 classifier_trainer.py:181] Base params: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
 'export': {'checkpoint': None, 'destination': None},
 'mode': None,
 'model': {'learning_rate': {'boundaries': [30, 60, 80],
                             'decay_epochs': None,
                             'decay_rate': None,
                             'examples_per_epoch': 1281167,
                             'initial_lr': 0.1,
                             'multipliers': [0.000390625,
                                             3.90625e-05,
                                             3.90625e-06,
                                             3.90625e-07],
                             'name': 'stepwise',
                             'scale_by_batch_size': 0.00390625,
                             'staircase': None,
                             'warmup_epochs': 5},
           'loss': {'label_smoothing': None,
                    'name': 'sparse_categorical_crossentropy'},
           'model_params': {'batch_size': None,
                            'num_classes': 1000,
                            'rescale_inputs': False,
                            'use_l2_regularizer': True},
           'name': 'ResNet',
           'num_classes': 1000,
           'optimizer': {'beta_1': None,
                         'beta_2': None,
                         'decay': 0.9,
                         'epsilon': 0.001,
                         'lookahead': None,
                         'momentum': 0.9,
                         'moving_average_decay': None,
                         'name': 'momentum',
                         'nesterov': None}},
 'model_dir': None,
 'model_name': None,
 'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': False,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': None,
             'mixed_precision_dtype': None,
             'num_cores_per_replica': 1,
             'num_gpus': 0,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': False,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'train': {'callbacks': {'enable_backup_and_restore': False,
                         'enable_checkpoint_and_export': True,
                         'enable_tensorboard': True,
                         'enable_time_history': True},
           'epochs': 90,
           'metrics': ['accuracy', 'top_5'],
           'resume_checkpoint': True,
           'set_epoch_loop': False,
           'steps': None,
           'tensorboard': {'track_lr': True, 'write_model_weights': False},
           'time_history': {'log_steps': 100}},
 'train_dataset': {'augmenter': {'name': None, 'params': None},
                   'batch_size': 128,
                   'builder': 'records',
                   'cache': False,
                   'data_dir': None,
                   'download': False,
                   'dtype': 'float32',
                   'file_shuffle_buffer_size': 1024,
                   'filenames': None,
                   'image_size': 224,
                   'mean_subtract': True,
                   'name': 'imagenet2012',
                   'num_channels': 3,
                   'num_classes': 1000,
                   'num_devices': 1,
                   'num_examples': 1281167,
                   'one_hot': False,
                   'shuffle_buffer_size': 10000,
                   'skip_decoding': True,
                   'split': 'train',
                   'standardize': True,
                   'tf_data_service': None,
                   'use_per_replica_batch_size': True},
 'validation_dataset': {'augmenter': {'name': None, 'params': None},
                        'batch_size': 128,
                        'builder': 'records',
                        'cache': False,
                        'data_dir': None,
                        'download': False,
                        'dtype': 'float32',
                        'file_shuffle_buffer_size': 1024,
                        'filenames': None,
                        'image_size': 224,
                        'mean_subtract': True,
                        'name': 'imagenet2012',
                        'num_channels': 3,
                        'num_classes': 1000,
                        'num_devices': 1,
                        'num_examples': 1281167,
                        'one_hot': False,
                        'shuffle_buffer_size': 10000,
                        'skip_decoding': True,
                        'split': 'validation',
                        'standardize': True,
                        'tf_data_service': None,
                        'use_per_replica_batch_size': True}}
I0928 11:25:20.816565 140110323226432 classifier_trainer.py:184] Overriding params: /usr/local/lib/python3.6/dist-packages/official/vision/image_classification/configs/examples/resnet/imagenet/gpu.yaml
I0928 11:25:20.822834 140110323226432 classifier_trainer.py:184] Overriding params: runtime.num_gpus=1
I0928 11:25:20.823509 140110323226432 classifier_trainer.py:184] Overriding params: {'model_dir': '/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/resnet', 'mode': 'train_and_eval', 'model': {'name': 'resnet'}, 'runtime': {'run_eagerly': None, 'tpu': None}, 'train_dataset': {'data_dir': '/data/imagenet'}, 'validation_dataset': {'data_dir': '/data/imagenet'}, 'train': {'time_history': {'log_steps': 100}}}
I0928 11:25:20.825434 140110323226432 classifier_trainer.py:190] Final model parameters: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
 'export': {'checkpoint': None, 'destination': None},
 'mode': 'train_and_eval',
 'model': {'learning_rate': {'boundaries': [30, 60, 80],
                             'decay_epochs': None,
                             'decay_rate': None,
                             'examples_per_epoch': 1281167,
                             'initial_lr': 0.1,
                             'multipliers': [0.000390625,
                                             3.90625e-05,
                                             3.90625e-06,
                                             3.90625e-07],
                             'name': 'stepwise',
                             'scale_by_batch_size': 0.00390625,
                             'staircase': None,
                             'warmup_epochs': 5},
           'loss': {'label_smoothing': 0.1,
                    'name': 'sparse_categorical_crossentropy'},
           'model_params': {'batch_size': None,
                            'num_classes': 1000,
                            'rescale_inputs': False,
                            'use_l2_regularizer': True},
           'name': 'resnet',
           'num_classes': 1000,
           'optimizer': {'beta_1': None,
                         'beta_2': None,
                         'decay': 0.9,
                         'epsilon': 0.001,
                         'lookahead': None,
                         'momentum': 0.9,
                         'moving_average_decay': None,
                         'name': 'momentum',
                         'nesterov': None}},
 'model_dir': '/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/resnet',
 'model_name': None,
 'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': True,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': None,
             'mixed_precision_dtype': None,
             'num_cores_per_replica': 1,
             'num_gpus': 1,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': None,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'train': {'callbacks': {'enable_backup_and_restore': False,
                         'enable_checkpoint_and_export': True,
                         'enable_tensorboard': True,
                         'enable_time_history': True},
           'epochs': 1,
           'metrics': ['accuracy', 'top_5'],
           'resume_checkpoint': True,
           'set_epoch_loop': False,
           'steps': None,
           'tensorboard': {'track_lr': True, 'write_model_weights': False},
           'time_history': {'log_steps': 100}},
 'train_dataset': {'augmenter': {'name': None, 'params': None},
                   'batch_size': 256,
                   'builder': 'tfds',
                   'cache': False,
                   'data_dir': '/data/imagenet',
                   'download': False,
                   'dtype': 'float16',
                   'file_shuffle_buffer_size': 1024,
                   'filenames': None,
                   'image_size': 224,
                   'mean_subtract': True,
                   'name': 'imagenet2012',
                   'num_channels': 3,
                   'num_classes': 1000,
                   'num_devices': 1,
                   'num_examples': 1281167,
                   'one_hot': False,
                   'shuffle_buffer_size': 10000,
                   'skip_decoding': True,
                   'split': 'train',
                   'standardize': True,
                   'tf_data_service': None,
                   'use_per_replica_batch_size': True},
 'validation_dataset': {'augmenter': {'name': None, 'params': None},
                        'batch_size': 256,
                        'builder': 'tfds',
                        'cache': False,
                        'data_dir': '/data/imagenet',
                        'download': False,
                        'dtype': 'float16',
                        'file_shuffle_buffer_size': 1024,
                        'filenames': None,
                        'image_size': 224,
                        'mean_subtract': True,
                        'name': 'imagenet2012',
                        'num_channels': 3,
                        'num_classes': 1000,
                        'num_devices': 1,
                        'num_examples': 50000,
                        'one_hot': False,
                        'shuffle_buffer_size': 10000,
                        'skip_decoding': True,
                        'split': 'validation',
                        'standardize': True,
                        'tf_data_service': None,
                        'use_per_replica_batch_size': True}}
I0928 11:25:20.825589 140110323226432 classifier_trainer.py:290] Running train and eval.
2021-09-28 11:25:20.826318: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-28 11:25:20.826680: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:20.827574: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:20.828412: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:21.797719: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:21.798760: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:21.799737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:21.800690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30995 MB memory:  -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0928 11:25:22.858247 140110323226432 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0928 11:25:22.859499 140110323226432 classifier_trainer.py:305] Detected 1 devices.
W0928 11:25:22.859601 140110323226432 classifier_trainer.py:105] label_smoothing > 0, so datasets will be one hot encoded.
I0928 11:25:22.859818 140110323226432 dataset_factory.py:176] Using augmentation: None
I0928 11:25:22.859973 140110323226432 dataset_factory.py:176] Using augmentation: None
I0928 11:25:22.860244 140110323226432 dataset_factory.py:341] Using TFDS to load data.
I0928 11:25:22.863063 140110323226432 dataset_info.py:358] Load dataset info from /data/imagenet/imagenet2012/5.1.0
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/utils/py_utils.py", line 392, in try_reraise
    yield
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/load.py", line 166, in builder
    return cls(**builder_kwargs)  # pytype: disable=not-instantiable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 900, in __init__
    super().__init__(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 182, in __init__
    self.info.read_from_directory(self._data_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_info.py", line 363, in read_from_directory
    "Try to load `DatasetInfo` from a directory which does not exist or "
FileNotFoundError: Try to load `DatasetInfo` from a directory which does not exist or does not contain `dataset_info.json`. Please delete the directory `/data/imagenet/imagenet2012/5.1.0`  if you are trying to re-generate the dataset.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 456, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 443, in main
    stats = run(flags.FLAGS)
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 435, in run
    return train_and_eval(params, strategy_override)
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 312, in train_and_eval
    builder.build(strategy) if builder else None for builder in builders
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 312, in <listcomp>
    builder.build(strategy) if builder else None for builder in builders
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 302, in build
    dataset = strategy.distribute_datasets_from_function(self._build)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1161, in distribute_datasets_from_function
    dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 589, in _distribute_datasets_from_function
    options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 169, in get_distributed_datasets_from_function
    input_contexts, dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1579, in __init__
    input_contexts, self._input_workers, dataset_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 2327, in _create_datasets_from_function_with_input_context
    dataset = dataset_fn(ctx)
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 333, in _build
    dataset = builder()
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 343, in load_tfds
    builder = tfds.builder(self.config.name, data_dir=self.config.data_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/load.py", line 166, in builder
    return cls(**builder_kwargs)  # pytype: disable=not-instantiable
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/utils/py_utils.py", line 394, in try_reraise
    reraise(e, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/utils/py_utils.py", line 361, in reraise
    raise exception from e
FileNotFoundError: Failed to construct dataset imagenet2012: Try to load `DatasetInfo` from a directory which does not exist or does not contain `dataset_info.json`. Please delete the directory `/data/imagenet/imagenet2012/5.1.0`  if you are trying to re-generate the dataset.

Using another configuration file, the one in the experiments folder:

# python3 /usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py   --mode=train_and_eval   --model_type=resnet   --dataset=imagenet   --model_dir=$MODEL_DIR   --data_dir=/data/imagenet   --config_file=/data/imagenet/experiment/imagenet_resnet50_gpu.yaml   --params_override='runtime.num_gpus=1'
2021-09-28 12:04:04.909335: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 12:04:04.921396: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 12:04:04.922776: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0928 12:04:04.929126 139733132896064 classifier_trainer.py:181] Base params: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
 'export': {'checkpoint': None, 'destination': None},
 'mode': None,
 'model': {'learning_rate': {'boundaries': [30, 60, 80],
                             'decay_epochs': None,
                             'decay_rate': None,
                             'examples_per_epoch': 1281167,
                             'initial_lr': 0.1,
                             'multipliers': [0.000390625,
                                             3.90625e-05,
                                             3.90625e-06,
                                             3.90625e-07],
                             'name': 'stepwise',
                             'scale_by_batch_size': 0.00390625,
                             'staircase': None,
                             'warmup_epochs': 5},
           'loss': {'label_smoothing': None,
                    'name': 'sparse_categorical_crossentropy'},
           'model_params': {'batch_size': None,
                            'num_classes': 1000,
                            'rescale_inputs': False,
                            'use_l2_regularizer': True},
           'name': 'ResNet',
           'num_classes': 1000,
           'optimizer': {'beta_1': None,
                         'beta_2': None,
                         'decay': 0.9,
                         'epsilon': 0.001,
                         'lookahead': None,
                         'momentum': 0.9,
                         'moving_average_decay': None,
                         'name': 'momentum',
                         'nesterov': None}},
 'model_dir': None,
 'model_name': None,
 'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': False,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': None,
             'mixed_precision_dtype': None,
             'num_cores_per_replica': 1,
             'num_gpus': 0,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': False,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'train': {'callbacks': {'enable_backup_and_restore': False,
                         'enable_checkpoint_and_export': True,
                         'enable_tensorboard': True,
                         'enable_time_history': True},
           'epochs': 90,
           'metrics': ['accuracy', 'top_5'],
           'resume_checkpoint': True,
           'set_epoch_loop': False,
           'steps': None,
           'tensorboard': {'track_lr': True, 'write_model_weights': False},
           'time_history': {'log_steps': 100}},
 'train_dataset': {'augmenter': {'name': None, 'params': None},
                   'batch_size': 128,
                   'builder': 'records',
                   'cache': False,
                   'data_dir': None,
                   'download': False,
                   'dtype': 'float32',
                   'file_shuffle_buffer_size': 1024,
                   'filenames': None,
                   'image_size': 224,
                   'mean_subtract': True,
                   'name': 'imagenet2012',
                   'num_channels': 3,
                   'num_classes': 1000,
                   'num_devices': 1,
                   'num_examples': 1281167,
                   'one_hot': False,
                   'shuffle_buffer_size': 10000,
                   'skip_decoding': True,
                   'split': 'train',
                   'standardize': True,
                   'tf_data_service': None,
                   'use_per_replica_batch_size': True},
 'validation_dataset': {'augmenter': {'name': None, 'params': None},
                        'batch_size': 128,
                        'builder': 'records',
                        'cache': False,
                        'data_dir': None,
                        'download': False,
                        'dtype': 'float32',
                        'file_shuffle_buffer_size': 1024,
                        'filenames': None,
                        'image_size': 224,
                        'mean_subtract': True,
                        'name': 'imagenet2012',
                        'num_channels': 3,
                        'num_classes': 1000,
                        'num_devices': 1,
                        'num_examples': 1281167,
                        'one_hot': False,
                        'shuffle_buffer_size': 10000,
                        'skip_decoding': True,
                        'split': 'validation',
                        'standardize': True,
                        'tf_data_service': None,
                        'use_per_replica_batch_size': True}}
I0928 12:04:04.930369 139733132896064 classifier_trainer.py:184] Overriding params: /data/imagenet/experiment/imagenet_resnet50_gpu.yaml
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 456, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 443, in main
    stats = run(flags.FLAGS)
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 433, in run
    params = _get_params_from_flags(flags_obj)
  File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 185, in _get_params_from_flags
    params = hyperparams.override_params_dict(params, param, is_strict=True)
  File "/usr/local/lib/python3.6/dist-packages/official/modeling/hyperparams/params_dict.py", line 461, in override_params_dict
    params.override(yaml.load(f, Loader=yaml.FullLoader), is_strict)
  File "/usr/local/lib/python3.6/dist-packages/official/modeling/hyperparams/params_dict.py", line 181, in override
    self._override(override_params, is_strict)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/official/modeling/hyperparams/base_config.py", line 219, in _override
    k, type(self)))
KeyError: "The key 'task' does not exist in <class 'official.vision.image_classification.configs.configs.ResNetImagenetConfig'>. To extend the existing keys, use `override` with `is_strict` = False."

Using train.py, I get the following error:

# python3 /usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py --experiment=/data/imagenet/experiment/imagenet_resnet50_gpu.yaml --mode=train_and_eval --model_dir=$MODEL_DIR
2021-09-28 11:28:20.151457: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:28:20.170358: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:28:20.171795: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 70, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 37, in main
    params = train_utils.parse_configuration(FLAGS)
  File "/usr/local/lib/python3.6/dist-packages/official/core/train_utils.py", line 248, in parse_configuration
    params = exp_factory.get_exp_config(flags_obj.experiment)
  File "/usr/local/lib/python3.6/dist-packages/official/core/exp_factory.py", line 36, in get_exp_config
    return get_exp_config_creater(exp_name)()
  File "/usr/local/lib/python3.6/dist-packages/official/core/exp_factory.py", line 31, in get_exp_config_creater
    exp_creater = registry.lookup(_REGISTERED_CONFIGS, exp_name)
  File "/usr/local/lib/python3.6/dist-packages/official/core/registry.py", line 91, in lookup
    entry_name, h_idx))
LookupError: collection path  at position 0 never registered.

To clarify the question: I would like to know the proper way to reproduce the ResNet50 training and obtain the results shown in https://github.com/tensorflow/models/blob/master/official/vision/beta/MODEL_GARDEN.md using the configuration in
https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/image_classification/imagenet_resnet50_gpu.yaml

Also, is there an easier way of processing the ImageNet tars?

saberkun commented 3 years ago

Using official/vision/beta/train.py is right. You need to provide --experiment. https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/image_classification.py#L115 @yeqingli @jaeyounkim I feel we need to have basic documentation of the main models ASAP. The projects already have documentation.
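
In other words, --experiment takes a registered experiment name (e.g. resnet_imagenet) rather than a YAML path, and the experiment YAML goes to --config_file. A corrected form of the earlier command would look roughly like this (paths reused from the attempts above, not re-verified here):

python3 /usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py \
  --experiment=resnet_imagenet \
  --config_file=/data/imagenet/experiment/imagenet_resnet50_gpu.yaml \
  --mode=train_and_eval \
  --model_dir=$MODEL_DIR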

esparig commented 3 years ago

I got it! Thank you for your help, @saberkun. Here is what I did, in case someone finds it useful:

python3 $IMAGE_CLASSIFICATION/train.py --experiment=resnet_imagenet \
  --config_file=$EXPERIMENT/imagenet_resnet50_gpu_custom.yaml \
  --mode=train_and_eval --model_dir=$MODEL_DIR \
  --params_override='runtime.num_gpus=1'

$MODEL_DIR is the path where checkpoints are saved.

$DATA_DIR is the path containing the TFRecords that I built from the tars I downloaded from https://www.image-net.org, using the script https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py but changing lines 350-351 because it didn't pick up my files (a sketch of the invocation follows the listing below). The final contents are the following:

${DATA_DIR}/train-00000-of-01024
${DATA_DIR}/train-00001-of-01024
 ...
${DATA_DIR}/train-01023-of-01024

${DATA_DIR}/validation-00000-of-00128
${DATA_DIR}/validation-00001-of-00128
 ...
${DATA_DIR}/validation-00127-of-00128
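
For reference, a rough sketch of running that conversion locally; the --raw_data_dir, --local_scratch_dir and --nogcs_upload flag names are my assumption about the script's interface (check the flags in your copy of imagenet_to_gcs.py before running), and the directories are placeholders:

# Build the ImageNet TFRecords locally from the downloaded tars, skipping the GCS upload.
python3 imagenet_to_gcs.py \
  --raw_data_dir=/path/to/imagenet_tars \
  --local_scratch_dir=${DATA_DIR} \
  --nogcs_upload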

And the config file imagenet_resnet50_gpu_custom.yaml is:

runtime:
  distribution_strategy: 'mirrored'
  mixed_precision_dtype: 'float16'
  loss_scale: 'dynamic'
task:
  model:
    num_classes: 1001
    input_size: [224, 224, 3]
    backbone:
      type: 'resnet'
      resnet:
        model_id: 50
  losses:
    l2_weight_decay: 0.0001
    one_hot: true
    label_smoothing: 0.1
  train_data:
    input_path: /data/imagenet/train*
    is_training: true
    global_batch_size: 256
    dtype: 'float16'
  validation_data:
    input_path: /data/imagenet/valid*
    is_training: false
    global_batch_size: 256
    dtype: 'float16'
    drop_remainder: false
trainer:
  train_steps: 625
  validation_steps: 25
  validation_interval: 625
  steps_per_loop: 625
  summary_interval: 625
  checkpoint_interval: 625
  optimizer_config:
    optimizer:
      type: 'sgd'
      sgd:
        momentum: 0.9
    learning_rate:
      type: 'stepwise'
      stepwise:
        boundaries: [18750, 37500, 50000]
        values: [0.8, 0.08, 0.008, 0.0008]
    warmup:
      type: 'linear'
      linear:
        warmup_steps: 3125
saberkun commented 3 years ago

I am not sure whether /data/imagenet/train is a GCP bucket path. Should it be gs://data/imagenet/train? @arashwan

arashwan commented 3 years ago

ImageNet needs to be downloaded manually per its license, even if TFDS is used: https://www.tensorflow.org/datasets/catalog/imagenet2012

esparig commented 3 years ago

Exactly, I used a local path: I downloaded the dataset manually and used the imagenet_to_gcs.py script only to generate the TFRecords, with the no-upload flag. Is there an easier way to use the downloaded dataset with the train.py script? By the way, I now have the training running, but the system runs out of memory very soon. It would be great to have the hardware requirements for each experiment.

jackd commented 3 years ago

@esparig I found using tfds much easier. Download the train/validation data to ~/tensorflow_datasets/downloads/manual (or $TFDS_DATA_DIR/downloads/manual), create the tfds-override.yaml file below (global_batch_size included to demonstrate reduced memory usage), and run with:

python $OFFICIAL/vision/beta/train.py \
    --experiment=resnet_imagenet \
    --config_file=$CONFIGS/experiments/image_classification/imagenet_resnet50_gpu.yaml \
    --mode=train_and_eval \
    --model_dir=/tmp/foo \
    --params_override='tfds-override.yaml'

tfds-override.yaml

task:
  train_data:
    input_path: ''
    tfds_name: 'imagenet2012'
    tfds_split: 'train'
    global_batch_size: 2
  validation_data:
    input_path: ''
    tfds_name: 'imagenet2012'
    tfds_split: 'validation'
    global_batch_size: 2

From memory, I had to update tensorflow-datasets to the latest stable release.
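
For example, something like:

pip install --upgrade tensorflow-datasets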

esparig commented 2 years ago

Hello everyone, I tried using tfds as @jackd suggested, but it didn't work. I got an error that says "Not enough disk space", but I do have more than 200GB available. Any further suggestions?

ubuntu@gpu-estibaliz:/hdd500/data/imagenet_tars/imagenet$ ll

total 151020536
drwxr-xr-x 6 root root         4096 Nov 17 11:53 ./
drwxr-xr-x 3 root root         4096 Nov 17 11:53 ../
drwxr-xr-x 4 2016 2016         4096 Jun 14  2012 ILSVRC2012_devkit_t12/
-rw-r--r-- 1 root root      2568145 Jun 15  2012 ILSVRC2012_devkit_t12.tar.gz
-rw-r--r-- 1 root root 147897477120 Jun 14  2012 ILSVRC2012_img_train.tar
-rw-r--r-- 1 root root   6744924160 Jun 14  2012 ILSVRC2012_img_val.tar
drwxr-xr-x 2 root root         4096 Sep 21 16:28 __pycache__/
drwxr-xr-x 2 root root         4096 Nov 17 11:28 experiments/
-rw-r--r-- 1 root root        11629 Sep 21 16:22 imagenet.py
drwxr-xr-x 2 root root         4096 Nov 17 11:32 model_checkpoints/

ubuntu@gpu-estibaliz:/hdd500/data/imagenet_tars/imagenet$ ls experiments/

custom_tfds.yaml  gpu.yaml  imagenet_resnet50_gpu.yaml  imagenet_resnet50_gpu_custom.yaml

ubuntu@gpu-estibaliz:/hdd500/data/imagenet_tars/imagenet$ sudo docker run --net=host  -it --gpus all -v /hdd500/data/imagenet_tars/imagenet:/root/tensorflow_datasets/downloads/manual manualresnet50 /bin/bash

________                               _______________                
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ / 
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/

WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

root@gpu-estibaliz:/# ls /root/tensorflow_datasets/downloads/manual/

ILSVRC2012_devkit_t12  ILSVRC2012_devkit_t12.tar.gz  ILSVRC2012_img_train.tar  ILSVRC2012_img_val.tar  __pycache__  experiments  imagenet.py  model_checkpoints

root@gpu-estibaliz:/# python3 /usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py --experiment=resnet_imagenet --config_file=/root/tensorflow_datasets/downloads/manual/experiments/custom_tfds.yaml --mode=train_and_eval --model_dir=/root/tensorflow_datasets/downloads/manual/model_checkpoints

2021-11-17 11:55:13.435231: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.449271: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.449582: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I1117 11:55:13.472259 140716787398464 train_utils.py:292] Final experiment parameters:
{'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': False,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': 'dynamic',
             'mixed_precision_dtype': 'float16',
             'num_cores_per_replica': 1,
             'num_gpus': 1,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': False,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'task': {'evaluation': {'top_k': 5},
          'init_checkpoint': None,
          'init_checkpoint_modules': 'all',
          'losses': {'l2_weight_decay': 0.0001,
                     'label_smoothing': 0.1,
                     'one_hot': True},
          'model': {'add_head_batch_norm': False,
                    'backbone': {'resnet': {'depth_multiplier': 1.0,
                                            'model_id': 50,
                                            'replace_stem_max_pool': False,
                                            'resnetd_shortcut': False,
                                            'se_ratio': 0.0,
                                            'stem_type': 'v0',
                                            'stochastic_depth_drop_rate': 0.0},
                                 'type': 'resnet'},
                    'dropout_rate': 0.0,
                    'input_size': [224, 224, 3],
                    'norm_activation': {'activation': 'relu',
                                        'norm_epsilon': 1e-05,
                                        'norm_momentum': 0.9,
                                        'use_sync_bn': False},
                    'num_classes': 1001},
          'model_output_keys': [],
          'train_data': {'aug_policy': None,
                         'aug_rand_hflip': True,
                         'aug_type': None,
                         'block_length': 1,
                         'cache': False,
                         'cycle_length': 10,
                         'decode_jpeg_only': True,
                         'deterministic': None,
                         'drop_remainder': True,
                         'dtype': 'float16',
                         'enable_tf_data_service': False,
                         'file_type': 'tfrecord',
                         'global_batch_size': 256,
                         'image_field_key': 'image/encoded',
                         'input_path': '',
                         'is_multilabel': False,
                         'is_training': True,
                         'label_field_key': 'image/class/label',
                         'randaug_magnitude': 10,
                         'seed': None,
                         'sharding': True,
                         'shuffle_buffer_size': 10000,
                         'tf_data_service_address': None,
                         'tf_data_service_job_name': None,
                         'tfds_as_supervised': False,
                         'tfds_data_dir': '',
                         'tfds_name': 'imagenet2012',
                         'tfds_skip_decoding_feature': '',
                         'tfds_split': 'train'},
          'validation_data': {'aug_policy': None,
                              'aug_rand_hflip': True,
                              'aug_type': None,
                              'block_length': 1,
                              'cache': False,
                              'cycle_length': 10,
                              'decode_jpeg_only': True,
                              'deterministic': None,
                              'drop_remainder': False,
                              'dtype': 'float16',
                              'enable_tf_data_service': False,
                              'file_type': 'tfrecord',
                              'global_batch_size': 256,
                              'image_field_key': 'image/encoded',
                              'input_path': '',
                              'is_multilabel': False,
                              'is_training': False,
                              'label_field_key': 'image/class/label',
                              'randaug_magnitude': 10,
                              'seed': None,
                              'sharding': True,
                              'shuffle_buffer_size': 10000,
                              'tf_data_service_address': None,
                              'tf_data_service_job_name': None,
                              'tfds_as_supervised': False,
                              'tfds_data_dir': '',
                              'tfds_name': 'imagenet2012',
                              'tfds_skip_decoding_feature': '',
                              'tfds_split': 'validation'}},
 'trainer': {'allow_tpu_summary': False,
             'best_checkpoint_eval_metric': '',
             'best_checkpoint_export_subdir': '',
             'best_checkpoint_metric_comp': 'higher',
             'checkpoint_interval': 625,
             'continuous_eval_timeout': 3600,
             'eval_tf_function': True,
             'eval_tf_while_loop': False,
             'loss_upper_bound': 1000000.0,
             'max_to_keep': 5,
             'optimizer_config': {'ema': None,
                                  'learning_rate': {'stepwise': {'boundaries': [18750,
                                                                                37500,
                                                                                50000],
                                                                 'name': 'PiecewiseConstantDecay',
                                                                 'offset': 0,
                                                                 'values': [0.8,
                                                                            0.08,
                                                                            0.008,
                                                                            0.0008]},
                                                    'type': 'stepwise'},
                                  'optimizer': {'sgd': {'clipnorm': None,
                                                        'clipvalue': None,
                                                        'decay': 0.0,
                                                        'global_clipnorm': None,
                                                        'momentum': 0.9,
                                                        'name': 'SGD',
                                                        'nesterov': False},
                                                'type': 'sgd'},
                                  'warmup': {'linear': {'name': 'linear',
                                                        'warmup_learning_rate': 0,
                                                        'warmup_steps': 3125},
                                             'type': 'linear'}},
             'recovery_begin_steps': 0,
             'recovery_max_trials': 0,
             'steps_per_loop': 625,
             'summary_interval': 625,
             'train_steps': 56250,
             'train_tf_function': True,
             'train_tf_while_loop': True,
             'validation_interval': 625,
             'validation_steps': 25,
             'validation_summary_subdir': 'validation'}}
I1117 11:55:13.474529 140716787398464 train_utils.py:303] Saving experiment configuration to /root/tensorflow_datasets/downloads/manual/model_checkpoints/params.yaml
2021-11-17 11:55:13.493205: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: Tesla V100-PCIE-32GB, compute capability 7.0
I1117 11:55:13.493584 140716787398464 device_compatibility_check.py:121] Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: Tesla V100-PCIE-32GB, compute capability 7.0
2021-11-17 11:55:13.494813: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-17 11:55:13.495692: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.496071: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.496370: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.416536: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.416872: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.417060: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.417290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30997 MB memory:  -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1117 11:55:15.080295 140716787398464 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1117 11:55:15.082595 140716787398464 train_utils.py:214] Running default trainer.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.146004 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.149153 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.152038 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.152982 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.159561 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.162897 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.428227 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.430712 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.434357 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.435827 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2021-11-17 11:55:17.963453: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Failed precondition: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
I1117 11:55:18.894689 140716787398464 dataset_info.py:443] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.1.0
I1117 11:55:19.755020 140716787398464 dataset_info.py:358] Load dataset info from /tmp/tmpnvfqzrkttfds
I1117 11:55:19.761893 140716787398464 dataset_info.py:413] Field info.description from disk and from code do not match. Keeping the one from code.
I1117 11:55:19.762345 140716787398464 dataset_info.py:413] Field info.module_name from disk and from code do not match. Keeping the one from code.
I1117 11:55:19.762815 140716787398464 dataset_builder.py:400] Generating dataset imagenet2012 (/root/tensorflow_datasets/imagenet2012/5.1.0)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 70, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 63, in main
    model_dir=model_dir)
  File "/usr/local/lib/python3.6/dist-packages/official/core/train_lib.py", line 78, in run_experiment
    params, model_dir))
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.6/dist-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/official/core/train_utils.py", line 225, in create_trainer
    checkpoint_exporter=checkpoint_exporter)
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.6/dist-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/official/core/base_trainer.py", line 259, in __init__
    self.task.build_inputs, self.config.task.train_data)
  File "/usr/local/lib/python3.6/dist-packages/official/core/base_trainer.py", line 159, in distribute_dataset
    *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/orbit/utils/common.py", line 85, in make_distributed_dataset
    return strategy.distribute_datasets_from_function(dataset_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1161, in distribute_datasets_from_function
    dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 589, in _distribute_datasets_from_function
    options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 169, in get_distributed_datasets_from_function
    input_contexts, dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1579, in __init__
    input_contexts, self._input_workers, dataset_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 2327, in _create_datasets_from_function_with_input_context
    dataset = dataset_fn(ctx)
  File "/usr/local/lib/python3.6/dist-packages/orbit/utils/common.py", line 83, in dataset_fn
    return dataset_or_fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/tasks/image_classification.py", line 119, in build_inputs
    dataset = reader.read(input_context=input_context)
  File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 415, in read
    self._tfds_builder)
  File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 335, in _read_decode_and_parse_dataset
    dataset = self._read_tfds(input_context)
  File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 268, in _read_tfds
    self._tfds_builder.download_and_prepare()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 409, in download_and_prepare
    self.info.dataset_size,
OSError: Not enough disk space. Needed: 155.84 GiB (download: Unknown size, generated: 155.84 GiB)
  In call to configurable 'Trainer' (<class 'official.core.base_trainer.Trainer'>)
  In call to configurable 'create_trainer' (<function create_trainer at 0x7ffaa809a1e0>)

root@gpu-estibaliz:/# df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay          79G   12G   63G  16% /
tmpfs            64M     0   64M   0% /dev
tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
/dev/vdb         79G   12G   63G  16% /etc/hosts
/dev/vdd1       492G  266G  201G  57% /root/tensorflow_datasets/downloads/manual
tmpfs           7.7G   12K  7.7G   1% /proc/driver/nvidia
/dev/vda1        14G  3.6G  9.9G  27% /usr/bin/nvidia-smi
tmpfs           1.6G  996K  1.6G   1% /run/nvidia-persistenced/socket
udev            7.7G     0  7.7G   0% /dev/nvidia0
tmpfs           7.7G     0  7.7G   0% /proc/acpi
tmpfs           7.7G     0  7.7G   0% /proc/scsi
tmpfs           7.7G     0  7.7G   0% /sys/firmware
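
Note that the log above shows TFDS generating the dataset under /root/tensorflow_datasets/imagenet2012/5.1.0, which lives on the 79G overlay filesystem (63G free); only downloads/manual is backed by the 492G volume, so the 155.84 GiB check fails even though the large disk has room. One possible workaround (a sketch, assuming the tars are kept under a tensorflow_datasets/downloads/manual directory on the large disk; /hdd500/data/tensorflow_datasets below is a placeholder for that host path) is to mount the whole TFDS data dir from the large disk so the generated records land there too:

# Mount the entire TFDS data dir from the large volume, not just downloads/manual.
# The host directory is expected to contain downloads/manual/ with the ILSVRC2012 tars.
sudo docker run --net=host -it --gpus all \
  -v /hdd500/data/tensorflow_datasets:/root/tensorflow_datasets \
  manualresnet50 /bin/bash

Alternatively, the task.train_data.tfds_data_dir and task.validation_data.tfds_data_dir fields shown in the parameters above could presumably be pointed at a directory on the large volume.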
husenzhang commented 2 years ago

"ValueError: imagenet-2012-tfrecord/train* does not match any files."

Folder structure:

/home/ubuntu/tensorflow_datasets/downloads/
├── manual
│   ├── ILSVRC2012_img_train.tar
│   └── ILSVRC2012_img_val.tar
saberkun commented 2 years ago

The "ValueError: imagenet-2012-tfrecord/train* does not match any files" message means you are not using tfds. We place a placeholder file path there because we cannot host the ImageNet dataset, per its policy; the error is asking you to preprocess ImageNet into TFRecords yourself.

To use tfds, please set the tfds_* fields in the config: https://github.com/tensorflow/models/blob/master/official/core/config_definitions.py#L85

husenzhang commented 2 years ago

Thanks @saberkun! With the tfds_* fields set, train.py now starts. However, I ran into "OSError: Not enough disk space. Needed: 155.84 GiB (download: Unknown size, generated: 155.84 GiB)".

The tfds route seems to mis-estimate the available disk space (via shutil). I took the longer route of converting the JPEGs to TFRecords and was able to get the ResNet50 running on imagenet2012.

laxmareddyp commented 1 year ago

Hi @esparig,

Please find the latest documentation for image classification using TFDS, and please also check the other vision-related tutorials. Thanks.

google-ml-butler[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] commented 1 year ago

Closing as stale. Please reopen if you'd like to work on this further.