Closed by esparig 1 year ago
Similarly for EfficientNet:
root@745616268d5f:/usr/local/lib/python3.6/dist-packages/official/vision/image_classification# python3 classifier_trainer.py --mode=train_and_eval --model_type=efficientnet --dataset=imagenet --model_dir=$MODEL_DIR --data_dir=$DATA_DIR --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml --params_override='runtime.num_gpus=1'
2021-09-15 06:42:04.619103: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:04.629605: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:04.631699: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0915 06:42:04.638748 140551238154048 classifier_trainer.py:181] Base params: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
'export': {'checkpoint': None, 'destination': None},
'mode': None,
'model': {'learning_rate': {'boundaries': None,
'decay_epochs': 2.4,
'decay_rate': 0.97,
'examples_per_epoch': None,
'initial_lr': 0.008,
'multipliers': None,
'name': 'exponential',
'scale_by_batch_size': 0.0078125,
'staircase': True,
'warmup_epochs': 5},
'loss': {'label_smoothing': 0.1, 'name': 'categorical_crossentropy'},
'model_params': {'model_name': 'efficientnet-b0',
'model_weights_path': '',
'overrides': {'activation': 'swish',
'batch_norm': 'default',
'dtype': 'float32',
'num_classes': 1000,
'rescale_input': True},
'weights_format': 'saved_model'},
'name': 'EfficientNet',
'num_classes': 1000,
'optimizer': {'beta_1': None,
'beta_2': None,
'decay': 0.9,
'epsilon': 0.001,
'lookahead': None,
'momentum': 0.9,
'moving_average_decay': None,
'name': 'rmsprop',
'nesterov': None}},
'model_dir': None,
'model_name': None,
'runtime': {'all_reduce_alg': None,
'batchnorm_spatial_persistent': False,
'dataset_num_private_threads': None,
'default_shard_dim': -1,
'distribution_strategy': 'mirrored',
'enable_xla': False,
'gpu_thread_mode': None,
'loss_scale': None,
'mixed_precision_dtype': None,
'num_cores_per_replica': 1,
'num_gpus': 0,
'num_packs': 1,
'per_gpu_thread_count': 0,
'run_eagerly': False,
'task_index': -1,
'tpu': None,
'tpu_enable_xla_dynamic_padder': None,
'worker_hosts': None},
'train': {'callbacks': {'enable_backup_and_restore': False,
'enable_checkpoint_and_export': True,
'enable_tensorboard': True,
'enable_time_history': True},
'epochs': 500,
'metrics': ['accuracy', 'top_5'],
'resume_checkpoint': True,
'set_epoch_loop': False,
'steps': None,
'tensorboard': {'track_lr': True, 'write_model_weights': False},
'time_history': {'log_steps': 100}},
'train_dataset': {'augmenter': {'name': None, 'params': None},
'batch_size': 128,
'builder': 'records',
'cache': False,
'data_dir': None,
'download': False,
'dtype': 'float32',
'file_shuffle_buffer_size': 1024,
'filenames': None,
'image_size': 224,
'mean_subtract': False,
'name': 'imagenet2012',
'num_channels': 3,
'num_classes': 1000,
'num_devices': 1,
'num_examples': 1281167,
'one_hot': True,
'shuffle_buffer_size': 10000,
'skip_decoding': True,
'split': 'train',
'standardize': False,
'tf_data_service': None,
'use_per_replica_batch_size': True},
'validation_dataset': {'augmenter': {'name': None, 'params': None},
'batch_size': 128,
'builder': 'records',
'cache': False,
'data_dir': None,
'download': False,
'dtype': 'float32',
'file_shuffle_buffer_size': 1024,
'filenames': None,
'image_size': 224,
'mean_subtract': False,
'name': 'imagenet2012',
'num_channels': 3,
'num_classes': 1000,
'num_devices': 1,
'num_examples': 1281167,
'one_hot': True,
'shuffle_buffer_size': 10000,
'skip_decoding': True,
'split': 'validation',
'standardize': False,
'tf_data_service': None,
'use_per_replica_batch_size': True}}
I0915 06:42:04.642026 140551238154048 classifier_trainer.py:184] Overriding params: configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml
I0915 06:42:04.653718 140551238154048 classifier_trainer.py:184] Overriding params: runtime.num_gpus=1
I0915 06:42:04.654549 140551238154048 classifier_trainer.py:184] Overriding params: {'model_dir': '', 'mode': 'train_and_eval', 'model': {'name': 'efficientnet'}, 'runtime': {'run_eagerly': None, 'tpu': None}, 'train_dataset': {'data_dir': ''}, 'validation_dataset': {'data_dir': ''}, 'train': {'time_history': {'log_steps': 100}}}
I0915 06:42:04.656747 140551238154048 classifier_trainer.py:190] Final model parameters: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
'export': {'checkpoint': None, 'destination': None},
'mode': 'train_and_eval',
'model': {'learning_rate': {'boundaries': None,
'decay_epochs': 2.4,
'decay_rate': 0.97,
'examples_per_epoch': None,
'initial_lr': 0.008,
'multipliers': None,
'name': 'exponential',
'scale_by_batch_size': 0.0078125,
'staircase': True,
'warmup_epochs': 5},
'loss': {'label_smoothing': 0.1, 'name': 'categorical_crossentropy'},
'model_params': {'model_name': 'efficientnet-b0',
'model_weights_path': '',
'overrides': {'activation': 'swish',
'batch_norm': 'default',
'dtype': 'float32',
'num_classes': 1000,
'rescale_input': True},
'weights_format': 'saved_model'},
'name': 'efficientnet',
'num_classes': 1000,
'optimizer': {'beta_1': None,
'beta_2': None,
'decay': 0.9,
'epsilon': 0.001,
'lookahead': False,
'momentum': 0.9,
'moving_average_decay': 0.0,
'name': 'rmsprop',
'nesterov': None}},
'model_dir': '',
'model_name': None,
'runtime': {'all_reduce_alg': None,
'batchnorm_spatial_persistent': False,
'dataset_num_private_threads': None,
'default_shard_dim': -1,
'distribution_strategy': 'mirrored',
'enable_xla': False,
'gpu_thread_mode': None,
'loss_scale': None,
'mixed_precision_dtype': None,
'num_cores_per_replica': 1,
'num_gpus': 1,
'num_packs': 1,
'per_gpu_thread_count': 0,
'run_eagerly': None,
'task_index': -1,
'tpu': None,
'tpu_enable_xla_dynamic_padder': None,
'worker_hosts': None},
'train': {'callbacks': {'enable_backup_and_restore': False,
'enable_checkpoint_and_export': True,
'enable_tensorboard': True,
'enable_time_history': True},
'epochs': 500,
'metrics': ['accuracy', 'top_5'],
'resume_checkpoint': True,
'set_epoch_loop': False,
'steps': None,
'tensorboard': {'track_lr': True, 'write_model_weights': False},
'time_history': {'log_steps': 100}},
'train_dataset': {'augmenter': {'name': 'autoaugment', 'params': None},
'batch_size': 32,
'builder': 'records',
'cache': False,
'data_dir': '',
'download': False,
'dtype': 'float32',
'file_shuffle_buffer_size': 1024,
'filenames': None,
'image_size': 224,
'mean_subtract': False,
'name': 'imagenet2012',
'num_channels': 3,
'num_classes': 1000,
'num_devices': 1,
'num_examples': 1281167,
'one_hot': True,
'shuffle_buffer_size': 10000,
'skip_decoding': True,
'split': 'train',
'standardize': False,
'tf_data_service': None,
'use_per_replica_batch_size': True},
'validation_dataset': {'augmenter': {'name': None, 'params': None},
'batch_size': 32,
'builder': 'records',
'cache': False,
'data_dir': '',
'download': False,
'dtype': 'float32',
'file_shuffle_buffer_size': 1024,
'filenames': None,
'image_size': 224,
'mean_subtract': False,
'name': 'imagenet2012',
'num_channels': 3,
'num_classes': 1000,
'num_devices': 1,
'num_examples': 50000,
'one_hot': True,
'shuffle_buffer_size': 10000,
'skip_decoding': True,
'split': 'validation',
'standardize': False,
'tf_data_service': None,
'use_per_replica_batch_size': True}}
I0915 06:42:04.657643 140551238154048 classifier_trainer.py:290] Running train and eval.
2021-09-15 06:42:04.658841: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-15 06:42:04.659902: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:04.661132: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:04.662254: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:05.669043: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:05.670517: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:05.671733: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 06:42:05.672929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30995 MB memory: -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0915 06:42:06.720428 140551238154048 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0915 06:42:06.722068 140551238154048 classifier_trainer.py:305] Detected 1 devices.
W0915 06:42:06.722257 140551238154048 classifier_trainer.py:105] label_smoothing > 0, so datasets will be one hot encoded.
I0915 06:42:06.722772 140551238154048 dataset_factory.py:176] Using augmentation: autoaugment
I0915 06:42:06.723030 140551238154048 dataset_factory.py:176] Using augmentation: None
I0915 06:42:06.723595 140551238154048 dataset_factory.py:369] Using TFRecords to load data.
Traceback (most recent call last):
File "classifier_trainer.py", line 456, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "classifier_trainer.py", line 443, in main
stats = run(flags.FLAGS)
File "classifier_trainer.py", line 435, in run
return train_and_eval(params, strategy_override)
File "classifier_trainer.py", line 312, in train_and_eval
builder.build(strategy) if builder else None for builder in builders
File "classifier_trainer.py", line 312, in <listcomp>
builder.build(strategy) if builder else None for builder in builders
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 302, in build
dataset = strategy.distribute_datasets_from_function(self._build)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1161, in distribute_datasets_from_function
dataset_fn, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 589, in _distribute_datasets_from_function
options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 169, in get_distributed_datasets_from_function
input_contexts, dataset_fn, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1579, in __init__
input_contexts, self._input_workers, dataset_fn))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 2327, in _create_datasets_from_function_with_input_context
dataset = dataset_fn(ctx)
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 333, in _build
dataset = builder()
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 376, in load_records
dataset = tf.data.Dataset.list_files(file_pattern, shuffle=False)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 1230, in list_files
condition, [message], summarize=1, name="assert_not_empty")
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
return target(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/tf_should_use.py", line 247, in wrapped
return _add_should_use_warning(fn(*args, **kwargs),
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 164, in Assert
(condition, "\n".join(data_str)))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected 'tf.Tensor(False, shape=(), dtype=bool)' to be true. Summarized data: b'No files matched pattern: train*'
Do you have the ImageNet dataset processed and ready? The tutorials do not include processed ImageNet TFRecords because of licensing restrictions.
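One way to check this before launching the trainer is sketched below. It assumes the records loader looks for files named `<split>*` directly under `data_dir`, which is what the error message `No files matched pattern: train*` suggests. The log above also shows `'data_dir': ''`, so `$DATA_DIR` may simply have been unset. The helper names `files_for_split` and `missing_splits` are hypothetical and not part of the Model Garden.

```python
import glob
import os


def files_for_split(data_dir, split):
    """Return regular files under data_dir whose names start with the split name.

    Assumption: the trainer globs "<data_dir>/<split>*". With data_dir == ""
    the pattern collapses to "<split>*" relative to the current directory.
    """
    pattern = os.path.join(data_dir, split + "*")
    # Keep regular files only, so a raw-image folder named "train" is ignored.
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


def missing_splits(data_dir, splits=("train", "validation")):
    """List the splits for which no matching file was found."""
    return [s for s in splits if not files_for_split(data_dir, s)]


# Example use (path is a placeholder):
#   print(missing_splits("/path/to/tfrecords"))
```

An empty list means both splits have at least one candidate file; it does not verify that the files are valid TFRecords.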
Yes, I figured out that I needed the TFRecords. This is what I did:
export IMAGENET_HOME=<my_imagenet_folder>
mkdir -p $IMAGENET_HOME/validation
mkdir -p $IMAGENET_HOME/train
tar xf ILSVRC2012_img_val.tar -C $IMAGENET_HOME/validation
tar xf ILSVRC2012_img_train.tar -C $IMAGENET_HOME/train
cd $IMAGENET_HOME/train
for f in *.tar; do
  d=$(basename "$f" .tar)
  mkdir "$d"
  tar xf "$f" -C "$d"
  rm "$f"  # remove the per-class tar once it has been extracted
done
wget -O $IMAGENET_HOME/synset_labels.txt https://raw.githubusercontent.com/tensorflow/models/c7df5a3dde886509fbd1c7b317f76fb876f23506/research/inception/inception/data/imagenet_2012_validation_synset_labels.txt
wget -O $IMAGENET_HOME/imagenet_to_gcs.py https://raw.githubusercontent.com/tensorflow/tpu/8cca0ff35e1d8c6fcd1dfac98978495ff2cadb84/tools/datasets/imagenet_to_gcs.py
sudo docker run -v $IMAGENET_HOME:/data/imagenet --gpus all -it --rm tensorflow/tensorflow:latest-gpu /bin/bash
python3 -m pip install --upgrade pip && pip install tf-models-official && pip install gcloud google-cloud-storage
export IMAGENET_HOME=<my_imagenet_folder>
Remember to get the script from GitHub first. The TFRecords will end up in the directory given by --local_scratch_dir. To upload to GCS with this method, leave off --nogcs_upload
and provide the GCS flags for project and output_path.
python3 imagenet_to_gcs.py --raw_data_dir=$IMAGENET_HOME --local_scratch_dir=$IMAGENET_HOME/tfrecord --nogcs_upload
So far I have train and validation folders containing the TFRecords in /data/imagenet/imagenet2012/5.1.0/.
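A quick sanity check of the preparation steps above can be sketched as follows. It only counts directory entries. It assumes the raw data lives in `train` and `validation` under the ImageNet folder, as created by the `mkdir` commands, and that the converted shards sit in `train` and `validation` subfolders of the TFRecord directory. The names `count_entries` and `summarize` are hypothetical.

```python
import os


def count_entries(path):
    """Number of directory entries under path, or 0 if path is not a directory."""
    return len(os.listdir(path)) if os.path.isdir(path) else 0


def summarize(imagenet_home, tfrecord_dir):
    """Collect entry counts for the raw folders and the converted shard folders."""
    return {
        # One sub-folder per class is expected after the per-class tars are unpacked.
        "train_class_dirs": count_entries(os.path.join(imagenet_home, "train")),
        # The validation tar unpacks to a flat folder of images.
        "validation_images": count_entries(os.path.join(imagenet_home, "validation")),
        # Assumed layout: shards grouped by split under tfrecord_dir.
        "train_shards": count_entries(os.path.join(tfrecord_dir, "train")),
        "validation_shards": count_entries(os.path.join(tfrecord_dir, "validation")),
    }


# Example use (paths are placeholders):
#   print(summarize("/data/imagenet", "/data/imagenet/tfrecord"))
```

If extraction went as intended, the counts should be consistent with the logged parameters: `num_classes` of 1000 for the class folders and `num_examples` of 50000 for the validation images. A shard count of 0 would point to a conversion or path problem.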
python3 /usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py --mode=train_and_eval --model_type=resnet --dataset=imagenet --model_dir=$MODEL_DIR --data_dir=/data/imagenet --config_file=/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/configs/examples/resnet/imagenet/gpu.yaml --params_override='runtime.num_gpus=1'
2021-09-28 11:25:20.797826: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:20.810368: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:20.811266: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0928 11:25:20.815533 140110323226432 classifier_trainer.py:181] Base params: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
'export': {'checkpoint': None, 'destination': None},
'mode': None,
'model': {'learning_rate': {'boundaries': [30, 60, 80],
'decay_epochs': None,
'decay_rate': None,
'examples_per_epoch': 1281167,
'initial_lr': 0.1,
'multipliers': [0.000390625,
3.90625e-05,
3.90625e-06,
3.90625e-07],
'name': 'stepwise',
'scale_by_batch_size': 0.00390625,
'staircase': None,
'warmup_epochs': 5},
'loss': {'label_smoothing': None,
'name': 'sparse_categorical_crossentropy'},
'model_params': {'batch_size': None,
'num_classes': 1000,
'rescale_inputs': False,
'use_l2_regularizer': True},
'name': 'ResNet',
'num_classes': 1000,
'optimizer': {'beta_1': None,
'beta_2': None,
'decay': 0.9,
'epsilon': 0.001,
'lookahead': None,
'momentum': 0.9,
'moving_average_decay': None,
'name': 'momentum',
'nesterov': None}},
'model_dir': None,
'model_name': None,
'runtime': {'all_reduce_alg': None,
'batchnorm_spatial_persistent': False,
'dataset_num_private_threads': None,
'default_shard_dim': -1,
'distribution_strategy': 'mirrored',
'enable_xla': False,
'gpu_thread_mode': None,
'loss_scale': None,
'mixed_precision_dtype': None,
'num_cores_per_replica': 1,
'num_gpus': 0,
'num_packs': 1,
'per_gpu_thread_count': 0,
'run_eagerly': False,
'task_index': -1,
'tpu': None,
'tpu_enable_xla_dynamic_padder': None,
'worker_hosts': None},
'train': {'callbacks': {'enable_backup_and_restore': False,
'enable_checkpoint_and_export': True,
'enable_tensorboard': True,
'enable_time_history': True},
'epochs': 90,
'metrics': ['accuracy', 'top_5'],
'resume_checkpoint': True,
'set_epoch_loop': False,
'steps': None,
'tensorboard': {'track_lr': True, 'write_model_weights': False},
'time_history': {'log_steps': 100}},
'train_dataset': {'augmenter': {'name': None, 'params': None},
'batch_size': 128,
'builder': 'records',
'cache': False,
'data_dir': None,
'download': False,
'dtype': 'float32',
'file_shuffle_buffer_size': 1024,
'filenames': None,
'image_size': 224,
'mean_subtract': True,
'name': 'imagenet2012',
'num_channels': 3,
'num_classes': 1000,
'num_devices': 1,
'num_examples': 1281167,
'one_hot': False,
'shuffle_buffer_size': 10000,
'skip_decoding': True,
'split': 'train',
'standardize': True,
'tf_data_service': None,
'use_per_replica_batch_size': True},
'validation_dataset': {'augmenter': {'name': None, 'params': None},
'batch_size': 128,
'builder': 'records',
'cache': False,
'data_dir': None,
'download': False,
'dtype': 'float32',
'file_shuffle_buffer_size': 1024,
'filenames': None,
'image_size': 224,
'mean_subtract': True,
'name': 'imagenet2012',
'num_channels': 3,
'num_classes': 1000,
'num_devices': 1,
'num_examples': 1281167,
'one_hot': False,
'shuffle_buffer_size': 10000,
'skip_decoding': True,
'split': 'validation',
'standardize': True,
'tf_data_service': None,
'use_per_replica_batch_size': True}}
I0928 11:25:20.816565 140110323226432 classifier_trainer.py:184] Overriding params: /usr/local/lib/python3.6/dist-packages/official/vision/image_classification/configs/examples/resnet/imagenet/gpu.yaml
I0928 11:25:20.822834 140110323226432 classifier_trainer.py:184] Overriding params: runtime.num_gpus=1
I0928 11:25:20.823509 140110323226432 classifier_trainer.py:184] Overriding params: {'model_dir': '/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/resnet', 'mode': 'train_and_eval', 'model': {'name': 'resnet'}, 'runtime': {'run_eagerly': None, 'tpu': None}, 'train_dataset': {'data_dir': '/data/imagenet'}, 'validation_dataset': {'data_dir': '/data/imagenet'}, 'train': {'time_history': {'log_steps': 100}}}
I0928 11:25:20.825434 140110323226432 classifier_trainer.py:190] Final model parameters: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
'export': {'checkpoint': None, 'destination': None},
'mode': 'train_and_eval',
'model': {'learning_rate': {'boundaries': [30, 60, 80],
'decay_epochs': None,
'decay_rate': None,
'examples_per_epoch': 1281167,
'initial_lr': 0.1,
'multipliers': [0.000390625,
3.90625e-05,
3.90625e-06,
3.90625e-07],
'name': 'stepwise',
'scale_by_batch_size': 0.00390625,
'staircase': None,
'warmup_epochs': 5},
'loss': {'label_smoothing': 0.1,
'name': 'sparse_categorical_crossentropy'},
'model_params': {'batch_size': None,
'num_classes': 1000,
'rescale_inputs': False,
'use_l2_regularizer': True},
'name': 'resnet',
'num_classes': 1000,
'optimizer': {'beta_1': None,
'beta_2': None,
'decay': 0.9,
'epsilon': 0.001,
'lookahead': None,
'momentum': 0.9,
'moving_average_decay': None,
'name': 'momentum',
'nesterov': None}},
'model_dir': '/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/resnet',
'model_name': None,
'runtime': {'all_reduce_alg': None,
'batchnorm_spatial_persistent': True,
'dataset_num_private_threads': None,
'default_shard_dim': -1,
'distribution_strategy': 'mirrored',
'enable_xla': False,
'gpu_thread_mode': None,
'loss_scale': None,
'mixed_precision_dtype': None,
'num_cores_per_replica': 1,
'num_gpus': 1,
'num_packs': 1,
'per_gpu_thread_count': 0,
'run_eagerly': None,
'task_index': -1,
'tpu': None,
'tpu_enable_xla_dynamic_padder': None,
'worker_hosts': None},
'train': {'callbacks': {'enable_backup_and_restore': False,
'enable_checkpoint_and_export': True,
'enable_tensorboard': True,
'enable_time_history': True},
'epochs': 1,
'metrics': ['accuracy', 'top_5'],
'resume_checkpoint': True,
'set_epoch_loop': False,
'steps': None,
'tensorboard': {'track_lr': True, 'write_model_weights': False},
'time_history': {'log_steps': 100}},
'train_dataset': {'augmenter': {'name': None, 'params': None},
'batch_size': 256,
'builder': 'tfds',
'cache': False,
'data_dir': '/data/imagenet',
'download': False,
'dtype': 'float16',
'file_shuffle_buffer_size': 1024,
'filenames': None,
'image_size': 224,
'mean_subtract': True,
'name': 'imagenet2012',
'num_channels': 3,
'num_classes': 1000,
'num_devices': 1,
'num_examples': 1281167,
'one_hot': False,
'shuffle_buffer_size': 10000,
'skip_decoding': True,
'split': 'train',
'standardize': True,
'tf_data_service': None,
'use_per_replica_batch_size': True},
'validation_dataset': {'augmenter': {'name': None, 'params': None},
'batch_size': 256,
'builder': 'tfds',
'cache': False,
'data_dir': '/data/imagenet',
'download': False,
'dtype': 'float16',
'file_shuffle_buffer_size': 1024,
'filenames': None,
'image_size': 224,
'mean_subtract': True,
'name': 'imagenet2012',
'num_channels': 3,
'num_classes': 1000,
'num_devices': 1,
'num_examples': 50000,
'one_hot': False,
'shuffle_buffer_size': 10000,
'skip_decoding': True,
'split': 'validation',
'standardize': True,
'tf_data_service': None,
'use_per_replica_batch_size': True}}
I0928 11:25:20.825589 140110323226432 classifier_trainer.py:290] Running train and eval.
2021-09-28 11:25:20.826318: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-28 11:25:20.826680: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:20.827574: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:20.828412: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:21.797719: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:21.798760: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:21.799737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:25:21.800690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30995 MB memory: -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0928 11:25:22.858247 140110323226432 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0928 11:25:22.859499 140110323226432 classifier_trainer.py:305] Detected 1 devices.
W0928 11:25:22.859601 140110323226432 classifier_trainer.py:105] label_smoothing > 0, so datasets will be one hot encoded.
I0928 11:25:22.859818 140110323226432 dataset_factory.py:176] Using augmentation: None
I0928 11:25:22.859973 140110323226432 dataset_factory.py:176] Using augmentation: None
I0928 11:25:22.860244 140110323226432 dataset_factory.py:341] Using TFDS to load data.
I0928 11:25:22.863063 140110323226432 dataset_info.py:358] Load dataset info from /data/imagenet/imagenet2012/5.1.0
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/utils/py_utils.py", line 392, in try_reraise
yield
File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/load.py", line 166, in builder
return cls(**builder_kwargs) # pytype: disable=not-instantiable
File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 900, in __init__
super().__init__(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 182, in __init__
self.info.read_from_directory(self._data_dir)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_info.py", line 363, in read_from_directory
"Try to load `DatasetInfo` from a directory which does not exist or "
FileNotFoundError: Try to load `DatasetInfo` from a directory which does not exist or does not contain `dataset_info.json`. Please delete the directory `/data/imagenet/imagenet2012/5.1.0` if you are trying to re-generate the dataset.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 456, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 443, in main
stats = run(flags.FLAGS)
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 435, in run
return train_and_eval(params, strategy_override)
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 312, in train_and_eval
builder.build(strategy) if builder else None for builder in builders
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 312, in <listcomp>
builder.build(strategy) if builder else None for builder in builders
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 302, in build
dataset = strategy.distribute_datasets_from_function(self._build)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1161, in distribute_datasets_from_function
dataset_fn, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 589, in _distribute_datasets_from_function
options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 169, in get_distributed_datasets_from_function
input_contexts, dataset_fn, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1579, in __init__
input_contexts, self._input_workers, dataset_fn))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 2327, in _create_datasets_from_function_with_input_context
dataset = dataset_fn(ctx)
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 333, in _build
dataset = builder()
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/dataset_factory.py", line 343, in load_tfds
builder = tfds.builder(self.config.name, data_dir=self.config.data_dir)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/load.py", line 166, in builder
return cls(**builder_kwargs) # pytype: disable=not-instantiable
File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/utils/py_utils.py", line 394, in try_reraise
reraise(e, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/utils/py_utils.py", line 361, in reraise
raise exception from e
FileNotFoundError: Failed to construct dataset imagenet2012: Try to load `DatasetInfo` from a directory which does not exist or does not contain `dataset_info.json`. Please delete the directory `/data/imagenet/imagenet2012/5.1.0` if you are trying to re-generate the dataset.
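This second failure comes from the TFDS loader rather than the records loader. The final parameters show `'builder': 'tfds'`, and TFDS looks for `dataset_info.json` under `/data/imagenet/imagenet2012/5.1.0`. A folder holding only raw TFRecord shards does not satisfy that. The sketch below guesses which `builder` value a given `data_dir` could support. It assumes the TFDS layout `<data_dir>/<name>/<version>/dataset_info.json` seen in the error message and the `<split>*` file pattern seen in the earlier error. `guess_builder` is a hypothetical helper, not a Model Garden or TFDS function.

```python
import glob
import os


def guess_builder(data_dir, name="imagenet2012", version="5.1.0"):
    """Return "tfds", "records", or None depending on what data_dir contains."""
    # TFDS layout, per the path reported in the FileNotFoundError above.
    info_path = os.path.join(data_dir, name, version, "dataset_info.json")
    if os.path.isfile(info_path):
        return "tfds"
    # Plain TFRecord shards: assumed to be files named "train*" directly in data_dir.
    shards = [p for p in glob.glob(os.path.join(data_dir, "train*")) if os.path.isfile(p)]
    if shards:
        return "records"
    return None


# Example use (path is a placeholder):
#   print(guess_builder("/data/imagenet"))
```

If this reports `records`, the `builder` fields of `train_dataset` and `validation_dataset` in the YAML config would presumably need to be `records` as well, with `data_dir` pointing at the folder that directly holds the shards. If it reports `None`, neither layout was found at that path.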
# python3 /usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py --mode=train_and_eval --model_type=resnet --dataset=imagenet --model_dir=$MODEL_DIR --data_dir=/data/imagenet --config_file=/data/imagenet/experiment/imagenet_resnet50_gpu.yaml --params_override='runtime.num_gpus=1'
2021-09-28 12:04:04.909335: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 12:04:04.921396: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 12:04:04.922776: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0928 12:04:04.929126 139733132896064 classifier_trainer.py:181] Base params: {'evaluation': {'epochs_between_evals': 1, 'skip_eval': False, 'steps': None},
'export': {'checkpoint': None, 'destination': None},
'mode': None,
'model': {'learning_rate': {'boundaries': [30, 60, 80],
'decay_epochs': None,
'decay_rate': None,
'examples_per_epoch': 1281167,
'initial_lr': 0.1,
'multipliers': [0.000390625,
3.90625e-05,
3.90625e-06,
3.90625e-07],
'name': 'stepwise',
'scale_by_batch_size': 0.00390625,
'staircase': None,
'warmup_epochs': 5},
'loss': {'label_smoothing': None,
'name': 'sparse_categorical_crossentropy'},
'model_params': {'batch_size': None,
'num_classes': 1000,
'rescale_inputs': False,
'use_l2_regularizer': True},
'name': 'ResNet',
'num_classes': 1000,
'optimizer': {'beta_1': None,
'beta_2': None,
'decay': 0.9,
'epsilon': 0.001,
'lookahead': None,
'momentum': 0.9,
'moving_average_decay': None,
'name': 'momentum',
'nesterov': None}},
'model_dir': None,
'model_name': None,
'runtime': {'all_reduce_alg': None,
'batchnorm_spatial_persistent': False,
'dataset_num_private_threads': None,
'default_shard_dim': -1,
'distribution_strategy': 'mirrored',
'enable_xla': False,
'gpu_thread_mode': None,
'loss_scale': None,
'mixed_precision_dtype': None,
'num_cores_per_replica': 1,
'num_gpus': 0,
'num_packs': 1,
'per_gpu_thread_count': 0,
'run_eagerly': False,
'task_index': -1,
'tpu': None,
'tpu_enable_xla_dynamic_padder': None,
'worker_hosts': None},
'train': {'callbacks': {'enable_backup_and_restore': False,
'enable_checkpoint_and_export': True,
'enable_tensorboard': True,
'enable_time_history': True},
'epochs': 90,
'metrics': ['accuracy', 'top_5'],
'resume_checkpoint': True,
'set_epoch_loop': False,
'steps': None,
'tensorboard': {'track_lr': True, 'write_model_weights': False},
'time_history': {'log_steps': 100}},
'train_dataset': {'augmenter': {'name': None, 'params': None},
'batch_size': 128,
'builder': 'records',
'cache': False,
'data_dir': None,
'download': False,
'dtype': 'float32',
'file_shuffle_buffer_size': 1024,
'filenames': None,
'image_size': 224,
'mean_subtract': True,
'name': 'imagenet2012',
'num_channels': 3,
'num_classes': 1000,
'num_devices': 1,
'num_examples': 1281167,
'one_hot': False,
'shuffle_buffer_size': 10000,
'skip_decoding': True,
'split': 'train',
'standardize': True,
'tf_data_service': None,
'use_per_replica_batch_size': True},
'validation_dataset': {'augmenter': {'name': None, 'params': None},
'batch_size': 128,
'builder': 'records',
'cache': False,
'data_dir': None,
'download': False,
'dtype': 'float32',
'file_shuffle_buffer_size': 1024,
'filenames': None,
'image_size': 224,
'mean_subtract': True,
'name': 'imagenet2012',
'num_channels': 3,
'num_classes': 1000,
'num_devices': 1,
'num_examples': 1281167,
'one_hot': False,
'shuffle_buffer_size': 10000,
'skip_decoding': True,
'split': 'validation',
'standardize': True,
'tf_data_service': None,
'use_per_replica_batch_size': True}}
I0928 12:04:04.930369 139733132896064 classifier_trainer.py:184] Overriding params: /data/imagenet/experiment/imagenet_resnet50_gpu.yaml
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 456, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 443, in main
stats = run(flags.FLAGS)
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 433, in run
params = _get_params_from_flags(flags_obj)
File "/usr/local/lib/python3.6/dist-packages/official/vision/image_classification/classifier_trainer.py", line 185, in _get_params_from_flags
params = hyperparams.override_params_dict(params, param, is_strict=True)
File "/usr/local/lib/python3.6/dist-packages/official/modeling/hyperparams/params_dict.py", line 461, in override_params_dict
params.override(yaml.load(f, Loader=yaml.FullLoader), is_strict)
File "/usr/local/lib/python3.6/dist-packages/official/modeling/hyperparams/params_dict.py", line 181, in override
self._override(override_params, is_strict) # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/official/modeling/hyperparams/base_config.py", line 219, in _override
k, type(self)))
KeyError: "The key 'task' does not exist in <class 'official.vision.image_classification.configs.configs.ResNetImagenetConfig'>. To extend the existing keys, use `override` with `is_strict` = False."
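For readers hitting the same KeyError: the message suggests that classifier_trainer.py applies the YAML with a strict override, so any top-level key missing from the legacy config (here 'task') is rejected. Below is a minimal sketch of that behaviour using plain dicts. The `override` function and both dictionaries are hypothetical stand-ins, not the code in params_dict.py or base_config.py.

```python
def override(params, overrides, is_strict=True):
    """Recursively merge `overrides` into `params` (plain nested dicts).

    With is_strict=True, a key that is missing from `params` raises KeyError,
    which is the behaviour the error message above describes.
    """
    for key, value in overrides.items():
        if key not in params:
            if is_strict:
                raise KeyError(f"The key {key!r} does not exist in the base config.")
            params[key] = value
        elif isinstance(value, dict) and isinstance(params[key], dict):
            override(params[key], value, is_strict)
        else:
            params[key] = value
    return params


# Hypothetical stand-ins: a few top-level keys as printed under "Base params",
# and an override written for the newer task/trainer layout.
legacy_params = {"model": {"name": "ResNet"}, "runtime": {"num_gpus": 0}, "train": {"epochs": 90}}
new_style_yaml = {"task": {"model": {"num_classes": 1001}}}

try:
    override(legacy_params, new_style_yaml, is_strict=True)
except KeyError as err:
    print("strict override failed:", err)
```

Under this reading, a YAML written for the task/trainer layout would not load in the legacy trainer regardless of its contents.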
# python3 /usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py --experiment=/data/imagenet/experiment/imagenet_resnet50_gpu.yaml --mode=train_and_eval --model_dir=$MODEL_DIR
2021-09-28 11:28:20.151457: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:28:20.170358: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-28 11:28:20.171795: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 70, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 37, in main
params = train_utils.parse_configuration(FLAGS)
File "/usr/local/lib/python3.6/dist-packages/official/core/train_utils.py", line 248, in parse_configuration
params = exp_factory.get_exp_config(flags_obj.experiment)
File "/usr/local/lib/python3.6/dist-packages/official/core/exp_factory.py", line 36, in get_exp_config
return get_exp_config_creater(exp_name)()
File "/usr/local/lib/python3.6/dist-packages/official/core/exp_factory.py", line 31, in get_exp_config_creater
exp_creater = registry.lookup(_REGISTERED_CONFIGS, exp_name)
File "/usr/local/lib/python3.6/dist-packages/official/core/registry.py", line 91, in lookup
entry_name, h_idx))
LookupError: collection path at position 0 never registered.
To clarify the question: what is the proper way to reproduce the ResNet-50 training and get the results shown in https://github.com/tensorflow/models/blob/master/official/vision/beta/MODEL_GARDEN.md using the configuration in
https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/image_classification/imagenet_resnet50_gpu.yaml
Also, is there an easier way of processing the Imagenet tars?
Using official/vision/beta/train.py is right. You need to provide --experiment. https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/image_classification.py#L115 @yeqingli @jaeyounkim I feel we need to have basic documentation of the main models ASAP. The projects already have documentation.
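To illustrate why passing the YAML path to --experiment fails: the flag is looked up as a registered experiment name, not read as a file. The sketch below is a simplified illustration of a '/'-separated registry lookup, with a hypothetical `lookup` function and a hypothetical `REGISTRY` dict; it is not the code in official/core/registry.py. In this sketch a path with a leading '/' yields an empty first component, which would be consistent with the blank name in the LookupError above.

```python
def lookup(registry, name):
    """Walk a '/'-separated name through nested dicts (simplified illustration)."""
    node = registry
    for position, part in enumerate(name.split("/")):
        if not isinstance(node, dict) or part not in node:
            raise LookupError(
                f"collection path {part!r} at position {position} never registered."
            )
        node = node[part]
    return node


# Hypothetical registry holding one experiment factory under its registered name.
REGISTRY = {"resnet_imagenet": lambda: {"task": "image_classification"}}

for candidate in ("resnet_imagenet", "/data/imagenet/experiment/imagenet_resnet50_gpu.yaml"):
    try:
        config = lookup(REGISTRY, candidate)()
        print(candidate, "->", config)
    except LookupError as err:
        print(candidate, "-> LookupError:", err)
```

The YAML file then goes to --config_file, as in the working command further down the thread.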
I got it! Thank you for your help, @saberkun. Here is what I did, in case someone finds it useful:
python3 $IMAGE_CLASSIFICATION/train.py --experiment=resnet_imagenet \
--config_file=$EXPERIMENT/imagenet_resnet50_gpu_custom.yaml \
--mode=train_and_eval --model_dir=$MODEL_DIR \
--params_override='runtime.num_gpus=1'
$MODEL_DIR is the path where checkpoints are saved.
$DATA_DIR is the path containing the TFRecords that I built from the tars downloaded from https://www.image-net.org, using this script: https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py. I had to change lines 350-351 because the script did not find my files. The final content is the following:
${DATA_DIR}/train-00000-of-01024
${DATA_DIR}/train-00001-of-01024
...
${DATA_DIR}/train-01023-of-01024
${DATA_DIR}/validation-00000-of-00128
${DATA_DIR}/validation-00001-of-00128
...
${DATA_DIR}/validation-00127-of-00128
And the config file imagenet_resnet50_gpu_custom.yaml is:
distribution_strategy: 'mirrored'
mixed_precision_dtype: 'float16'
loss_scale: 'dynamic'
task:
  model:
    num_classes: 1001
    input_size: [224, 224, 3]
    backbone:
      type: 'resnet'
      resnet:
        model_id: 50
  losses:
    l2_weight_decay: 0.0001
    one_hot: true
    label_smoothing: 0.1
  train_data:
    input_path: /data/imagenet/train*
    is_training: true
    global_batch_size: 256
    dtype: 'float16'
  validation_data:
    input_path: /data/imagenet/valid*
    is_training: false
    global_batch_size: 256
    dtype: 'float16'
    drop_remainder: false
trainer:
  train_steps: 625
  validation_steps: 25
  validation_interval: 625
  steps_per_loop: 625
  summary_interval: 625
  checkpoint_interval: 625
  optimizer_config:
    optimizer:
      type: 'sgd'
      sgd:
        momentum: 0.9
    learning_rate:
      type: 'stepwise'
      stepwise:
        boundaries: [18750, 37500, 50000]
        values: [0.8, 0.08, 0.008, 0.0008]
    warmup:
      type: 'linear'
      linear:
        warmup_steps: 3125
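A note on the step counts in the config above, in case it helps anyone adapting the batch size. The boundaries, warmup and 625-step intervals line up with 625 steps per epoch, which is what 1281167 training examples give at a global batch size of 2048, not 256. The sketch below recomputes the schedule for another batch size. It assumes the epoch-based values from the legacy config printed earlier (boundaries at epochs 30/60/80, 5 warmup epochs, 90 epochs, initial_lr 0.1 scaled by batch_size/256); the function name is hypothetical, and whether linear learning-rate scaling suits your setup is something to verify.

```python
NUM_TRAIN_EXAMPLES = 1281167  # 'num_examples' in the params printed earlier


def schedule_for_batch_size(global_batch_size, epochs=90, boundary_epochs=(30, 60, 80),
                            warmup_epochs=5, base_lr=0.1, base_batch_size=256):
    """Convert the epoch-based schedule into step counts for a given batch size."""
    steps_per_epoch = NUM_TRAIN_EXAMPLES // global_batch_size
    # Assumption: linear scaling, i.e. lr = base_lr * batch_size / 256.
    peak_lr = base_lr * global_batch_size / base_batch_size
    return {
        "steps_per_epoch": steps_per_epoch,
        "train_steps": epochs * steps_per_epoch,
        "boundaries": [e * steps_per_epoch for e in boundary_epochs],
        # Each boundary divides the learning rate by 10.
        "values": [peak_lr * 0.1 ** i for i in range(len(boundary_epochs) + 1)],
        "warmup_steps": warmup_epochs * steps_per_epoch,
    }


for batch_size in (2048, 256):
    print(batch_size, schedule_for_batch_size(batch_size))
```

With 2048 this reproduces the boundaries and warmup_steps listed in the YAML; with 256 every step count grows by roughly a factor of eight and the peak learning rate drops to 0.1.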
Is /data/imagenet/train a GCP bucket path? Should it be gs://data/imagenet/train? @arashwan
ImageNet needs to be downloaded manually as per its license, even if tfds is used: https://www.tensorflow.org/datasets/catalog/imagenet2012
Exactly, I used a local path. I downloaded the dataset manually and used the imagenet_to_gcs.py script only to generate the TFRecords, with the noupload flag. Is there an easier way of using the downloaded dataset with the train.py script? By the way, training now starts, but the system runs out of memory (OOM) very soon. It would be great to have the hardware requirements listed for each experiment.
@esparig I found using tfds much easier. Download the train/validation data to ~/tensorflow_datasets/downloads/manual (or $TFDS_DATA_DIR/downloads/manual), create the tfds-override.yaml file below (global_batch_size included to demonstrate reduced memory usage), and run with:
python $OFFICIAL/vision/beta/train.py \
--experiment=resnet_imagenet \
--config_file=$CONFIGS/experiments/image_classification/imagenet_resnet50_gpu.yaml \
--mode=train_and_eval \
--model_dir=/tmp/foo \
--params_override='tfds-override.yaml'
tfds-override.yaml
task:
  train_data:
    input_path: ''
    tfds_name: 'imagenet2012'
    tfds_split: 'train'
    global_batch_size: 2
  validation_data:
    input_path: ''
    tfds_name: 'imagenet2012'
    tfds_split: 'validation'
    global_batch_size: 2
From memory, I had to update tensorflow-datasets to the latest stable release.
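Before starting a run with the tfds route, it may be worth checking that the manually downloaded archives are where tfds is expected to look for them. This is a small sketch under the assumption that the two archive names seen elsewhere in this thread (ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar) are the ones required; the helper name is hypothetical and this is not part of tfds itself.

```python
from pathlib import Path

# Archive names as they appear in the directory listings in this thread.
EXPECTED_ARCHIVES = ("ILSVRC2012_img_train.tar", "ILSVRC2012_img_val.tar")


def missing_archives(manual_dir):
    """Return the expected archive names that are not present in `manual_dir`."""
    manual_dir = Path(manual_dir).expanduser()
    return [name for name in EXPECTED_ARCHIVES if not (manual_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_archives("~/tensorflow_datasets/downloads/manual")
    print("missing archives:", missing or "none")
```

If $TFDS_DATA_DIR is set, pass $TFDS_DATA_DIR/downloads/manual to the helper in place of the default path.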
Hello everyone, I tried using tfds as @jackd suggested, but it didn't work. I got an error that says "Not enough disk space", even though I have more than 200 GB available. Any further suggestions?
ubuntu@gpu-estibaliz:/hdd500/data/imagenet_tars/imagenet$ ll
total 151020536
drwxr-xr-x 6 root root 4096 Nov 17 11:53 ./
drwxr-xr-x 3 root root 4096 Nov 17 11:53 ../
drwxr-xr-x 4 2016 2016 4096 Jun 14 2012 ILSVRC2012_devkit_t12/
-rw-r--r-- 1 root root 2568145 Jun 15 2012 ILSVRC2012_devkit_t12.tar.gz
-rw-r--r-- 1 root root 147897477120 Jun 14 2012 ILSVRC2012_img_train.tar
-rw-r--r-- 1 root root 6744924160 Jun 14 2012 ILSVRC2012_img_val.tar
drwxr-xr-x 2 root root 4096 Sep 21 16:28 __pycache__/
drwxr-xr-x 2 root root 4096 Nov 17 11:28 experiments/
-rw-r--r-- 1 root root 11629 Sep 21 16:22 imagenet.py
drwxr-xr-x 2 root root 4096 Nov 17 11:32 model_checkpoints/
ubuntu@gpu-estibaliz:/hdd500/data/imagenet_tars/imagenet$ ls experiments/
custom_tfds.yaml gpu.yaml imagenet_resnet50_gpu.yaml imagenet_resnet50_gpu_custom.yaml
ubuntu@gpu-estibaliz:/hdd500/data/imagenet_tars/imagenet$ sudo docker run --net=host -it --gpus all -v /hdd500/data/imagenet_tars/imagenet:/root/tensorflow_datasets/downloads/manual manualresnet50 /bin/bash
________ _______________
___ __/__________________________________ ____/__ /________ __
__ / _ _ \_ __ \_ ___/ __ \_ ___/_ /_ __ /_ __ \_ | /| / /
_ / / __/ / / /(__ )/ /_/ / / _ __/ _ / / /_/ /_ |/ |/ /
/_/ \___//_/ /_//____/ \____//_/ /_/ /_/ \____/____/|__/
WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.
To avoid this, run the container by specifying your user's userid:
$ docker run -u $(id -u):$(id -g) args...
root@gpu-estibaliz:/# ls /root/tensorflow_datasets/downloads/manual/
ILSVRC2012_devkit_t12 ILSVRC2012_devkit_t12.tar.gz ILSVRC2012_img_train.tar ILSVRC2012_img_val.tar __pycache__ experiments imagenet.py model_checkpoints
root@gpu-estibaliz:/# python3 /usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py --experiment=resnet_imagenet --config_file=/root/tensorflow_datasets/downloads/manual/experiments/custom_tfds.yaml --mode=train_and_eval --model_dir=/root/tensorflow_datasets/downloads/manual/model_checkpoints
2021-11-17 11:55:13.435231: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.449271: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.449582: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I1117 11:55:13.472259 140716787398464 train_utils.py:292] Final experiment parameters:
{'runtime': {'all_reduce_alg': None,
'batchnorm_spatial_persistent': False,
'dataset_num_private_threads': None,
'default_shard_dim': -1,
'distribution_strategy': 'mirrored',
'enable_xla': False,
'gpu_thread_mode': None,
'loss_scale': 'dynamic',
'mixed_precision_dtype': 'float16',
'num_cores_per_replica': 1,
'num_gpus': 1,
'num_packs': 1,
'per_gpu_thread_count': 0,
'run_eagerly': False,
'task_index': -1,
'tpu': None,
'tpu_enable_xla_dynamic_padder': None,
'worker_hosts': None},
'task': {'evaluation': {'top_k': 5},
'init_checkpoint': None,
'init_checkpoint_modules': 'all',
'losses': {'l2_weight_decay': 0.0001,
'label_smoothing': 0.1,
'one_hot': True},
'model': {'add_head_batch_norm': False,
'backbone': {'resnet': {'depth_multiplier': 1.0,
'model_id': 50,
'replace_stem_max_pool': False,
'resnetd_shortcut': False,
'se_ratio': 0.0,
'stem_type': 'v0',
'stochastic_depth_drop_rate': 0.0},
'type': 'resnet'},
'dropout_rate': 0.0,
'input_size': [224, 224, 3],
'norm_activation': {'activation': 'relu',
'norm_epsilon': 1e-05,
'norm_momentum': 0.9,
'use_sync_bn': False},
'num_classes': 1001},
'model_output_keys': [],
'train_data': {'aug_policy': None,
'aug_rand_hflip': True,
'aug_type': None,
'block_length': 1,
'cache': False,
'cycle_length': 10,
'decode_jpeg_only': True,
'deterministic': None,
'drop_remainder': True,
'dtype': 'float16',
'enable_tf_data_service': False,
'file_type': 'tfrecord',
'global_batch_size': 256,
'image_field_key': 'image/encoded',
'input_path': '',
'is_multilabel': False,
'is_training': True,
'label_field_key': 'image/class/label',
'randaug_magnitude': 10,
'seed': None,
'sharding': True,
'shuffle_buffer_size': 10000,
'tf_data_service_address': None,
'tf_data_service_job_name': None,
'tfds_as_supervised': False,
'tfds_data_dir': '',
'tfds_name': 'imagenet2012',
'tfds_skip_decoding_feature': '',
'tfds_split': 'train'},
'validation_data': {'aug_policy': None,
'aug_rand_hflip': True,
'aug_type': None,
'block_length': 1,
'cache': False,
'cycle_length': 10,
'decode_jpeg_only': True,
'deterministic': None,
'drop_remainder': False,
'dtype': 'float16',
'enable_tf_data_service': False,
'file_type': 'tfrecord',
'global_batch_size': 256,
'image_field_key': 'image/encoded',
'input_path': '',
'is_multilabel': False,
'is_training': False,
'label_field_key': 'image/class/label',
'randaug_magnitude': 10,
'seed': None,
'sharding': True,
'shuffle_buffer_size': 10000,
'tf_data_service_address': None,
'tf_data_service_job_name': None,
'tfds_as_supervised': False,
'tfds_data_dir': '',
'tfds_name': 'imagenet2012',
'tfds_skip_decoding_feature': '',
'tfds_split': 'validation'}},
'trainer': {'allow_tpu_summary': False,
'best_checkpoint_eval_metric': '',
'best_checkpoint_export_subdir': '',
'best_checkpoint_metric_comp': 'higher',
'checkpoint_interval': 625,
'continuous_eval_timeout': 3600,
'eval_tf_function': True,
'eval_tf_while_loop': False,
'loss_upper_bound': 1000000.0,
'max_to_keep': 5,
'optimizer_config': {'ema': None,
'learning_rate': {'stepwise': {'boundaries': [18750,
37500,
50000],
'name': 'PiecewiseConstantDecay',
'offset': 0,
'values': [0.8,
0.08,
0.008,
0.0008]},
'type': 'stepwise'},
'optimizer': {'sgd': {'clipnorm': None,
'clipvalue': None,
'decay': 0.0,
'global_clipnorm': None,
'momentum': 0.9,
'name': 'SGD',
'nesterov': False},
'type': 'sgd'},
'warmup': {'linear': {'name': 'linear',
'warmup_learning_rate': 0,
'warmup_steps': 3125},
'type': 'linear'}},
'recovery_begin_steps': 0,
'recovery_max_trials': 0,
'steps_per_loop': 625,
'summary_interval': 625,
'train_steps': 56250,
'train_tf_function': True,
'train_tf_while_loop': True,
'validation_interval': 625,
'validation_steps': 25,
'validation_summary_subdir': 'validation'}}
I1117 11:55:13.474529 140716787398464 train_utils.py:303] Saving experiment configuration to /root/tensorflow_datasets/downloads/manual/model_checkpoints/params.yaml
2021-11-17 11:55:13.493205: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: Tesla V100-PCIE-32GB, compute capability 7.0
I1117 11:55:13.493584 140716787398464 device_compatibility_check.py:121] Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: Tesla V100-PCIE-32GB, compute capability 7.0
2021-11-17 11:55:13.494813: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-17 11:55:13.495692: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.496071: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.496370: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.416536: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.416872: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.417060: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.417290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30997 MB memory: -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1117 11:55:15.080295 140716787398464 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1117 11:55:15.082595 140716787398464 train_utils.py:214] Running default trainer.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.146004 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.149153 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.152038 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.152982 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.159561 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.162897 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.428227 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.430712 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.434357 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.435827 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2021-11-17 11:55:17.963453: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Failed precondition: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
I1117 11:55:18.894689 140716787398464 dataset_info.py:443] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.1.0
I1117 11:55:19.755020 140716787398464 dataset_info.py:358] Load dataset info from /tmp/tmpnvfqzrkttfds
I1117 11:55:19.761893 140716787398464 dataset_info.py:413] Field info.description from disk and from code do not match. Keeping the one from code.
I1117 11:55:19.762345 140716787398464 dataset_info.py:413] Field info.module_name from disk and from code do not match. Keeping the one from code.
I1117 11:55:19.762815 140716787398464 dataset_builder.py:400] Generating dataset imagenet2012 (/root/tensorflow_datasets/imagenet2012/5.1.0)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 70, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 63, in main
model_dir=model_dir)
File "/usr/local/lib/python3.6/dist-packages/official/core/train_lib.py", line 78, in run_experiment
params, model_dir))
File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/usr/local/lib/python3.6/dist-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/usr/local/lib/python3.6/dist-packages/official/core/train_utils.py", line 225, in create_trainer
checkpoint_exporter=checkpoint_exporter)
File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/usr/local/lib/python3.6/dist-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/usr/local/lib/python3.6/dist-packages/official/core/base_trainer.py", line 259, in __init__
self.task.build_inputs, self.config.task.train_data)
File "/usr/local/lib/python3.6/dist-packages/official/core/base_trainer.py", line 159, in distribute_dataset
*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/orbit/utils/common.py", line 85, in make_distributed_dataset
return strategy.distribute_datasets_from_function(dataset_fn)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1161, in distribute_datasets_from_function
dataset_fn, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 589, in _distribute_datasets_from_function
options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 169, in get_distributed_datasets_from_function
input_contexts, dataset_fn, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1579, in __init__
input_contexts, self._input_workers, dataset_fn))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 2327, in _create_datasets_from_function_with_input_context
dataset = dataset_fn(ctx)
File "/usr/local/lib/python3.6/dist-packages/orbit/utils/common.py", line 83, in dataset_fn
return dataset_or_fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/tasks/image_classification.py", line 119, in build_inputs
dataset = reader.read(input_context=input_context)
File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 415, in read
self._tfds_builder)
File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 335, in _read_decode_and_parse_dataset
dataset = self._read_tfds(input_context)
File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 268, in _read_tfds
self._tfds_builder.download_and_prepare()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 409, in download_and_prepare
self.info.dataset_size,
OSError: Not enough disk space. Needed: 155.84 GiB (download: Unknown size, generated: 155.84 GiB)
In call to configurable 'Trainer' (<class 'official.core.base_trainer.Trainer'>)
In call to configurable 'create_trainer' (<function create_trainer at 0x7ffaa809a1e0>)
root@gpu-estibaliz:/# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 79G 12G 63G 16% /
tmpfs 64M 0 64M 0% /dev
tmpfs 7.7G 0 7.7G 0% /sys/fs/cgroup
shm 64M 0 64M 0% /dev/shm
/dev/vdb 79G 12G 63G 16% /etc/hosts
/dev/vdd1 492G 266G 201G 57% /root/tensorflow_datasets/downloads/manual
tmpfs 7.7G 12K 7.7G 1% /proc/driver/nvidia
/dev/vda1 14G 3.6G 9.9G 27% /usr/bin/nvidia-smi
tmpfs 1.6G 996K 1.6G 1% /run/nvidia-persistenced/socket
udev 7.7G 0 7.7G 0% /dev/nvidia0
tmpfs 7.7G 0 7.7G 0% /proc/acpi
tmpfs 7.7G 0 7.7G 0% /proc/scsi
tmpfs 7.7G 0 7.7G 0% /sys/firmware
"ValueError: imagenet-2012-tfrecord/train* does not match any files."
Folder structure:
/home/ubuntu/tensorflow_datasets/downloads/
├── manual
│ ├── ILSVRC2012_img_train.tar
│ └── ILSVRC2012_img_val.tar
The error "ValueError: imagenet-2012-tfrecord/train* does not match any files" means you are not using tfds. We put a placeholder file path there because we cannot host the ImageNet dataset under its policy; the error is asking you to preprocess ImageNet into TFRecords.
To use tfds, please set the tfds_* fields (tfds_name, tfds_split, ...) in the config: https://github.com/tensorflow/models/blob/master/official/core/config_definitions.py#L85
Thanks @saberkun! With the tfds_ fields set, train.py now starts! However, I ran into "OSError: Not enough disk space. Needed: 155.84 GiB (download: Unknown size, generated: 155.84 GiB)".
The tfds method seems to mis-estimate the available disk space (the check appears to go through shutil). I took the longer route, converting the JPEGs to TFRecords, and was able to run ResNet-50 on imagenet2012.
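A possible explanation, judging only from the log and the df output earlier in the thread: the dataset is generated under /root/tensorflow_datasets/imagenet2012/5.1.0, which sits on the overlay filesystem with 63G available, while the 201G of free space belongs to the separate mount at downloads/manual. The sketch below reports free space per directory with shutil.disk_usage so the two locations can be compared. The helper names are hypothetical, and that tfds performs its check this way is an assumption. If this is the cause, pointing the tfds_data_dir field (visible in the printed config) at a directory on the larger disk might help, though that is untested here.

```python
import shutil
from pathlib import Path

GIB = 1024 ** 3


def free_gib(path):
    """Free space in GiB on the filesystem holding `path` (or its nearest existing parent)."""
    p = Path(path).expanduser().resolve()
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free / GIB


def has_enough_space(path, needed_gib):
    """True if the filesystem holding `path` has at least `needed_gib` GiB free."""
    return free_gib(path) >= needed_gib


if __name__ == "__main__":
    # 155.84 GiB is the size quoted in the OSError above.
    for d in ("~/tensorflow_datasets", "~/tensorflow_datasets/downloads/manual"):
        print(d, f"{free_gib(d):.1f} GiB free,", "enough:", has_enough_space(d, 155.84))
```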
Hi @esparig,
Please find the latest documentation for image classification using tfds, and please also check the other vision tutorials.
Thanks
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.
Prerequisites
Please answer the following question for yourself before submitting an issue.
1. The entire URL of the documentation with the issue
https://github.com/tensorflow/models/tree/master/official/vision/image_classification
2. Describe the issue
I can't reproduce the examples provided in the documentation. These are the steps I'm following:
a) sudo docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu /bin/bash
b) python3 -m pip install --upgrade pip
c) pip install tf-models-official
d) download config files using curl (configs/examples/resnet/imagenet/gpu.yaml and configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml)
e) execute the code provided:
As a result I'm getting:
Changing the command to:
I'm getting:
I don't know how to fix this. I think the documentation is not very clear; please help. I wouldn't mind using the new code base in beta, but it has even less documentation.