mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Image classification reference implementation is failing on Ubuntu 22.04 #643

Open arjunsuresh opened 1 year ago

arjunsuresh commented 1 year ago

I'm trying to run the image classification reference implementation on Ubuntu 22.04 with Python 3.10 and TensorFlow 2.12. It currently fails with the error below.

TypeError: PolynomialDecayWithWarmup.__call__() missing 1 required positional argument: 'step'
I0508 12:22:25.691070 140615434081856 coordinator.py:213] Error reported to Coordinator: PolynomialDecayWithWarmup.__call__() missing 1 required positional argument: 'step'

Detailed error

python3 ./resnet_ctl_imagenet_main.py --base_learning_rate=8.5 --batch_size=1024 --clean --data_dir=../../../imagenet/tf_records/ --datasets_num_private_threads=32 --dtype=fp32 --device_warmup_steps=1 --noenable_device_warmup --enable_eager --noenable_xla --epochs_between_evals=4 --noeval_dataset_cache --eval_offset_epochs=2 --eval_prefetch_batchs=192 --label_smoothing=0.1 --lars_epsilon=0 --log_steps=125 --lr_schedule=polynomial --model_dir=outputs --momentum=0.9 --num_accumulation_steps=2 --num_classes=1000 --num_gpus=1 --optimizer=LARS --noreport_accuracy_metrics --single_l2_loss_op --noskip_eval --steps_per_loop=1252 --target_accuracy=0.759 --notf_data_experimental_slack --tf_gpu_thread_mode=gpu_private --notrace_warmup --train_epochs=41 --notraining_dataset_cache --training_prefetch_batchs=128 --nouse_synthetic_data --warmup_epochs=5 --weight_decay=0.0002
2023-05-08 12:22:23.122980: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-05-08 12:22:23.137739: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
:::MLL 1683544944.150 cache_clear: {"value": true, "metadata": {"lineno": 114, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.149741 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 cache_clear: {"value": true, "metadata": {"lineno": 114, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.150 init_start: {"value": null, "metadata": {"lineno": 115, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.149899 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 init_start: {"value": null, "metadata": {"lineno": 115, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.150 submission_benchmark: {"value": "resnet", "metadata": {"lineno": 116, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.150007 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 submission_benchmark: {"value": "resnet", "metadata": {"lineno": 116, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.150 submission_division: {"value": "closed", "metadata": {"lineno": 117, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.150108 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 submission_division: {"value": "closed", "metadata": {"lineno": 117, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.150 submission_org: {"value": "google", "metadata": {"lineno": 118, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.150207 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 submission_org: {"value": "google", "metadata": {"lineno": 118, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.150 submission_platform: {"value": "gpu-v100-1", "metadata": {"lineno": 119, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.150308 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 submission_platform: {"value": "gpu-v100-1", "metadata": {"lineno": 119, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.150 submission_status: {"value": "cloud", "metadata": {"lineno": 122, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.150408 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 submission_status: {"value": "cloud", "metadata": {"lineno": 122, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.150436 140628087959552 common.py:617] Module ./resnet_ctl_imagenet_main.py:
I0508 12:22:24.150570 140628087959552 common.py:620]     flags_obj.use_tf_function = True
I0508 12:22:24.150585 140628087959552 common.py:620]     flags_obj.single_l2_loss_op = True
I0508 12:22:24.150598 140628087959552 common.py:620]     flags_obj.cache_decoded_image = False
I0508 12:22:24.150610 140628087959552 common.py:620]     flags_obj.enable_device_warmup = False
I0508 12:22:24.150622 140628087959552 common.py:620]     flags_obj.device_warmup_steps = 1
I0508 12:22:24.150634 140628087959552 common.py:620]     flags_obj.num_replicas = 32
I0508 12:22:24.150646 140628087959552 common.py:617] Module absl.app:
I0508 12:22:24.150659 140628087959552 common.py:620]     flags_obj.run_with_pdb = False
I0508 12:22:24.150671 140628087959552 common.py:620]     flags_obj.pdb_post_mortem = False
I0508 12:22:24.150684 140628087959552 common.py:620]     flags_obj.pdb = False
I0508 12:22:24.150696 140628087959552 common.py:620]     flags_obj.run_with_profiling = False
I0508 12:22:24.150707 140628087959552 common.py:620]     flags_obj.profile_file = None
I0508 12:22:24.150719 140628087959552 common.py:620]     flags_obj.use_cprofile_for_profiling = True
I0508 12:22:24.150730 140628087959552 common.py:620]     flags_obj.only_check_args = False
I0508 12:22:24.150742 140628087959552 common.py:620]     flags_obj.help = False
I0508 12:22:24.150753 140628087959552 common.py:620]     flags_obj.helpshort = False
I0508 12:22:24.150764 140628087959552 common.py:620]     flags_obj.helpfull = False
I0508 12:22:24.150775 140628087959552 common.py:620]     flags_obj.helpxml = False
I0508 12:22:24.150787 140628087959552 common.py:617] Module absl.logging:
I0508 12:22:24.150799 140628087959552 common.py:620]     flags_obj.logtostderr = False
I0508 12:22:24.150810 140628087959552 common.py:620]     flags_obj.alsologtostderr = False
I0508 12:22:24.150823 140628087959552 common.py:620]     flags_obj.log_dir = 
I0508 12:22:24.150835 140628087959552 common.py:620]     flags_obj.verbosity = 0
I0508 12:22:24.150847 140628087959552 common.py:620]     flags_obj.logger_levels = {}
I0508 12:22:24.150860 140628087959552 common.py:620]     flags_obj.stderrthreshold = fatal
I0508 12:22:24.150871 140628087959552 common.py:620]     flags_obj.showprefixforinfo = True
I0508 12:22:24.150883 140628087959552 common.py:617] Module absl.testing.absltest:
I0508 12:22:24.150895 140628087959552 common.py:620]     flags_obj.test_srcdir = 
I0508 12:22:24.150907 140628087959552 common.py:620]     flags_obj.test_tmpdir = /tmp/absl_testing
I0508 12:22:24.150918 140628087959552 common.py:620]     flags_obj.test_random_seed = 301
I0508 12:22:24.150929 140628087959552 common.py:620]     flags_obj.test_randomize_ordering_seed = 
I0508 12:22:24.150941 140628087959552 common.py:620]     flags_obj.xml_output_file = 
I0508 12:22:24.150952 140628087959552 common.py:617] Module common:
I0508 12:22:24.150964 140628087959552 common.py:620]     flags_obj.enable_eager = True
I0508 12:22:24.150975 140628087959552 common.py:620]     flags_obj.skip_eval = False
I0508 12:22:24.150986 140628087959552 common.py:620]     flags_obj.set_learning_phase_to_train = True
I0508 12:22:24.150998 140628087959552 common.py:620]     flags_obj.explicit_gpu_placement = False
I0508 12:22:24.151009 140628087959552 common.py:620]     flags_obj.use_trivial_model = False
I0508 12:22:24.151023 140628087959552 common.py:620]     flags_obj.report_accuracy_metrics = False
I0508 12:22:24.151035 140628087959552 common.py:620]     flags_obj.lr_schedule = polynomial
I0508 12:22:24.151046 140628087959552 common.py:620]     flags_obj.enable_tensorboard = False
I0508 12:22:24.151057 140628087959552 common.py:620]     flags_obj.train_steps = None
I0508 12:22:24.151069 140628087959552 common.py:620]     flags_obj.profile_steps = None
I0508 12:22:24.151080 140628087959552 common.py:620]     flags_obj.batchnorm_spatial_persistent = True
I0508 12:22:24.151092 140628087959552 common.py:620]     flags_obj.enable_get_next_as_optional = False
I0508 12:22:24.151104 140628087959552 common.py:620]     flags_obj.enable_checkpoint_and_export = False
I0508 12:22:24.151115 140628087959552 common.py:620]     flags_obj.tpu = 
I0508 12:22:24.151127 140628087959552 common.py:620]     flags_obj.tpu_zone = 
I0508 12:22:24.151138 140628087959552 common.py:620]     flags_obj.steps_per_loop = 1252
I0508 12:22:24.151149 140628087959552 common.py:620]     flags_obj.use_tf_while_loop = True
I0508 12:22:24.151160 140628087959552 common.py:620]     flags_obj.use_tf_keras_layers = False
I0508 12:22:24.151171 140628087959552 common.py:620]     flags_obj.base_learning_rate = 8.5
I0508 12:22:24.151183 140628087959552 common.py:620]     flags_obj.optimizer = LARS
I0508 12:22:24.151194 140628087959552 common.py:620]     flags_obj.drop_train_remainder = True
I0508 12:22:24.151205 140628087959552 common.py:620]     flags_obj.drop_eval_remainder = False
I0508 12:22:24.151217 140628087959552 common.py:620]     flags_obj.label_smoothing = 0.1
I0508 12:22:24.151229 140628087959552 common.py:620]     flags_obj.num_classes = 1000
I0508 12:22:24.151244 140628087959552 common.py:620]     flags_obj.eval_offset_epochs = 2
I0508 12:22:24.151255 140628087959552 common.py:620]     flags_obj.target_accuracy = 0.759
I0508 12:22:24.151266 140628087959552 common.py:617] Module lars_util:
I0508 12:22:24.151278 140628087959552 common.py:620]     flags_obj.end_learning_rate = None
I0508 12:22:24.151288 140628087959552 common.py:620]     flags_obj.lars_epsilon = 0.0
I0508 12:22:24.151300 140628087959552 common.py:620]     flags_obj.warmup_epochs = 5.0
I0508 12:22:24.151311 140628087959552 common.py:620]     flags_obj.momentum = 0.9
I0508 12:22:24.151322 140628087959552 common.py:617] Module resnet_model:
I0508 12:22:24.151334 140628087959552 common.py:620]     flags_obj.weight_decay = 0.0002
I0508 12:22:24.151345 140628087959552 common.py:620]     flags_obj.num_accumulation_steps = 2
I0508 12:22:24.151357 140628087959552 common.py:617] Module resnet_runnable:
I0508 12:22:24.151368 140628087959552 common.py:620]     flags_obj.trace_warmup = False
I0508 12:22:24.151380 140628087959552 common.py:617] Module tensorflow.python.ops.parallel_for.pfor:
I0508 12:22:24.151391 140628087959552 common.py:620]     flags_obj.op_conversion_fallback_to_while_loop = True
I0508 12:22:24.151402 140628087959552 common.py:617] Module tensorflow.python.tpu.client.client:
I0508 12:22:24.151414 140628087959552 common.py:620]     flags_obj.runtime_oom_exit = True
I0508 12:22:24.151425 140628087959552 common.py:620]     flags_obj.hbm_oom_exit = True
I0508 12:22:24.151436 140628087959552 common.py:617] Module tensorflow.python.tpu.tensor_tracer_flags:
I0508 12:22:24.151448 140628087959552 common.py:620]     flags_obj.delta_threshold = 0.5
I0508 12:22:24.151459 140628087959552 common.py:620]     flags_obj.tt_check_filter = False
I0508 12:22:24.151470 140628087959552 common.py:620]     flags_obj.tt_single_core_summaries = False
I0508 12:22:24.151482 140628087959552 common.py:617] Module tf2_common.utils.flags._base:
I0508 12:22:24.151493 140628087959552 common.py:620]     flags_obj.data_dir = ../../../imagenet/tf_records/
I0508 12:22:24.151504 140628087959552 common.py:620]     flags_obj.model_dir = outputs
I0508 12:22:24.151515 140628087959552 common.py:620]     flags_obj.clean = True
I0508 12:22:24.151526 140628087959552 common.py:620]     flags_obj.train_epochs = 41
I0508 12:22:24.151538 140628087959552 common.py:620]     flags_obj.epochs_between_evals = 4
I0508 12:22:24.151549 140628087959552 common.py:620]     flags_obj.batch_size = 1024
I0508 12:22:24.151561 140628087959552 common.py:620]     flags_obj.num_gpus = 1
I0508 12:22:24.151572 140628087959552 common.py:620]     flags_obj.run_eagerly = False
I0508 12:22:24.151583 140628087959552 common.py:620]     flags_obj.distribution_strategy = mirrored
I0508 12:22:24.151594 140628087959552 common.py:617] Module tf2_common.utils.flags._benchmark:
I0508 12:22:24.151606 140628087959552 common.py:620]     flags_obj.benchmark_logger_type = BaseBenchmarkLogger
I0508 12:22:24.151617 140628087959552 common.py:620]     flags_obj.benchmark_test_id = None
I0508 12:22:24.151628 140628087959552 common.py:620]     flags_obj.log_steps = 125
I0508 12:22:24.151639 140628087959552 common.py:620]     flags_obj.benchmark_log_dir = None
I0508 12:22:24.151650 140628087959552 common.py:620]     flags_obj.gcp_project = None
I0508 12:22:24.151661 140628087959552 common.py:620]     flags_obj.bigquery_data_set = test_benchmark
I0508 12:22:24.151672 140628087959552 common.py:620]     flags_obj.bigquery_run_table = benchmark_run
I0508 12:22:24.151683 140628087959552 common.py:620]     flags_obj.bigquery_run_status_table = benchmark_run_status
I0508 12:22:24.151695 140628087959552 common.py:620]     flags_obj.bigquery_metric_table = benchmark_metric
I0508 12:22:24.151706 140628087959552 common.py:617] Module tf2_common.utils.flags._distribution:
I0508 12:22:24.151717 140628087959552 common.py:620]     flags_obj.worker_hosts = None
I0508 12:22:24.151728 140628087959552 common.py:620]     flags_obj.task_index = -1
I0508 12:22:24.151739 140628087959552 common.py:617] Module tf2_common.utils.flags._misc:
I0508 12:22:24.151751 140628087959552 common.py:620]     flags_obj.data_format = None
I0508 12:22:24.151762 140628087959552 common.py:617] Module tf2_common.utils.flags._performance:
I0508 12:22:24.151773 140628087959552 common.py:620]     flags_obj.use_synthetic_data = False
I0508 12:22:24.151784 140628087959552 common.py:620]     flags_obj.dtype = fp32
I0508 12:22:24.151796 140628087959552 common.py:620]     flags_obj.loss_scale = None
I0508 12:22:24.151807 140628087959552 common.py:620]     flags_obj.fp16_implementation = keras
I0508 12:22:24.151818 140628087959552 common.py:620]     flags_obj.all_reduce_alg = None
I0508 12:22:24.151829 140628087959552 common.py:620]     flags_obj.num_packs = 1
I0508 12:22:24.151840 140628087959552 common.py:620]     flags_obj.tf_gpu_thread_mode = gpu_private
I0508 12:22:24.151852 140628087959552 common.py:620]     flags_obj.per_gpu_thread_count = 0
I0508 12:22:24.151863 140628087959552 common.py:620]     flags_obj.datasets_num_private_threads = 32
I0508 12:22:24.151874 140628087959552 common.py:620]     flags_obj.training_dataset_cache = False
I0508 12:22:24.151886 140628087959552 common.py:620]     flags_obj.training_prefetch_batchs = 128
I0508 12:22:24.151897 140628087959552 common.py:620]     flags_obj.eval_dataset_cache = False
I0508 12:22:24.151908 140628087959552 common.py:620]     flags_obj.eval_prefetch_batchs = 192
I0508 12:22:24.151919 140628087959552 common.py:620]     flags_obj.tf_data_experimental_slack = False
I0508 12:22:24.151931 140628087959552 common.py:620]     flags_obj.enable_xla = False
I0508 12:22:24.151942 140628087959552 common.py:620]     flags_obj.force_v2_in_keras_compile = None
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0
W0508 12:22:24.153517 140628087959552 cross_device_ops.py:1382] Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0
2023-05-08 12:22:24.159656: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1780] (One-time warning): Not using XLA:CPU for cluster.

If you want XLA:CPU, do one of the following:

 - set the TF_XLA_FLAGS to include "--tf_xla_cpu_global_jit", or
 - set cpu_global_jit to true on this session's OptimizerOptions, or
 - use experimental_jit_scope, or
 - use tf.function(jit_compile=True).

To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a
proper command-line flag, not via TF_XLA_FLAGS).
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0508 12:22:24.161878 140628087959552 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
:::MLL 1683544944.162 global_batch_size: {"value": 1024, "metadata": {"lineno": 155, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.162150 140628087959552 mlp_log.py:80] :::MLL 1683544944.162 global_batch_size: {"value": 1024, "metadata": {"lineno": 155, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.162 train_samples: {"value": 1281167, "metadata": {"lineno": 156, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.162269 140628087959552 mlp_log.py:80] :::MLL 1683544944.162 train_samples: {"value": 1281167, "metadata": {"lineno": 156, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.162 eval_samples: {"value": 50000, "metadata": {"lineno": 158, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.162377 140628087959552 mlp_log.py:80] :::MLL 1683544944.162 eval_samples: {"value": 50000, "metadata": {"lineno": 158, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.162 model_bn_span: {"value": 1024, "metadata": {"lineno": 160, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.162482 140628087959552 mlp_log.py:80] :::MLL 1683544944.162 model_bn_span: {"value": 1024, "metadata": {"lineno": 160, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.162510 140628087959552 resnet_ctl_imagenet_main.py:169] Training 42 epochs, each epoch has 1251 steps, total steps: 52542; Eval 49 steps
:::MLL 1683544944.641 opt_name: {"value": "lars", "metadata": {"lineno": 101, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.640697 140628087959552 mlp_log.py:80] :::MLL 1683544944.641 opt_name: {"value": "lars", "metadata": {"lineno": 101, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.641 lars_epsilon: {"value": 0.0, "metadata": {"lineno": 103, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.640894 140628087959552 mlp_log.py:80] :::MLL 1683544944.641 lars_epsilon: {"value": 0.0, "metadata": {"lineno": 103, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.641 lars_opt_weight_decay: {"value": 0.0002, "metadata": {"lineno": 104, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641026 140628087959552 mlp_log.py:80] :::MLL 1683544944.641 lars_opt_weight_decay: {"value": 0.0002, "metadata": {"lineno": 104, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.641 lars_opt_base_learning_rate: {"value": 8.5, "metadata": {"lineno": 106, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641152 140628087959552 mlp_log.py:80] :::MLL 1683544944.641 lars_opt_base_learning_rate: {"value": 8.5, "metadata": {"lineno": 106, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.641 lars_opt_learning_rate_warmup_epochs: {"value": 5.0, "metadata": {"lineno": 108, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641273 140628087959552 mlp_log.py:80] :::MLL 1683544944.641 lars_opt_learning_rate_warmup_epochs: {"value": 5.0, "metadata": {"lineno": 108, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.641 lars_opt_end_learning_rate: {"value": 0.0001, "metadata": {"lineno": 110, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641392 140628087959552 mlp_log.py:80] :::MLL 1683544944.641 lars_opt_end_learning_rate: {"value": 0.0001, "metadata": {"lineno": 110, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.642 lars_opt_learning_rate_decay_steps: {"value": 45037, "metadata": {"lineno": 115, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641583 140628087959552 mlp_log.py:80] :::MLL 1683544944.642 lars_opt_learning_rate_decay_steps: {"value": 45037, "metadata": {"lineno": 115, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.642 lars_opt_learning_rate_decay_poly_power: {"value": 2.0, "metadata": {"lineno": 117, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641716 140628087959552 mlp_log.py:80] :::MLL 1683544944.642 lars_opt_learning_rate_decay_poly_power: {"value": 2.0, "metadata": {"lineno": 117, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.642 lars_opt_momentum: {"value": 0.9, "metadata": {"lineno": 119, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641841 140628087959552 mlp_log.py:80] :::MLL 1683544944.642 lars_opt_momentum: {"value": 0.9, "metadata": {"lineno": 119, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.687 init_stop: {"value": null, "metadata": {"lineno": 223, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.686844 140628087959552 mlp_log.py:80] :::MLL 1683544944.687 init_stop: {"value": null, "metadata": {"lineno": 223, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.687 run_start: {"value": null, "metadata": {"lineno": 232, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.687031 140628087959552 mlp_log.py:80] :::MLL 1683544944.687 run_start: {"value": null, "metadata": {"lineno": 232, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.687 block_start: {"value": null, "metadata": {"first_epoch_num": 1, "epoch_count": 2, "lineno": 233, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.687142 140628087959552 mlp_log.py:80] :::MLL 1683544944.687 block_start: {"value": null, "metadata": {"first_epoch_num": 1, "epoch_count": 2, "lineno": 233, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.689408 140628087959552 controller.py:247] Train at step 0 of 52542
I0508 12:22:24.689451 140628087959552 controller.py:251] Entering training loop with 1251 steps, at step 0 of 52542
WARNING:tensorflow:From /mored/home/arjun/training/image_classification/tensorflow2/tf2_common/training/utils.py:139: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0508 12:22:24.689529 140628087959552 deprecation.py:364] From /mored/home/arjun/training/image_classification/tensorflow2/tf2_common/training/utils.py:139: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
I0508 12:22:24.691895 140628087959552 imagenet_preprocessing.py:338] Sharding the dataset: input_pipeline_id=0 num_input_pipelines=1
W0508 12:22:24.699523 140628087959552 options.py:599] options.experimental_threading is deprecated. Use options.threading instead.
I0508 12:22:24.700093 140628087959552 imagenet_preprocessing.py:104] datasets_num_private_threads: 32
I0508 12:22:24.700675 140628087959552 imagenet_preprocessing.py:118] Num classes: 1000
I0508 12:22:24.700706 140628087959552 imagenet_preprocessing.py:119] One hot: True
2023-05-08 12:22:24.965010: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [1024]
     [[{{node Placeholder/_0}}]]
2023-05-08 12:22:24.965187: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [1024]
     [[{{node Placeholder/_0}}]]
2023-05-08 12:22:25.004279: W tensorflow/core/framework/dataset.cc:807] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
2023-05-08 12:22:25.004420: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype variant
     [[{{node Placeholder/_0}}]]
2023-05-08 12:22:25.028428: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'num_steps' with dtype int32
     [[{{node num_steps}}]]
WARNING:tensorflow:From /home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py:458: calling function (from tensorflow.python.eager.polymorphic_function.polymorphic_function) with experimental_compile is deprecated and will be removed in a future version.
Instructions for updating:
experimental_compile is deprecated, use jit_compile instead
W0508 12:22:25.188507 140628087959552 deprecation.py:569] From /home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py:458: calling function (from tensorflow.python.eager.polymorphic_function.polymorphic_function) with experimental_compile is deprecated and will be removed in a future version.
Instructions for updating:
experimental_compile is deprecated, use jit_compile instead
INFO:tensorflow:Error reported to Coordinator: PolynomialDecayWithWarmup.__call__() missing 1 required positional argument: 'step'
Traceback (most recent call last):
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/training/coordinator.py", line 293, in stop_on_exception
    yield
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/distribute/mirrored_run.py", line 387, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/tmp/__autograph_generated_file0rl97hcp.py", line 144, in _apply_grads_and_clear_for_each_replica
    ag__.converted_call(ag__.ld(self).optimizer.apply_gradients, (ag__.converted_call(ag__.ld(zip), (ag__.ld(replica_accum_grads), ag__.ld(self).training_vars), None, fscope_3),), None, fscope_3)
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
    return f(*args)
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 665, in apply_gradients
    apply_state = self._prepare(var_list)
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 947, in _prepare
    self._prepare_local(var_device, var_dtype, apply_state)
  File "/mored/home/arjun/training/image_classification/tensorflow2/lars_optimizer.py", line 114, in _prepare_local
    lr_t = self._get_hyper("learning_rate", var_dtype)
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 804, in _get_hyper
    value = value()
TypeError: PolynomialDecayWithWarmup.__call__() missing 1 required positional argument: 'step'
I0508 12:22:25.691070 140615434081856 coordinator.py:213] Error reported to Coordinator: PolynomialDecayWithWarmup.__call__() missing 1 required positional argument: 'step'

Daming-wang commented 1 year ago

Try downgrading TensorFlow to 2.4.x, along with the matching CUDA and cuDNN versions.
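For context on why a version change might matter here: the traceback above ends at `value = value()` inside `_get_hyper`, which suggests the optimizer treated the learning-rate schedule as a plain zero-argument callable instead of as a schedule object. One possible cause, which is an assumption and not confirmed in this thread, is that the schedule's base class is not the one the optimizer checks for in newer TensorFlow releases. The sketch below reproduces that failure pattern in plain Python without TensorFlow. `ExpectedScheduleBase`, `OtherScheduleBase`, and `get_hyper` are hypothetical stand-ins, not TensorFlow APIs, and the schedule body is a placeholder.

```python
class ExpectedScheduleBase:
    """Hypothetical stand-in for the schedule base class the optimizer checks."""


class OtherScheduleBase:
    """Hypothetical stand-in for a different, unrelated schedule base class."""


class PolynomialDecayWithWarmup(OtherScheduleBase):
    """Toy schedule: the real one computes warmup plus polynomial decay."""

    def __init__(self, base_lr):
        self.base_lr = base_lr

    def __call__(self, step):
        # Placeholder arithmetic, only here so the object is callable with a step.
        return self.base_lr / (1 + step)


def get_hyper(value):
    """Hypothetical simplification of the hyperparameter lookup pattern."""
    if isinstance(value, ExpectedScheduleBase):
        # Recognised schedules are returned as-is and called later with a step.
        return value
    if callable(value):
        # Anything else callable is invoked with no arguments.
        return value()
    return value


def reproduce():
    """Return the error message produced when the schedule is not recognised."""
    try:
        get_hyper(PolynomialDecayWithWarmup(8.5))
    except TypeError as err:
        return str(err)
    return None


print(reproduce())
```

If this reading is right, the schedule itself is fine and the mismatch is between the schedule's base class and the optimizer's base class, which would explain why an older TensorFlow release behaves differently.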

arjunsuresh commented 1 year ago

Thank you @Daming-wang for the suggestion. We'll try that, but for the current submission we'll be going with the NVIDIA code.

For the reference implementations, should we document the version requirements somewhere? A lot of people will be trying to run them.
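Besides documenting the requirement, the entry script could fail early with a clear message. A minimal sketch, assuming the 2.4.x series suggested above is the supported one (this thread does not confirm the exact supported range); `parse_version` and `is_supported` are hypothetical helper names, not part of the reference implementation.

```python
def parse_version(text):
    """Turn '2.12.0' into (2, 12, 0), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def is_supported(installed, required_major_minor=(2, 4)):
    """True when the installed version is in the assumed supported series."""
    return parse_version(installed)[:2] == required_major_minor


# The version strings below are the two mentioned in this thread.
for version in ("2.4.1", "2.12.0"):
    status = "ok" if is_supported(version) else "unsupported"
    print(f"TensorFlow {version}: {status}")
```

In the real script the installed version would come from `tf.__version__`, and an unsupported version could raise an error that names the expected series.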