Closed missximon closed 2 months ago
@sgpyc can you advise?
The "OutOfRangeError" suggests that maybe something in the code is still assuming >200 classes. Is there a reason why you can't use the full dataset that we use for MLPerf runs? I'm not sure we have engineering bandwidth to support out-of-scope use cases like this.
Closing because the benchmark has been retired.
Hi, I am using the tiny-imagenet-200 dataset to train the ResNet model, but I ran into the following problem:
Traceback (most recent call last):
File "./resnet_ctl_imagenet_main.py", line 268, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "./resnet_ctl_imagenet_main.py", line 261, in main
stats = run(flags.FLAGS)
File "./resnet_ctl_imagenet_main.py", line 243, in run
resnet_controller.train(evaluate=not flags_obj.skip_eval)
File "/home/siwei.zm/mlperf/training/image_classification/tensorflow2/tf2_common/training/controller.py", line 258, in train
train_outputs = self.train_fn(steps_per_loop)
File "/home/siwei.zm/mlperf/training/image_classification/tensorflow2/tf2_common/training/standard_runnable.py", line 70, in train
self.train_loop_fn(self.train_iter, num_steps)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 888, in _call
return self._stateless_fn(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[{{node while/body/_1/while/IteratorGetNext}}]] [Op:__inference_loop_fn_24053]
Function call stack:
loop_fn
I ran the code with the following command:
python3 ./resnet_ctl_imagenet_main.py \
  --base_learning_rate=8.5 \
  --batch_size=1 \
  --clean \
  --data_dir=/home/siwei.zm/mlperf/dataset/miniImageNet/tiny-imagenet-200/train \
  --datasets_num_private_threads=1 \
  --dtype=fp16 \
  --device_warmup_steps=1 \
  --noenable_device_warmup \
  --enable_eager \
  --noenable_xla \
  --epochs_between_evals=1 \
  --noeval_dataset_cache \
  --eval_offset_epochs=1 \
  --eval_prefetch_batchs=1 \
  --label_smoothing=0.1 \
  --lars_epsilon=0 \
  --log_steps=1 \
  --lr_schedule=polynomial \
  --model_dir=/home/siwei.zm/mlperf/model/ \
  --momentum=0.9 \
  --num_accumulation_steps=1 \
  --num_classes=200 \
  --num_gpus=1 \
  --optimizer=LARS \
  --noreport_accuracy_metrics \
  --single_l2_loss_op \
  --noskip_eval \
  --steps_per_loop=100000 \
  --target_accuracy=0.759 \
  --notf_data_experimental_slack \
  --tf_gpu_thread_mode=gpu_private \
  --notrace_warmup \
  --train_epochs=1 \
  --notraining_dataset_cache \
  --training_prefetch_batchs=1 \
  --nouse_synthetic_data \
  --warmup_epochs=1 \
  --weight_decay=0.0002
My host has 8 T4 GPUs with 16 GB each.
I am not very experienced with this. Where is the problem? Can anyone help me? Thanks very much!
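For anyone hitting the same message: "OutOfRangeError: End of sequence" at an IteratorGetNext node generally means the input iterator ran out of elements before the training loop had finished the steps it was asked to run. One thing worth checking is whether the data actually read from --data_dir can supply --steps_per_loop batches in a single loop. The sketch below is only a sanity check in plain Python, not code from the benchmark. The helper names are hypothetical, and the 100,000 figure assumes the standard tiny-imagenet-200 training split (200 classes x 500 images). If the input pipeline finds fewer readable examples than that, for instance because the files are not in the format the script expects, the iterator would run dry even earlier.

```python
import math


def batches_available(num_examples, batch_size, drop_remainder=True):
    """Number of batches a single pass over the data can yield."""
    if drop_remainder:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


def loop_fits(num_examples, batch_size, steps_per_loop, passes=1):
    """True if `steps_per_loop` steps can be served without exhausting the data.

    `passes` is how many times the dataset is repeated (hypothetical knob;
    a pipeline that repeats indefinitely never runs out).
    """
    return steps_per_loop <= passes * batches_available(num_examples, batch_size)


# Assumption: tiny-imagenet-200 train split has 200 * 500 = 100,000 images.
TINY_IMAGENET_TRAIN = 200 * 500

# The command above: batch_size=1, steps_per_loop=100000.
print(loop_fits(TINY_IMAGENET_TRAIN, batch_size=1, steps_per_loop=100_000))

# A larger batch with the same steps_per_loop no longer fits in one pass.
print(loop_fits(TINY_IMAGENET_TRAIN, batch_size=8, steps_per_loop=100_000))
```

With these numbers the first check passes with no margin at all, so any example the pipeline fails to read would be enough to trigger the error. Lowering --steps_per_loop well below the number of batches per epoch is a cheap way to test this hypothesis.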