mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

OutOfRangeError: End of sequence #521

Closed missximon closed 2 months ago

missximon commented 2 years ago

Hi, I am using the tiny-imagenet-200 dataset to train the ResNet model, but I ran into the following problem:

Traceback (most recent call last):
  File "./resnet_ctl_imagenet_main.py", line 268, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "./resnet_ctl_imagenet_main.py", line 261, in main
    stats = run(flags.FLAGS)
  File "./resnet_ctl_imagenet_main.py", line 243, in run
    resnet_controller.train(evaluate=not flags_obj.skip_eval)
  File "/home/siwei.zm/mlperf/training/image_classification/tensorflow2/tf2_common/training/controller.py", line 258, in train
    train_outputs = self.train_fn(steps_per_loop)
  File "/home/siwei.zm/mlperf/training/image_classification/tensorflow2/tf2_common/training/standard_runnable.py", line 70, in train
    self.train_loop_fn(self.train_iter, num_steps)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
  [[{{node while/body/_1/while/IteratorGetNext}}]] [Op:__inference_loop_fn_24053]

Function call stack:
loop_fn

I ran the code with this command:

python3 ./resnet_ctl_imagenet_main.py \
  --base_learning_rate=8.5 \
  --batch_size=1 \
  --clean \
  --data_dir=/home/siwei.zm/mlperf/dataset/miniImageNet/tiny-imagenet-200/train \
  --datasets_num_private_threads=1 \
  --dtype=fp16 \
  --device_warmup_steps=1 \
  --noenable_device_warmup \
  --enable_eager \
  --noenable_xla \
  --epochs_between_evals=1 \
  --noeval_dataset_cache \
  --eval_offset_epochs=1 \
  --eval_prefetch_batchs=1 \
  --label_smoothing=0.1 \
  --lars_epsilon=0 \
  --log_steps=1 \
  --lr_schedule=polynomial \
  --model_dir=/home/siwei.zm/mlperf/model/ \
  --momentum=0.9 \
  --num_accumulation_steps=1 \
  --num_classes=200 \
  --num_gpus=1 \
  --optimizer=LARS \
  --noreport_accuracy_metrics \
  --single_l2_loss_op \
  --noskip_eval \
  --steps_per_loop=100000 \
  --target_accuracy=0.759 \
  --notf_data_experimental_slack \
  --tf_gpu_thread_mode=gpu_private \
  --notrace_warmup \
  --train_epochs=1 \
  --notraining_dataset_cache \
  --training_prefetch_batchs=1 \
  --nouse_synthetic_data \
  --warmup_epochs=1 \
  --weight_decay=0.0002

My host has 8 T4 GPUs (16 GB each).

I am not very experienced with this. Where is the problem? Can anyone help me? Thanks very much!

johntran-nv commented 1 year ago

@sgpyc can you advise?

johntran-nv commented 1 year ago

The "OutOfRangeError" suggests that maybe something in the code is still assuming >200 classes. Is there a reason why you can't use the full dataset that we use for MLPerf runs? I'm not sure we have engineering bandwidth to support out-of-scope use cases like this.

hiwotadese commented 2 months ago

Closing because the benchmark has retired.
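
For anyone who lands on this thread later: an "End of sequence" error raised from an IteratorGetNext node generally means the input iterator ran out of elements before the training loop finished the steps it was asked to run. The sketch below is a plain-Python sanity check comparing the steps requested per loop with the batches one non-repeated pass over the data can supply. The function names and the example count are hypothetical and not taken from the benchmark code; only --batch_size=1 and --steps_per_loop=100000 come from the command above. It is a possible check under those assumptions, not a confirmed diagnosis of this issue.

```python
import math


def batches_available(num_examples, batch_size, drop_remainder=True):
    """Batches produced by one non-repeated pass over the data (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_remainder:
        # A trailing partial batch is discarded.
        return num_examples // batch_size
    # A trailing partial batch still counts as one batch.
    return math.ceil(num_examples / batch_size)


def loop_would_exhaust(num_examples, batch_size, steps_per_loop, drop_remainder=True):
    """True if one training loop requests more batches than a single pass provides."""
    return steps_per_loop > batches_available(num_examples, batch_size, drop_remainder)


# num_examples is an assumed placeholder: substitute the number of records
# the input pipeline actually reads from --data_dir.
print(loop_would_exhaust(num_examples=50000, batch_size=1, steps_per_loop=100000))
```

If the check returns True for the real record count, lowering --steps_per_loop or making sure the input pipeline actually reads the expected number of records would be the first things to look at.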