Closed dingfen closed 7 months ago
Hi @dingfen ,
Could you please install the latest version with `pip install tf-models-official`?
The older versions are probably not compatible with other changes in the codebase. Please let us know after trying the latest version of Model Garden.
Thanks.
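For reference, here is a quick way to confirm which versions are actually installed, a minimal sketch using only the standard library:

```python
from importlib.metadata import version  # Python 3.8+

# Print the installed versions of the packages discussed in this thread;
# the names must match the pip distribution names exactly.
for pkg in ("tensorflow", "tf-models-official"):
    print(pkg, version(pkg))
```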
This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.
I'm facing the same problem with SpineNet. Did you find any solution? @dingfen
Hi, all! Sorry for the late reply. Following @laxmareddyp's suggestion, I updated my tensorflow & tf-models-official to version 2.13. But when I reran the command, I got the errors below:
```
Traceback (most recent call last):
  File "train.py", line 98, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "train.py", line 92, in main
    _run_experiment_with_preemption_recovery(params, model_dir)
  File "train.py", line 73, in _run_experiment_with_preemption_recovery
    raise e from None
  File "train.py", line 51, in _run_experiment_with_preemption_recovery
    train_lib.run_experiment(
  File "/usr/local/lib/python3.8/dist-packages/official/core/train_lib.py", line 357, in run_experiment
    return runner.run()
  File "/usr/local/lib/python3.8/dist-packages/official/core/train_lib.py", line 260, in run
    self.controller.train_and_evaluate(
  File "/usr/local/lib/python3.8/dist-packages/orbit/controller.py", line 381, in train_and_evaluate
    self.train(steps=num_steps, checkpoint_at_completion=False)
  File "/usr/local/lib/python3.8/dist-packages/orbit/controller.py", line 271, in train
    self._train_n_steps(num_steps)
  File "/usr/local/lib/python3.8/dist-packages/orbit/controller.py", line 502, in _train_n_steps
    train_output = self.trainer.train(num_steps_tensor)
  File "/usr/local/lib/python3.8/dist-packages/orbit/standard_runner.py", line 146, in train
    self._train_loop_fn(self._train_iter, num_steps)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnimplementedError: Graph execution error:

Detected at node 'classification_model/res_net/conv2d/Conv2D' defined at (most recent call last):
  File "train.py", line 98, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "train.py", line 92, in main
    _run_experiment_with_preemption_recovery(params, model_dir)
  ...
  File "/usr/local/lib/python3.8/dist-packages/keras/src/engine/base_layer.py", line 1150, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/keras/src/layers/convolutional/base_conv.py", line 290, in call
    outputs = self.convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.8/dist-packages/keras/src/layers/convolutional/base_conv.py", line 262, in convolution_op
    return tf.nn.convolution(
Node: 'classification_model/res_net/conv2d/Conv2D'

2 root error(s) found.
  (0) UNIMPLEMENTED: DNN library is not found.
	 [[{{node classification_model/res_net/conv2d/Conv2D}}]]
	 [[while/body/_1/while/NoOp/_39]]
  (1) UNIMPLEMENTED: DNN library is not found.
	 [[{{node classification_model/res_net/conv2d/Conv2D}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_loop_fn_32755]
```
Hi @dingfen ,
This does not look like a model-code error; it looks like a wrong cuDNN version in your GPU environment. Please check the following issue, where the same error has been resolved.
Thanks.
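For what it's worth, a quick way to check whether TensorFlow can see the GPU at all, and which CUDA/cuDNN versions the installed wheel was built against, is the minimal sketch below (note that `tf.sysconfig.get_build_info()` reports build-time versions, not what is installed on the machine):

```python
import tensorflow as tf

# Devices TensorFlow can actually use; an empty list means the GPU
# (or its driver/CUDA stack) is not visible to the runtime.
print(tf.config.list_physical_devices("GPU"))

# CUDA/cuDNN versions this TensorFlow wheel was built against; the
# libraries installed on the machine must be compatible with these.
info = tf.sysconfig.get_build_info()
print("CUDA:", info.get("cuda_version"), "cuDNN:", info.get("cudnn_version"))
```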
Hi @laxmareddyp , following that issue, I installed libcudnn8 like this:

```
apt-get update && apt-get install libcudnn8-dev=8.4.1.50-1+cuda11.6 libcudnn8=8.4.1.50-1+cuda11.6
```

But the above error still exists!
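One way to take the Model Garden code out of the picture is to run a single convolution on the GPU; if the cuDNN installation is broken, this minimal sketch should fail with the same `UNIMPLEMENTED: DNN library is not found` error:

```python
import tensorflow as tf

# A single Conv2D forward pass, forced onto the GPU. This exercises the
# same cuDNN code path as the failing node
# 'classification_model/res_net/conv2d/Conv2D' in the trace above.
with tf.device("/GPU:0"):
    x = tf.random.normal([1, 32, 32, 3])
    layer = tf.keras.layers.Conv2D(filters=8, kernel_size=3)
    print(layer(x).shape)  # (1, 30, 30, 8) when cuDNN loads correctly
```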
The train.py source code:
```python
# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""TensorFlow Model Garden Vision training driver."""

from absl import app
from absl import flags
from absl import logging
import gin
import tensorflow as tf

from official.common import distribute_utils
from official.common import flags as tfm_flags
from official.core import task_factory
from official.core import train_lib
from official.core import train_utils
from official.modeling import performance
from official.vision import registry_imports  # pylint: disable=unused-import
from official.vision.utils import summary_manager

FLAGS = flags.FLAGS


def _run_experiment_with_preemption_recovery(params, model_dir):
  """Runs experiment and tries to reconnect when encountering a preemption."""
  keep_training = True
  while keep_training:
    preemption_watcher = None
    try:
      distribution_strategy = distribute_utils.get_distribution_strategy(
          distribution_strategy=params.runtime.distribution_strategy,
          all_reduce_alg=params.runtime.all_reduce_alg,
          num_gpus=params.runtime.num_gpus,
          tpu_address=params.runtime.tpu)
      with distribution_strategy.scope():
        task = task_factory.get_task(params.task, logging_dir=model_dir)
      preemption_watcher = tf.distribute.experimental.PreemptionWatcher()

      train_lib.run_experiment(
          distribution_strategy=distribution_strategy,
          task=task,
          mode=FLAGS.mode,
          params=params,
          model_dir=model_dir,
          summary_manager=None,
          eval_summary_manager=summary_manager.maybe_build_eval_summary_manager(
              params=params, model_dir=model_dir
          ),
      )
      keep_training = False
    except tf.errors.OpError as e:
      if preemption_watcher and preemption_watcher.preemption_message:
        preemption_watcher.block_until_worker_exit()
        logging.info(
            'Some TPU workers had been preempted (message: %s), '
            'restarting training from the last checkpoint...',
            preemption_watcher.preemption_message)
        keep_training = True
      else:
        raise e from None


def main(_):
  gin.parse_config_files_and_bindings(FLAGS.gin_file, FLAGS.gin_params)
  params = train_utils.parse_configuration(FLAGS)
  model_dir = FLAGS.model_dir
  if 'train' in FLAGS.mode:
    # Pure eval modes do not output yaml files. Otherwise continuous eval job
    # may race against the train job for writing the same file.
    train_utils.serialize_config(params, model_dir)

  # Sets mixed_precision policy. Using 'mixed_float16' or 'mixed_bfloat16'
  # can have significant impact on model speeds by utilizing float16 in case
  # of GPUs, and bfloat16 in the case of TPUs. loss_scale takes effect only
  # when dtype is float16.
  if params.runtime.mixed_precision_dtype:
    performance.set_mixed_precision_policy(params.runtime.mixed_precision_dtype)

  _run_experiment_with_preemption_recovery(params, model_dir)
  train_utils.save_gin_config(FLAGS.mode, model_dir)


if __name__ == '__main__':
  tfm_flags.define_flags()
  flags.mark_flags_as_required(['experiment', 'mode', 'model_dir'])
  app.run(main)
```
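A side note on the mixed-precision block near the end of `main`: as far as I can tell, `performance.set_mixed_precision_policy` is a thin wrapper over Keras' global policy, so a minimal standalone equivalent (assuming `params.runtime.mixed_precision_dtype` is `'mixed_float16'`) looks like:

```python
import tensorflow as tf

# Compute in float16 while keeping variables in float32; on float16 GPUs
# a loss-scaling optimizer is needed to avoid underflow (and NaN losses).
tf.keras.mixed_precision.set_global_policy("mixed_float16")
print(tf.keras.mixed_precision.global_policy())  # <Policy "mixed_float16">
```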
CUDA version: V11.6.1, cuDNN version: 8.4.1.
Here is my pip version list:
```
jupyter-tensorboard            0.2.0
tensorboard                    2.13.0
tensorboard-data-server       0.7.2
tensorboard-plugin-wit         1.8.1
tensorflow                     2.13.1
tensorflow-addons              0.16.1
tensorflow-datasets            3.2.1
tensorflow-estimator           2.13.0
tensorflow-hub                 0.15.0
tensorflow-io-gcs-filesystem   0.34.0
tensorflow-metadata            1.7.0
tensorflow-model-optimization  0.7.5
tensorflow-text                2.13.0
tensorrt                       8.2.4.2
```
Hi @dingfen ,
Could you please check the tested build configurations for compatible versions in your environment setup? For reference, TF 2.13 is tested against CUDA 11.8 and cuDNN 8.6, while the environment above reports CUDA 11.6 and cuDNN 8.4.
Thanks.
This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.
This issue was closed due to lack of activity after being marked stale for past 7 days.
Prerequisites
Hi, for various reasons I'm still on TF2 r2.6.0, and I tried to train resnet_rs_imagenet on GPU. My Docker image is NVIDIA's nvcr.io/nvidia/tensorflow:21.12-tf2-py3, with some additional pip modules installed:
When I tried to train resnet_rs_imagenet, a RuntimeError occurred.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/blob/r2.6.0/official/core/
2. Describe the bug
In train_and_eval mode, I run the resnet_rs_imagenet model for 100 train_steps. Here is my detailed stack trace:
3. Steps to reproduce
First, download the Docker image: nvcr.io/nvidia/tensorflow:21.12-tf2-py3
Then, install the pip modules I listed above
I'm still on TF2 r2.6.0, and I tried to train resnet_rs_imagenet with the commands below:
4. Expected behavior
No error occurs.
5. Additional context
To find out why this error happened, I added some `tf.print()` calls in `official/core/base_trainer.py:415`. They print some helpful info about the function `task_train_step` and its returned loss value: it seems that the loss value became NaN after just a few steps.
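For anyone trying the same debugging trick: inside a `tf.function`, a plain `print()` only runs once at trace time, so `tf.print()` is the right tool for watching per-step values such as the loss. A minimal self-contained sketch:

```python
import tensorflow as tf

# tf.print executes on every call, even inside a compiled tf.function,
# which makes it suitable for spotting the step where the loss goes NaN.
@tf.function
def debug_step(loss):
    tf.print("loss:", loss, "is_nan:", tf.math.is_nan(loss))
    return loss

debug_step(tf.constant(float("nan")))  # prints: loss: nan is_nan: 1
```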
Also, my detailed parameters are shown below in case you need them:
6. System information