tensorflow / models

Models and examples built with TensorFlow

[TF2 r2.6.0] Train and Eval Model: resnet_rs_imagenet but "RuntimeError: The loss value is NaN" occurs #11121

Closed dingfen closed 7 months ago

dingfen commented 9 months ago

Prerequisites

Hi, for various reasons I'm still on TF2 r2.6.0, and I tried to train resnet_rs_imagenet on a GPU. My Docker image is NVIDIA's nvcr.io/nvidia/tensorflow:21.12-tf2-py3, with some additional pip packages installed:

gin-config                    0.5.0
sentencepiece                 0.1.97
seqeval                       1.2.2
pycocotools                   2.0.7
opencv-python                 4.6.0.66
sacrebleu                     1.2.10
jupyter-tensorboard           0.2.0
tensorboard                   2.6.0
tensorboard-data-server       0.6.1
tensorboard-plugin-wit        1.8.1
tensorflow                    2.6.3
tensorflow-addons             0.14.0
tensorflow-datasets           3.2.1
tensorflow-estimator          2.6.0
tensorflow-hub                0.12.0
tensorflow-metadata           1.5.0
tensorflow-model-optimization 0.7.3
tensorflow-text               2.6.0

I tried to train resnet_rs_imagenet, but a RuntimeError occurred.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/r2.6.0/official/core/

2. Describe the bug

In train_and_eval mode, I ran the resnet_rs_imagenet model with trainer.train_steps=100. Here is the full stack trace:

restoring or initializing model...
initialized model.
train | step:      0 | training until step 100...
Traceback (most recent call last):
  File "train.py", line 70, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train.py", line 58, in main
    train_lib.run_experiment(
  File "/usr/local/Tensorflow2/tensorflow_models/r2.6.0/official/core/train_lib.py", line 115, in run_experiment
    controller.train_and_evaluate(
  File "/usr/local/Tensorflow2/tensorflow_models/r2.6.0/orbit/controller.py", line 332, in train_and_evaluate
    self.train(steps=num_steps, checkpoint_at_completion=False)
  File "/usr/local/Tensorflow2/tensorflow_models/r2.6.0/orbit/controller.py", line 240, in train
    self._train_n_steps(num_steps)
  File "/usr/local/Tensorflow2/tensorflow_models/r2.6.0/orbit/controller.py", line 439, in _train_n_steps
    train_output = self.trainer.train(num_steps_tensor)
  File "/usr/local/Tensorflow2/tensorflow_models/r2.6.0/orbit/standard_runner.py", line 147, in train
    return self.train_loop_end()
  File "/usr/local/Tensorflow2/tensorflow_models/r2.6.0/official/core/base_trainer.py", line 388, in train_loop_end
    self._recovery.maybe_recover(self.train_loss.result().numpy(),
  File "/usr/local/Tensorflow2/tensorflow_models/r2.6.0/official/core/base_trainer.py", line 77, in maybe_recover
    raise RuntimeError(
RuntimeError: The loss value is NaN after training loop and it happens 1 times.
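
The "it happens 1 times" wording is consistent with the resolved configuration further down in this report, where recovery_max_trials is 0: with no retries allowed, the first NaN loss ends the run. The sketch below illustrates that kind of guard. It is written for this thread and is not the code in official/core/base_trainer.py; the class name NanGuard and its error message are hypothetical.

```python
import math


class NanGuard:
    """Hypothetical stand-in for a trainer's bad-loss recovery check."""

    def __init__(self, loss_upper_bound, max_trials):
        self.loss_upper_bound = loss_upper_bound
        self.max_trials = max_trials
        self.trials = 0  # number of bad losses seen so far

    def check(self, loss_value):
        """Returns True if a recovery should be attempted, False if the loss is healthy.

        Raises RuntimeError once the number of bad losses exceeds max_trials.
        """
        bad = math.isnan(loss_value) or loss_value > self.loss_upper_bound
        if not bad:
            return False
        self.trials += 1
        if self.trials > self.max_trials:
            raise RuntimeError(
                f"Bad loss value seen {self.trials} time(s); retry budget exhausted.")
        return True


# Values taken from the config dump in this report:
# loss_upper_bound=1000000.0, recovery_max_trials=0.
guard = NanGuard(loss_upper_bound=1_000_000.0, max_trials=0)
print(guard.check(7.83986902))  # a finite loss passes the check
```

With max_trials=0 the first NaN raises immediately, so a larger retry budget would only delay the failure rather than address the cause of the NaN.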

3. Steps to reproduce

First, pull the Docker image: nvcr.io/nvidia/tensorflow:21.12-tf2-py3

Then install the Python packages listed above with pip.

I'm still on TF2 r2.6.0, and I tried to train resnet_rs_imagenet with the commands below:

git checkout r2.6.0
cd official/vision/beta
export PYTHONPATH=$(realpath ../../../):$PYTHONPATH
export TF_FORCE_GPU_ALLOW_GROWTH=true
python train.py --experiment=resnet_rs_imagenet --mode=train_and_eval --model_dir=./results_resnet-rs50_tf2 --config_file=configs/experiments/image_classification/imagenet_resnetrs50_i160.yaml --params_override='runtime.enable_xla=False, runtime.num_gpus=1, runtime.mixed_precision_dtype=float16, runtime.distribution_strategy='one_device', task.train_data.input_path='/ppusw/datasets/vision/imagenet/tfrecords/train*', task.train_data.global_batch_size=2, task.train_data.dtype=float16, task.validation_data.input_path='/ppusw/datasets/vision/imagenet/tfrecords/valid*', task.validation_data.global_batch_size=2, task.validation_data.dtype=float16, trainer.train_steps=100, trainer.validation_steps=25000, trainer.validation_interval=640583, trainer.steps_per_loop=640583, trainer.summary_interval=640583, trainer.checkpoint_interval=640583, trainer.optimizer_config.ema='', trainer.optimizer_config.learning_rate.cosine.decay_steps=100, trainer.optimizer_config.warmup.linear.warmup_steps=0'

4. Expected behavior

No error occurs.

5. Additional context

To find out why this error happened, I added some tf.print() calls in official/core/base_trainer.py:415. They print the task_train_step function and the loss value it returns:

{'loss': 7.83986902}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': 167.242157}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': 61.2824478}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': 104.935242}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': 75.9977112}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': 173.514297}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': 205.51796}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': 169.674866}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': nan}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': nan}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': nan}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': nan}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': nan}
task_train_step <bound method ImageClassificationTask.train_step of <official.vision.beta.tasks.image_classification.ImageClassificationTask object at 0x7f50d165b280>>
{'loss': nan}
......

It seems that the loss value becomes NaN after just a few steps.
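
To pin down exactly where the loss stops being finite, the logged values can be scanned programmatically. The sketch below uses the loss values printed above; the helper name first_non_finite_step is hypothetical and only the standard library is needed.

```python
import math


def first_non_finite_step(losses):
    """Returns the 0-based index of the first NaN or infinite loss, or None."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Loss values copied from the tf.print() output in this report.
logged_losses = [
    7.83986902, 167.242157, 61.2824478, 104.935242,
    75.9977112, 173.514297, 205.51796, 169.674866,
    float("nan"), float("nan"), float("nan"),
]

print(first_non_finite_step(logged_losses))
```

The loss already jumps from about 7.8 to above 160 on the second logged step, so the divergence starts well before the first NaN. TensorFlow also provides tf.debugging.enable_check_numerics(), which may help locate the first op that produces a NaN or Inf, at some cost in speed.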

Also, my detailed parameters are shown below in case you may need them:

{'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': False,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'one_device',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': None,
             'mixed_precision_dtype': 'float16',
             'num_cores_per_replica': 1,
             'num_gpus': 1,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': False,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'task': {'evaluation': {'top_k': 5},
          'init_checkpoint': None,
          'init_checkpoint_modules': 'all',
          'losses': {'l2_weight_decay': 4e-05,
                     'label_smoothing': 0.1,
                     'one_hot': True},
          'model': {'add_head_batch_norm': False,
                    'backbone': {'resnet': {'depth_multiplier': 1.0,
                                            'model_id': 50,
                                            'replace_stem_max_pool': True,
                                            'resnetd_shortcut': True,
                                            'se_ratio': 0.25,
                                            'stem_type': 'v1',
                                            'stochastic_depth_drop_rate': 0.0},
                                 'type': 'resnet'},
                    'dropout_rate': 0.25,
                    'input_size': [160, 160, 3],
                    'norm_activation': {'activation': 'swish',
                                        'norm_epsilon': 1e-05,
                                        'norm_momentum': 0.0,
                                        'use_sync_bn': False},
                    'num_classes': 1001},
          'model_output_keys': [],
          'train_data': {'aug_policy': None,
                         'aug_rand_hflip': True,
                         'aug_type': {'randaug': {'cutout_const': 40,
                                                  'magnitude': 10,
                                                  'num_layers': 2,
                                                  'prob_to_apply': None,
                                                  'translate_const': 10},
                                      'type': 'randaug'},
                         'block_length': 1,
                         'cache': False,
                         'cycle_length': 10,
                         'decode_jpeg_only': True,
                         'deterministic': None,
                         'drop_remainder': True,
                         'dtype': 'float16',
                         'enable_tf_data_service': False,
                         'file_type': 'tfrecord',
                         'global_batch_size': 2,
                         'image_field_key': 'image/encoded',
                         'input_path': '/ppusw/datasets/vision/imagenet/tfrecords/train*',
                         'is_multilabel': False,
                         'is_training': True,
                         'label_field_key': 'image/class/label',
                         'randaug_magnitude': 10,
                         'seed': None,
                         'sharding': True,
                         'shuffle_buffer_size': 10000,
                         'tf_data_service_address': None,
                         'tf_data_service_job_name': None,
                         'tfds_as_supervised': False,
                         'tfds_data_dir': '',
                         'tfds_name': '',
                         'tfds_skip_decoding_feature': '',
                         'tfds_split': ''},
          'validation_data': {'aug_policy': None,
                              'aug_rand_hflip': True,
                              'aug_type': None,
                              'block_length': 1,
                              'cache': False,
                              'cycle_length': 10,
                              'decode_jpeg_only': True,
                              'deterministic': None,
                              'drop_remainder': False,
                              'dtype': 'float16',
                              'enable_tf_data_service': False,
                              'file_type': 'tfrecord',
                              'global_batch_size': 2,
                              'image_field_key': 'image/encoded',
                              'input_path': '/ppusw/datasets/vision/imagenet/tfrecords/valid*',
                              'is_multilabel': False,
                              'is_training': False,
                              'label_field_key': 'image/class/label',
                              'randaug_magnitude': 10,
                              'seed': None,
                              'sharding': True,
                              'shuffle_buffer_size': 10000,
                              'tf_data_service_address': None,
                              'tf_data_service_job_name': None,
                              'tfds_as_supervised': False,
                              'tfds_data_dir': '',
                              'tfds_name': '',
                              'tfds_skip_decoding_feature': '',
                              'tfds_split': ''}},
 'trainer': {'allow_tpu_summary': False,
             'best_checkpoint_eval_metric': '',
             'best_checkpoint_export_subdir': '',
             'best_checkpoint_metric_comp': 'higher',
             'checkpoint_interval': 640583,
             'continuous_eval_timeout': 3600,
             'eval_tf_function': True,
             'eval_tf_while_loop': False,
             'loss_upper_bound': 1000000.0,
             'max_to_keep': 5,
             'optimizer_config': {'ema': None,
                                  'learning_rate': {'cosine': {'alpha': 0.0,
                                                               'decay_steps': 100,
                                                               'initial_learning_rate': 1.6,
                                                               'name': 'CosineDecay',
                                                               'offset': 0},
                                                    'type': 'cosine'},
                                  'optimizer': {'sgd': {'clipnorm': None,
                                                        'clipvalue': None,
                                                        'decay': 0.0,
                                                        'global_clipnorm': None,
                                                        'momentum': 0.9,
                                                        'name': 'SGD',
                                                        'nesterov': False},
                                                'type': 'sgd'},
                                  'warmup': {'linear': {'name': 'linear',
                                                        'warmup_learning_rate': 0,
                                                        'warmup_steps': 0},
                                             'type': 'linear'}},
             'recovery_begin_steps': 0,
             'recovery_max_trials': 0,
             'steps_per_loop': 640583,
             'summary_interval': 640583,
             'train_steps': 100,
             'train_tf_function': True,
             'train_tf_while_loop': True,
             'validation_interval': 640583,
             'validation_steps': 25000,
             'validation_summary_subdir': 'validation'}}
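
One thing that stands out in these parameters is the learning-rate schedule: cosine decay starting at 1.6 with warmup_steps set to 0 and a global batch size of 2. The sketch below evaluates the standard cosine decay formula for the first few steps and, for comparison, applies the common linear scaling rule. The reference batch size of 4096 is an assumption (it does not appear in this thread), and both helper names are hypothetical.

```python
import math


def cosine_lr(step, initial_lr, decay_steps, alpha=0.0):
    """Standard cosine decay: initial_lr at step 0, alpha * initial_lr at decay_steps."""
    step = min(step, decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)


def linearly_scaled_lr(reference_lr, reference_batch, actual_batch):
    """Linear scaling rule: the learning rate shrinks in proportion to the batch size."""
    return reference_lr * actual_batch / reference_batch


# Values from the config dump above.
INITIAL_LR = 1.6
DECAY_STEPS = 100
GLOBAL_BATCH = 2

# ASSUMPTION: the batch size the 1.6 learning rate was tuned for; not stated in the thread.
REFERENCE_BATCH = 4096

for step in range(0, 9, 2):
    print(step, round(cosine_lr(step, INITIAL_LR, DECAY_STEPS), 4))

print("scaled:", linearly_scaled_lr(INITIAL_LR, REFERENCE_BATCH, GLOBAL_BATCH))
```

Under these settings the learning rate stays above 1.5 for the first eight steps, which is the range in which the loss diverges. Whether a smaller learning rate, a nonzero warmup, or float16 loss scaling (loss_scale is None here) removes the NaN would need to be checked experimentally.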

6. System information

laxmareddyp commented 9 months ago

Hi @dingfen ,

Could you please try the latest version (pip install tf-models-official)? The older versions are probably not compatible with other changes in the codebase. Please let us know how it goes with the latest version of Model Garden.

Thanks.

github-actions[bot] commented 8 months ago

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

TahaErr commented 8 months ago

I'm facing the same problem with spinnet. Did you find a solution? @dingfen

dingfen commented 8 months ago

Hi all, sorry for the late reply. Following @laxmareddyp's suggestion, I updated tensorflow and tf-models-official to version 2.13. But when I reran the command, I got the errors below:

Traceback (most recent call last):                                                                                                                   
  File "train.py", line 98, in <module>                                                                                                              
    app.run(main)                                                                                                                                    
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run                                                                        
    _run_main(main, args)                                                                                                                            
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main                                                                  
    sys.exit(main(argv))                                                                                                                             
  File "train.py", line 92, in main                                                                                                                  
    _run_experiment_with_preemption_recovery(params, model_dir)                                                                                      
  File "train.py", line 73, in _run_experiment_with_preemption_recovery                                                                              
    raise e from None                                                                                                                                
  File "train.py", line 51, in _run_experiment_with_preemption_recovery                                                                              
    train_lib.run_experiment(                                                                                                                        
  File "/usr/local/lib/python3.8/dist-packages/official/core/train_lib.py", line 357, in run_experiment                                              
    return runner.run()                                                                                                                              
  File "/usr/local/lib/python3.8/dist-packages/official/core/train_lib.py", line 260, in run                                                         
    self.controller.train_and_evaluate(                                                                                                              
  File "/usr/local/lib/python3.8/dist-packages/orbit/controller.py", line 381, in train_and_evaluate                                                 
    self.train(steps=num_steps, checkpoint_at_completion=False)                                                                                      
  File "/usr/local/lib/python3.8/dist-packages/orbit/controller.py", line 271, in train                                                              
    self._train_n_steps(num_steps)                                                                                                                   
  File "/usr/local/lib/python3.8/dist-packages/orbit/controller.py", line 502, in _train_n_steps                                                     
    train_output = self.trainer.train(num_steps_tensor)                                                                                              
  File "/usr/local/lib/python3.8/dist-packages/orbit/standard_runner.py", line 146, in train                                                         
    self._train_loop_fn(self._train_iter, num_steps)                                                                                                 
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler                                
    raise e.with_traceback(filtered_tb) from None                                                                                                    
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute                                        
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,                                                                           
tensorflow.python.framework.errors_impl.UnimplementedError: Graph execution error:

Detected at node 'classification_model/res_net/conv2d/Conv2D' defined at (most recent call last):                                                    
    File "train.py", line 98, in <module>                                                                                                            
      app.run(main)                                                                                                                                  
    File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run                                                                      
      _run_main(main, args)                                                                                                                          
    File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main                                                                
      sys.exit(main(argv))                                                                                                                           
    File "train.py", line 92, in main                                                                                                                
      _run_experiment_with_preemption_recovery(params, model_dir)
...
    File "/usr/local/lib/python3.8/dist-packages/keras/src/engine/base_layer.py", line 1150, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/src/layers/convolutional/base_conv.py", line 290, in call
      outputs = self.convolution_op(inputs, self.kernel)
    File "/usr/local/lib/python3.8/dist-packages/keras/src/layers/convolutional/base_conv.py", line 262, in convolution_op
      return tf.nn.convolution(
Node: 'classification_model/res_net/conv2d/Conv2D'
2 root error(s) found.
  (0) UNIMPLEMENTED:  DNN library is not found.
         [[{{node classification_model/res_net/conv2d/Conv2D}}]]
         [[while/body/_1/while/NoOp/_39]]
  (1) UNIMPLEMENTED:  DNN library is not found.
         [[{{node classification_model/res_net/conv2d/Conv2D}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_loop_fn_32755]

laxmareddyp commented 8 months ago

Hi @dingfen ,

This does not look like a model code error; it looks like a wrong cuDNN version in the GPU environment. Please check the following issue, where the same error has been resolved.

Thanks.

dingfen commented 8 months ago

Hi @laxmareddyp, following that issue, I installed libcudnn8 like this:

apt-get update && apt-get install libcudnn8-dev=8.4.1.50-1+cuda11.6 libcudnn8=8.4.1.50-1+cuda11.6

But the above error still exists!
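
Installing a package does not by itself show which cuDNN copy, if any, the dynamic loader resolves. The sketch below is one way to probe that from Python with ctypes. It assumes a cuDNN 8.x library, whose cudnnGetVersion() returns major*1000 + minor*100 + patch; the helper names are hypothetical, and ctypes.util.find_library may search different paths than TensorFlow does, so the result is only a hint.

```python
import ctypes
import ctypes.util


def decode_cudnn8_version(raw):
    """Splits a cuDNN 8.x integer version into (major, minor, patch)."""
    return raw // 1000, (raw % 1000) // 100, raw % 100


def loaded_cudnn_version():
    """Returns the raw cudnnGetVersion() value, or None if cuDNN cannot be loaded."""
    name = ctypes.util.find_library("cudnn")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
        lib.cudnnGetVersion.restype = ctypes.c_size_t
        return int(lib.cudnnGetVersion())
    except (OSError, AttributeError):
        # The library was found but could not be loaded, or lacks the symbol.
        return None


raw = loaded_cudnn_version()
if raw is None:
    print("cuDNN could not be found or loaded by the system loader")
else:
    print("cuDNN version:", decode_cudnn8_version(raw))
```

A result of None from inside the same container and Python environment used for training would suggest checking the library search path before looking at version numbers.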

The train.py source code:

# Copyright 2023 The TensorFlow Authors. All Rights Reserved.                                                                                        
#                                                                                                                                                    
# Licensed under the Apache License, Version 2.0 (the "License");                                                                                    
# you may not use this file except in compliance with the License.                                                                                   
# You may obtain a copy of the License at                                                                                                            
#                                                                                                                                                    
#     http://www.apache.org/licenses/LICENSE-2.0                                                                                                     
#                                                                                                                                                    
# Unless required by applicable law or agreed to in writing, software                                                                                
# distributed under the License is distributed on an "AS IS" BASIS,                                                                                  
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.                                                                           
# See the License for the specific language governing permissions and                                                                                
# limitations under the License.                                                                                                                     

"""TensorFlow Model Garden Vision training driver."""                                                                                                

from absl import app                                                                                                                                 
from absl import flags                                                                                                                               
from absl import logging                                                                                                                             
import gin                                                                                                                                           
import tensorflow as tf                                                                                                                              

from official.common import distribute_utils                                                                                                         
from official.common import flags as tfm_flags                                                                                                       
from official.core import task_factory                                                                                                               
from official.core import train_lib                                                                                                                  
from official.core import train_utils                                                                                                                
from official.modeling import performance                                                                                                            
from official.vision import registry_imports  # pylint: disable=unused-import                                                                        
from official.vision.utils import summary_manager                                                                                                    

FLAGS = flags.FLAGS                                                                                                                                  

def _run_experiment_with_preemption_recovery(params, model_dir): 
  """Runs experiment and tries to reconnect when encountering a preemption."""
  keep_training = True
  while keep_training:
    preemption_watcher = None
    try:
      distribution_strategy = distribute_utils.get_distribution_strategy(
          distribution_strategy=params.runtime.distribution_strategy,
          all_reduce_alg=params.runtime.all_reduce_alg,
          num_gpus=params.runtime.num_gpus,
          tpu_address=params.runtime.tpu)
      with distribution_strategy.scope():
        task = task_factory.get_task(params.task, logging_dir=model_dir)
      preemption_watcher = tf.distribute.experimental.PreemptionWatcher()

      train_lib.run_experiment(
          distribution_strategy=distribution_strategy,
          task=task,
          mode=FLAGS.mode,
          params=params,
          model_dir=model_dir,
          summary_manager=None,
          eval_summary_manager=summary_manager.maybe_build_eval_summary_manager(
              params=params, model_dir=model_dir
          ),
      )

      keep_training = False
    except tf.errors.OpError as e:
      if preemption_watcher and preemption_watcher.preemption_message:
        preemption_watcher.block_until_worker_exit()
        logging.info(
            'Some TPU workers had been preempted (message: %s), '
            'restarting training from the last checkpoint...',
            preemption_watcher.preemption_message)
        keep_training = True
      else:
        raise e from None

def main(_):
  gin.parse_config_files_and_bindings(FLAGS.gin_file, FLAGS.gin_params)
  params = train_utils.parse_configuration(FLAGS)
  model_dir = FLAGS.model_dir
  if 'train' in FLAGS.mode:
    # Pure eval modes do not output yaml files. Otherwise continuous eval job
    # may race against the train job for writing the same file.
    train_utils.serialize_config(params, model_dir)

  # Sets mixed_precision policy. Using 'mixed_float16' or 'mixed_bfloat16'
  # can have significant impact on model speeds by utilizing float16 in case of
  # GPUs, and bfloat16 in the case of TPUs. loss_scale takes effect only when
  # dtype is float16
  if params.runtime.mixed_precision_dtype:
    performance.set_mixed_precision_policy(params.runtime.mixed_precision_dtype)

  _run_experiment_with_preemption_recovery(params, model_dir)
  train_utils.save_gin_config(FLAGS.mode, model_dir)

if __name__ == '__main__':
  tfm_flags.define_flags()
  flags.mark_flags_as_required(['experiment', 'mode', 'model_dir'])
  app.run(main)

cuda version: V11.6.1, cudnn version: 8.4.1.

Here is my pip version list:

jupyter-tensorboard           0.2.0
tensorboard                   2.13.0
tensorboard-data-server       0.7.2
tensorboard-plugin-wit        1.8.1
tensorflow                    2.13.1
tensorflow-addons             0.16.1
tensorflow-datasets           3.2.1
tensorflow-estimator          2.13.0
tensorflow-hub                0.15.0
tensorflow-io-gcs-filesystem  0.34.0
tensorflow-metadata           1.7.0
tensorflow-model-optimization 0.7.5
tensorflow-text               2.13.0
tensorrt                      8.2.4.2

laxmareddyp commented 8 months ago

Hi @dingfen ,

Could you please check these compatible versions against your environment setup?

Thanks.
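
A compatibility check of this kind can be sketched as a simple version comparison. The installed versions below are the ones reported in this thread (CUDA 11.6.1, cuDNN 8.4.1). The required versions are an assumption based on the tested build configurations published for TensorFlow 2.13 (CUDA 11.8, cuDNN 8.6) and should be verified against the official table; the helper names are hypothetical.

```python
def parse_version(text):
    """Turns a dotted version string such as '8.4.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, required):
    """Returns True if the installed version is at least the required one."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    # Pad with zeros so '8.6' compares like '8.6.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Reported in this thread.
INSTALLED = {"cuda": "11.6.1", "cudnn": "8.4.1"}

# ASSUMPTION: versions TensorFlow 2.13 was built and tested against.
REQUIRED = {"cuda": "11.8", "cudnn": "8.6"}


def check_environment(installed=INSTALLED, required=REQUIRED):
    """Maps each component to whether its installed version meets the minimum."""
    return {name: meets_minimum(installed[name], required[name]) for name in required}


print(check_environment())
```

If the assumed requirements are right, both components in the reported environment are older than what TensorFlow 2.13 expects, which would be consistent with the "DNN library is not found" error. On an installed TensorFlow, tf.sysconfig.get_build_info() reports the CUDA and cuDNN versions the binary was built with.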

github-actions[bot] commented 7 months ago

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 7 months ago

This issue was closed due to lack of activity after being marked stale for past 7 days.
