tensorflow / models

Models and examples built with TensorFlow

name 'contrib_training' is not defined error when trying to train tensorflow 2 model #9379

Closed eren-erver closed 4 years ago

eren-erver commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py

2. Describe the bug

When I try to train a TensorFlow 2 model, I get this error:

Traceback (most recent call last):
  File "model_main_tf2.py", line 112, in <module>
    tf.compat.v1.app.run()
  File "D:\Anaconda3\envs\object_detection\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "D:\Anaconda3\envs\object_detection\lib\site-packages\absl\app.py", line 300, in run
    _run_main(main, args)
  File "D:\Anaconda3\envs\object_detection\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 105, in main
    hparams=model_hparams.create_hparams(FLAGS.hparams_overrides),
  File "D:\indirilenler\models-2.3.0\models-2.3.0\research\object_detection\model_hparams.py", line 43, in create_hparams
    hparams = contrib_training.HParams(
NameError: name 'contrib_training' is not defined

3. Steps to reproduce

Try to train the centernet_resnet101_v1_fpn_512x512_coco17_tpu-8 model with model_main_tf2.py.

4. Expected behavior

It should start to train the model

5. Additional context

Edit: model_builder_tf2_test.py completes successfully

6. System information

ravikyram commented 4 years ago

@eren-erver

Please provide the exact sequence of commands / steps that you executed before running into the problem. Thanks!

eren-erver commented 4 years ago

I executed this:

python model_main_tf2.py --pipeline_config_path=training3/centernet_resnet101_v1_fpn_512x512_coco17_tpu-8.config --model_dir=training3/ --alsologtostderr

centernet_resnet101_v1_fpn_512x512_coco17_tpu-8.config is:

model {
  center_net {
    num_classes: 4
    feature_extractor {
      type: "resnet_v1_101"
    }
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 512
        max_dimension: 512
        pad_to_max_dimension: true
      }
    }
    object_detection_task {
      task_loss_weight: 1.0
      offset_loss_weight: 1.0
      scale_loss_weight: 0.1
      localization_loss {
        l1_localization_loss {
        }
      }
    }
    object_center_params {
      object_center_loss_weight: 1.0
      min_box_overlap_iou: 0.7
      max_box_predictions: 100
      classification_loss {
        penalty_reduced_logistic_focal_loss {
          alpha: 2.0
          beta: 4.0
        }
      }
    }
  }
}

train_config: {

  batch_size: 128
  num_steps: 140000

  data_augmentation_options {
    random_horizontal_flip {
    }
  }

  data_augmentation_options {
    random_crop_image {
      min_aspect_ratio: 0.5
      max_aspect_ratio: 1.7
      random_coef: 0.25
    }
  }

  data_augmentation_options {
    random_adjust_hue {
    }
  }

  data_augmentation_options {
    random_adjust_contrast {
    }
  }

  data_augmentation_options {
    random_adjust_saturation {
    }
  }

  data_augmentation_options {
    random_adjust_brightness {
    }
  }

  data_augmentation_options {
    random_absolute_pad_image {
       max_height_padding: 200
       max_width_padding: 200
       pad_color: [0, 0, 0]
    }
  }

  optimizer {
    adam_optimizer: {
      epsilon: 1e-7  # Match tf.keras.optimizers.Adam's default.
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 1e-3
          schedule {
           step: 90000
           learning_rate: 1e-4
          }
          schedule {
            step: 120000
            learning_rate: 1e-5
          }
        }
      }
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false

  fine_tune_checkpoint_version: V2
  fine_tune_checkpoint: "centernet_resnet101_v1_fpn_512x512_coco17_tpu-8/checkpoint/ckpt-0"
  fine_tune_checkpoint_type: "classification"
}

train_input_reader: {
  label_map_path: "training3/object-detection.pbtxt"
  tf_record_input_reader {
    input_path: "data/train.record"
  }
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1;
}

eval_input_reader: {
  label_map_path: "training3/object-detection.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "data/test.record"
  }
}
eren-erver commented 4 years ago

I fixed the issue by uninstalling everything, then reinstalling and setting up the environments all over again. I wish you had helped me, but anyway, have a great day.

hypadr1v3 commented 3 years ago

Is there any way to resolve this without reinstalling everything? I added autoaugment_image { policy_name: "v2" }, which resulted in the same error the author reported. The same autoaugment_image option works in the legacy training program with TensorFlow 1.
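
The NameError in the tracebacks in this thread is consistent with the guarded import quoted in the autoaugment_utils.py workaround further down: under TF 2.x the import of tensorflow.contrib fails, the ImportError is swallowed, and the missing name is only noticed when it is first used. Below is a minimal sketch of that failure mode. It assumes model_hparams.py guards its import the same way; the module name `missing_contrib_pkg` and the function body are hypothetical stand-ins, not the library's code.

```python
# Hypothetical stand-in for
# "from tensorflow.contrib import training as contrib_training".
try:
    from missing_contrib_pkg import training as contrib_training
except ImportError:
    # The failure is silenced, so the name contrib_training is never bound.
    pass


def create_hparams():
    # The missing import only surfaces here, at first use.
    return contrib_training.HParams(load_pretrained=True)


try:
    create_hparams()
    error_message = None
except NameError as err:
    error_message = str(err)

print(error_message)
```

If this is indeed the mechanism, it would also fit the observation that the same option works under TensorFlow 1, where the contrib package still exists.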

advaza commented 3 years ago

My workaround in utils/autoaugment_utils.py:

from tensorflow_addons import image as contrib_image
from collections import namedtuple
# pylint: disable=g-import-not-at-top
# try:
#   from tensorflow.contrib import image as contrib_image
#   from tensorflow.contrib import training as contrib_training
# except ImportError:
#   # TF 2.0 doesn't ship with contrib.
#   pass
# pylint: enable=g-import-not-at-top

...

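def hparams(**kwargs):
    # Read-only stand-in for contrib_training.HParams: keywords become attributes.
    return namedtuple("HParams", kwargs.keys())(*kwargs.values())

...

  # Line 1670 (used here in place of contrib_training.HParams)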
  augmentation_hparams = hparams(
      cutout_max_pad_fraction=0.75,
      cutout_bbox_replace_with_mean=False,
      cutout_const=100,
      translate_const=250,
      cutout_bbox_const=50,
      translate_bbox_const=120)
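
To see what the namedtuple stand-in above does and does not provide, here is a self-contained sketch. The field values are copied from the workaround. One caveat, stated as an assumption about how the result is used: the TF 1 HParams class offered methods such as parse(), which a namedtuple lacks, so this stand-in only fits code paths that read the values and never override them.

```python
from collections import namedtuple


def hparams(**kwargs):
    """Build a read-only namedtuple exposing each keyword as an attribute."""
    return namedtuple("HParams", kwargs.keys())(*kwargs.values())


augmentation_hparams = hparams(
    cutout_max_pad_fraction=0.75,
    cutout_bbox_replace_with_mean=False,
    cutout_const=100,
    translate_const=250,
    cutout_bbox_const=50,
    translate_bbox_const=120)

# Attribute access works the same way it does on an HParams object.
print(augmentation_hparams.cutout_const)

# Unlike HParams, the stand-in has no parse() method and cannot be modified.
print(hasattr(augmentation_hparams, "parse"))
```

Because create_hparams in the tracebacks receives FLAGS.hparams_overrides, a stand-in used there would probably need some way to apply overrides as well; the sketch above does not attempt that.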
gholn commented 3 years ago

I have the same problem. I installed tensorflow==2.4.1 and used models-r2.5.0. I verified the installation with "python object_detection/builders/model_builder_tf2_test.py", which reported "Ran 20 tests in 4.789s. OK (skipped=1)". I created a custom dataset following the instructions at "https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html". I then tried training two different models from the zoo, one after another: (a) ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8 (b) centernet_mobilenetv2fpn_512x512_coco17_od

In both cases, training stopped with the error below:

(tf241) rudra@rudra-System-Product-Name:~/ndg/tensorflow4/models-r2.5.0/research$ python object_detection/model_main_tf2.py --model_dir=/home/rudra/ndg/tensorflow4/trg/centre_mf512_1 --pipeline_config_path=/home/rudra/ndg/tf2_model_zoo/centernet_mobnetv2_fpn_512_od/pipeline.config --alsologtostderr
2021-04-28 08:42:07.006018: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-04-28 08:42:07.006049: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-04-28 08:42:08.434410: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-28 08:42:08.434592: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-04-28 08:42:08.434607: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-04-28 08:42:08.434620: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (rudra-System-Product-Name): /proc/driver/nvidia/version does not exist
2021-04-28 08:42:08.434901: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-28 08:42:08.435061: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
W0428 08:42:08.435549 139675016976192 cross_device_ops.py:1321] There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I0428 08:42:08.435713 139675016976192 mirrored_strategy.py:350] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Traceback (most recent call last):
  File "object_detection/model_main_tf2.py", line 118, in <module>
    tf.compat.v1.app.run()
  File "/home/rudra/anaconda3/envs/tf241/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/rudra/anaconda3/envs/tf241/lib/python3.7/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/rudra/anaconda3/envs/tf241/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "object_detection/model_main_tf2.py", line 108, in main
    hparams=model_hparams.create_hparams(FLAGS.hparams_overrides),
  File "/home/rudra/ndg/tensorflow4/models-r2.5.0/research/object_detection/model_hparams.py", line 43, in create_hparams
    hparams = contrib_training.HParams(
NameError: name 'contrib_training' is not defined

Kindly advise how to solve this issue.

luke-iqt commented 3 years ago

This should be reopened - I had the same error and this fix made it work: https://github.com/tensorflow/models/issues/9379#issuecomment-775019955

bensenW commented 2 years ago

  File "D:\indirilenler\models-2.3.0\models-2.3.0\research\object_detection\model_hparams.py", line 43, in create_hparams
    hparams = contrib_training.HParams(
NameError: name 'contrib_training' is not defined

I have the same problem. How do I handle it?

karuna-k1 commented 2 years ago

I have the same problem, too. Could someone help to fix it?

CometManAtGitHub commented 2 years ago

Hi, I had the same problems with hparams etc.

These steps finally worked for me. Maybe TF 2.7.0 resolved the errors I had at first with TF 2.5.0, or maybe what helped was not pasting TF-models into the conda env's TF installation folder; see below for context.

Please follow exactly these instructions here: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/install.html#

You can avoid the manual CUDA-specific installations (Install CUDA Toolkit, Install cudnn) by using a conda installation in the beginning. All steps must then be done in this activated env named "tensorflow":

conda create --name tensorflow tensorflow-gpu==2.5.0
conda activate tensorflow

Do not paste or git-clone Tensorflow-Models into the env's TensorFlow folder. When you reach the installation step below, the conda env will be updated to contain tensorflow=2.7.0, which works fine in conda:

python -m pip install --use-feature=2020-resolver .

Then prepare the tfRecords and models as documented and start training with, e.g.:

(tensorflow) C:\Users\cometman\Documents\Tensorflow\workspace\training_demo>python model_main_tf2.py --model_dir=models/myModelTrained --pipeline_config_path=pre-trained-models/faster_rcnn_resnet50_v1_640x640_coco17_tpu-8/pipeline.config

...

INFO:tensorflow:{'Loss/BoxClassifierLoss/classification_loss': 0.095694065, 'Loss/BoxClassifierLoss/localization_loss': 0.16352704, 'Loss/RPNLoss/localization_loss': 0.0116138775, 'Loss/RPNLoss/objectness_loss': 0.01959239, 'Loss/regularization_loss': 0.0, 'Loss/total_loss': 0.29042736, 'learning_rate': 0.014666351}
I1230 23:26:42.123398 18304 model_lib_v2.py:708] {'Loss/BoxClassifierLoss/classification_loss': 0.095694065, 'Loss/BoxClassifierLoss/localization_loss': 0.16352704, 'Loss/RPNLoss/localization_loss': 0.0116138775, 'Loss/RPNLoss/objectness_loss': 0.01959239, 'Loss/regularization_loss': 0.0, 'Loss/total_loss': 0.29042736, 'learning_rate': 0.014666351}
INFO:tensorflow:Step 200 per-step time 0.205s

...

Hope this helps.
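
Since the steps above depend on which TensorFlow version ends up in the env, and on not having a stray copy of the models code inside it, it may help to check what Python will actually pick up before starting a training run. Below is a minimal sketch using only the standard library. It assumes Python 3.8+ and the distribution and package names `tensorflow`, `tensorflow-gpu` and `object_detection` that appear in the commands above. The helper name `describe_install` is made up for illustration.

```python
import importlib.util
from importlib import metadata


def describe_install():
    """Report installed TensorFlow versions and where object_detection resolves."""
    report = {}
    # Distribution names are assumptions; either may be present in a given env.
    for dist in ("tensorflow", "tensorflow-gpu"):
        try:
            report[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            report[dist] = None
    # find_spec locates the package on the import path without importing it.
    spec = importlib.util.find_spec("object_detection")
    report["object_detection"] = spec.origin if spec is not None else None
    return report


for name, value in describe_install().items():
    print(f"{name}: {value}")
```

If the object_detection path printed here points somewhere unexpected, such as an older copy of the models code, that would be one thing to rule out before reinstalling everything.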

ebrarsahin commented 2 years ago

For me, uninstalling TensorFlow and installing it again worked.