Closed: eren-erver closed this issue 4 years ago
@eren-erver
Provide the exact sequence of commands / steps that you executed before running into the problem. Thanks!
I executed this:
python model_main_tf2.py --pipeline_config_path=training3/centernet_resnet101_v1_fpn_512x512_coco17_tpu-8.config --model_dir=training3/ --alsologtostderr
centernet_resnet101_v1_fpn_512x512_coco17_tpu-8.config is:
model {
  center_net {
    num_classes: 4
    feature_extractor {
      type: "resnet_v1_101"
    }
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 512
        max_dimension: 512
        pad_to_max_dimension: true
      }
    }
    object_detection_task {
      task_loss_weight: 1.0
      offset_loss_weight: 1.0
      scale_loss_weight: 0.1
      localization_loss {
        l1_localization_loss {
        }
      }
    }
    object_center_params {
      object_center_loss_weight: 1.0
      min_box_overlap_iou: 0.7
      max_box_predictions: 100
      classification_loss {
        penalty_reduced_logistic_focal_loss {
          alpha: 2.0
          beta: 4.0
        }
      }
    }
  }
}
train_config: {
  batch_size: 128
  num_steps: 140000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_aspect_ratio: 0.5
      max_aspect_ratio: 1.7
      random_coef: 0.25
    }
  }
  data_augmentation_options {
    random_adjust_hue {
    }
  }
  data_augmentation_options {
    random_adjust_contrast {
    }
  }
  data_augmentation_options {
    random_adjust_saturation {
    }
  }
  data_augmentation_options {
    random_adjust_brightness {
    }
  }
  data_augmentation_options {
    random_absolute_pad_image {
      max_height_padding: 200
      max_width_padding: 200
      pad_color: [0, 0, 0]
    }
  }
  optimizer {
    adam_optimizer: {
      epsilon: 1e-7 # Match tf.keras.optimizers.Adam's default.
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 1e-3
          schedule {
            step: 90000
            learning_rate: 1e-4
          }
          schedule {
            step: 120000
            learning_rate: 1e-5
          }
        }
      }
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_version: V2
  fine_tune_checkpoint: "centernet_resnet101_v1_fpn_512x512_coco17_tpu-8/checkpoint/ckpt-0"
  fine_tune_checkpoint_type: "classification"
}
train_input_reader: {
  label_map_path: "training3/object-detection.pbtxt"
  tf_record_input_reader {
    input_path: "data/train.record"
  }
}
eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1;
}
eval_input_reader: {
  label_map_path: "training3/object-detection.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "data/test.record"
  }
}
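For reference, the manual_step_learning_rate block in the config above describes a piecewise-constant schedule. Here is a minimal Python sketch of how those values map to a learning rate per step, assuming each new rate takes effect at its listed step and that no warmup is applied. The helper name learning_rate_at is made up for illustration and is not part of the Object Detection API.

```python
from bisect import bisect_right

# Values copied from the optimizer block of the config above.
INITIAL_LEARNING_RATE = 1e-3
SCHEDULE = [(90000, 1e-4), (120000, 1e-5)]  # (step, learning_rate)


def learning_rate_at(step):
    """Return the learning rate for a training step (hypothetical helper).

    Assumption: each rate applies from its listed step onward.
    """
    boundaries = [s for s, _ in SCHEDULE]
    rates = [INITIAL_LEARNING_RATE] + [r for _, r in SCHEDULE]
    # bisect_right counts how many boundaries are <= step.
    return rates[bisect_right(boundaries, step)]


for step in (0, 89999, 90000, 120000, 139999):
    print(step, learning_rate_at(step))
```

With num_steps: 140000, this would mean the last 20000 steps run at the smallest rate.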
I fixed the issue by uninstalling everything, then installing and setting up the environments all over again. I wish you had helped me, but anyway, have a great day.
Is there any way to resolve this without reinstalling everything? I added:
autoaugment_image { policy_name: "v2" }
which resulted in the same issue as the Author. The same autoaugment_image works in the legacy training program with tensorflow 1.
My workaround in utils/autoaugment_utils.py:
from tensorflow_addons import image as contrib_image
from collections import namedtuple
# pylint: disable=g-import-not-at-top
# try:
#   from tensorflow.contrib import image as contrib_image
#   from tensorflow.contrib import training as contrib_training
# except ImportError:
#   # TF 2.0 doesn't ship with contrib.
#   pass
# pylint: enable=g-import-not-at-top
...
def hparams(**kwargs):
    return namedtuple("HParams", kwargs.keys())(*kwargs.values())
# Line 1670
augmentation_hparams = hparams(
    cutout_max_pad_fraction=0.75,
    cutout_bbox_replace_with_mean=False,
    cutout_const=100,
    translate_const=250,
    cutout_bbox_const=50,
    translate_bbox_const=120)
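The namedtuple stand-in above can be tried on its own, without TensorFlow installed. This is a minimal sketch, assuming the calling code only reads hyperparameters as attributes. That is an assumption, and the stand-in is not a full replacement for contrib_training.HParams.

```python
from collections import namedtuple


def hparams(**kwargs):
    """Build an immutable, attribute-accessible bag of hyperparameters."""
    return namedtuple("HParams", kwargs.keys())(*kwargs.values())


augmentation_hparams = hparams(cutout_const=100, translate_const=250)

# Attribute reads work as the calling code expects.
print(augmentation_hparams.cutout_const)

# A namedtuple rejects attribute assignment, so any code that tries to
# mutate a hyperparameter after creation would fail with this stand-in.
try:
    augmentation_hparams.cutout_const = 50
except AttributeError as err:
    print("read-only:", err)
```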
Further to the same problem, I installed tensorflow==2.4.1 and used models-r2.5.0. The installation was verified by "python object_detection/builders/model_builder_tf2_test.py". It resulted in "Ran 20 tests in 4.789s. OK (skipped=1)". Created a custom dataset as per the instructions in "https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html". Training was tried on two different models from the zoo, one after another: (a) ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8 (b) centernet_mobilenetv2fpn_512x512_coco17_od
In both cases, training stopped with the error below:
(tf241) rudra@rudra-System-Product-Name:~/ndg/tensorflow4/models-r2.5.0/research$ python object_detection/model_main_tf2.py --model_dir=/home/rudra/ndg/tensorflow4/trg/centre_mf512_1 --pipeline_config_path=/home/rudra/ndg/tf2_model_zoo/centernet_mobnetv2_fpn_512_od/pipeline.config --alsologtostderr
2021-04-28 08:42:07.006018: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-04-28 08:42:07.006049: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-04-28 08:42:08.434410: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-28 08:42:08.434592: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-04-28 08:42:08.434607: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-04-28 08:42:08.434620: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (rudra-System-Product-Name): /proc/driver/nvidia/version does not exist
2021-04-28 08:42:08.434901: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-28 08:42:08.435061: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
W0428 08:42:08.435549 139675016976192 cross_device_ops.py:1321] There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I0428 08:42:08.435713 139675016976192 mirrored_strategy.py:350] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Traceback (most recent call last):
File "object_detection/model_main_tf2.py", line 118, in
Kindly advise how to solve this issue.
This should be reopened - I had the same error and this fix made it work: https://github.com/tensorflow/models/issues/9379#issuecomment-775019955
File "D:\indirilenler\models-2.3.0\models-2.3.0\research\object_detection\model_hparams.py", line 43, in create_hparams
hparams = contrib_training.HParams(
NameError: name 'contrib_training' is not defined
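A NameError like this one is what a guarded import produces when the import fails silently. A try/except ImportError block with a bare pass (the pattern shown commented out in the autoaugment_utils.py workaround earlier in this thread, and presumably used in model_hparams.py as well) skips the binding, and the failure only surfaces later when the name is used. A minimal sketch of that mechanism, using a made-up module name in place of tensorflow.contrib:

```python
# `no_such_package_for_demo` is a made-up module name standing in for
# tensorflow.contrib, which TF 2.x does not ship.
try:
    from no_such_package_for_demo import training as contrib_training
except ImportError:
    pass  # the import failure is swallowed here


def create_hparams():
    # The name was never bound, so the error appears only at call time.
    return contrib_training.HParams()


try:
    create_hparams()
except NameError as err:
    message = str(err)
    print(message)
```

This would explain why an installation check can pass while training still fails: nothing goes wrong until the code path that uses the missing name is reached.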
I have the same problem. How do I handle it?
I have the same problem, too. Could someone help to fix it?
Hi, I had the same problems with hparams etc.
These steps finally worked for me. Maybe TF-2.7.0 resolved the errors I had in the beginning with TF-2.5.0, or maybe what helped was not pasting TF-models into the conda env's TF installation folder; see below for context.
Please follow exactly these instructions here: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/install.html#
You can avoid the manual cuda-specific installations (Install CUDA Toolkit, Install cudnn) by using a conda installation in the beginning. All steps must then be done in this activated env named "tensorflow":
conda create --name tensorflow tensorflow-gpu==2.5.0
conda activate tensorflow
Do not paste or git-clone Tensorflow-Models into the env Tensorflow folder. When you reach the installation step below, the conda env will be updated to contain tensorflow=2.7.0, which works fine in conda:
python -m pip install --use-feature=2020-resolver .
Then prepare tfRecords and models as documented and start training by e.g.:
(tensorflow) C:\Users\cometman\Documents\Tensorflow\workspace\training_demo>python model_main_tf2.py --model_dir=models/myModelTrained --pipeline_config_path=pre-trained-models/faster_rcnn_resnet50_v1_640x640_coco17_tpu-8/pipeline.config
...
INFO:tensorflow:{'Loss/BoxClassifierLoss/classification_loss': 0.095694065, 'Loss/BoxClassifierLoss/localization_loss': 0.16352704, 'Loss/RPNLoss/localization_loss': 0.0116138775, 'Loss/RPNLoss/objectness_loss': 0.01959239, 'Loss/regularization_loss': 0.0, 'Loss/total_loss': 0.29042736, 'learning_rate': 0.014666351}
I1230 23:26:42.123398 18304 model_lib_v2.py:708] {'Loss/BoxClassifierLoss/classification_loss': 0.095694065, 'Loss/BoxClassifierLoss/localization_loss': 0.16352704, 'Loss/RPNLoss/localization_loss': 0.0116138775, 'Loss/RPNLoss/objectness_loss': 0.01959239, 'Loss/regularization_loss': 0.0, 'Loss/total_loss': 0.29042736, 'learning_rate': 0.014666351}
INFO:tensorflow:Step 200 per-step time 0.205s
...
Hope this helps.
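Before starting a long training run in a freshly built env, it can help to confirm which packages the active interpreter actually sees. Here is a minimal sketch using only the standard library. The helper name is_importable is made up for illustration, and the package list is an assumption based on the names mentioned in this thread.

```python
import importlib.util


def is_importable(name):
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(name) is not None


for package in ("tensorflow", "tensorflow_addons", "object_detection"):
    status = "found" if is_importable(package) else "missing"
    print(f"{package}: {status}")
```

Run it with the same python that launches model_main_tf2.py, so that it reports on the activated conda env rather than a system interpreter.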
For me, uninstalling TensorFlow and installing it again worked.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py
2. Describe the bug
When I try to train a TensorFlow 2 model, it gives me this error:
Traceback (most recent call last):
File "model_main_tf2.py", line 112, in <module>
tf.compat.v1.app.run()
File "D:\Anaconda3\envs\object_detection\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "D:\Anaconda3\envs\object_detection\lib\site-packages\absl\app.py", line 300, in run
_run_main(main, args)
File "D:\Anaconda3\envs\object_detection\lib\site-packages\absl\app.py", line 251, in _run_main
sys.exit(main(argv))
File "model_main_tf2.py", line 105, in main
hparams=model_hparams.create_hparams(FLAGS.hparams_overrides),
File "D:\indirilenler\models-2.3.0\models-2.3.0\research\object_detection\model_hparams.py", line 43, in create_hparams
hparams = contrib_training.HParams(
NameError: name 'contrib_training' is not defined
3. Steps to reproduce
You should try to train the centernet_resnet101_v1_fpn_512x512_coco17_tpu-8 model with model_main_tf2.py
4. Expected behavior
It should start to train the model
5. Additional context
Edit: model_builder_tf2_test.py completes successfully
6. System information