tensorflow / models

Models and examples built with TensorFlow

Error when training voc2012 with mask rcnn #3972

Closed Philip-Chen closed 6 years ago

Philip-Chen commented 6 years ago

The same error on all datasets and all mask models

System information

(tensorflow) philip_chen@Chen-Lenovo:~/TensorFlow/models/research$ CUDA_VISIBLE_DEVICES=1 python object_detection/train.py --logtostderr --pipeline_config_path=/home/philip_chen/TensorFlow/models/research/object_detection/mask_rcnn_inception_v2_coco_2018_01_28/mask_rcnn_inception_v2_coco.config --train_dir=/home/philip_chen/TensorFlow/models/research/object_detection/mask_rcnn_inception_v2_coco_2018_01_28/train

EDIT: (robieta) Moved full output to a separate file obj_detection_output.txt

/home/philip_chen/anaconda3/envs/tensorflow/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.

...

InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [2] [[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All, Loss/RPNLoss/assert_equal/Assert/Assert/data_0, Loss/RPNLoss/assert_equal/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/RPNLoss/ones_1/packed)]]
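The assertion message above reports two values, x and y, that TensorFlow expected to be equal (here 0 and 2). The following is a minimal sketch, not part of the Object Detection API, for pulling both values out of such a message when scanning a long log. The helper name `parse_assert_equal` and the regular expression are assumptions made for illustration, based only on the message format shown in this thread.

```python
import re

# Hypothetical pattern: matches "[x (<tensor name>) = ] [<int>]" and the same for y,
# following the message layout quoted in this thread.
_PATTERN = re.compile(r"\[(x|y) \(([^)]*)\) = \] \[(-?\d+)\]")


def parse_assert_equal(message):
    """Return {'x': (tensor_name, value), 'y': (tensor_name, value)} found in message."""
    found = {}
    for side, tensor, value in _PATTERN.findall(message):
        # Keep the first occurrence of each side; logs often repeat the message.
        found.setdefault(side, (tensor, int(value)))
    return found


if __name__ == "__main__":
    log = (
        "assertion failed: [] [Condition x == y did not hold element-wise:] "
        "[x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] "
        "[y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [2]"
    )
    for side, (tensor, value) in sorted(parse_assert_equal(log).items()):
        print(side, tensor, value)
```

Comparing the two extracted values across runs (for example 0 versus 2 here, and 0 versus 1 in a later comment) makes it easier to see which side of the comparison changes with the configuration.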

hedeya1980 commented 6 years ago

I'm facing the same error and really need help solving it.

lulu12132017 commented 6 years ago

Me too. Has anyone solved it?

robieta commented 6 years ago

If you run without the checkpoint do you still get the assertion errors?

hedeya1980 commented 6 years ago

Hi @robieta, what do you mean by running without the checkpoint? Do you mean that I should set 'from_detection_checkpoint:' to 'false' in the configuration file?

When I did this, I got other errors.

Could you please clarify?

robieta commented 6 years ago

What errors do you get when you set from_detection_checkpoint to false?

hedeya1980 commented 6 years ago

Hi @robieta, when I set from_detection_checkpoint to false (mask_rcnn_inception_resnet_v2_atrous_coco), I got the following errors:

EDIT: (robieta) Moved full output to a separate file obj_detection_output2.txt

C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\ops\gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
WARNING:root:Variable [InceptionResnetV2/Block8/Branch_0/Conv2d_1x1/BatchNorm/beta] is not available in checkpoint

...

WARNING:root:Variable [InceptionResnetV2/Repeat_2/block8_9/Conv2d_1x1/weights/Momentum] is not available in checkpoint
Traceback (most recent call last):
  File "train.py", line 167, in <module>
    tf.app.run()
  File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run
    _sys.exit(main(argv))
  File "train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "C:\Users\hedey\models\research\object_detection\trainer.py", line 352, in train
    init_saver = tf.train.Saver(available_var_map)
  File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1239, in __init__
    self.build()
  File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1248, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1272, in _build
    raise ValueError("No variables to save")
ValueError: No variables to save

lulu12132017 commented 6 years ago

Do not use a checkpoint, like this:

#fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
from_detection_checkpoint: false

you can try
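To confirm which checkpoint settings are actually active after commenting a line out, a small check such as the following can help. This is a sketch under stated assumptions, not part of the Object Detection API: the helper name `read_checkpoint_settings` is hypothetical, and it treats everything after `#` as a comment, which is a simplification that would misread a `#` inside a quoted path.

```python
import re

# Only the two fields discussed in this thread are inspected.
_FIELD = re.compile(r"(fine_tune_checkpoint|from_detection_checkpoint)\s*:\s*(.+)")


def read_checkpoint_settings(config_text):
    """Return the active (uncommented) checkpoint fields found in a pipeline config."""
    settings = {}
    for raw_line in config_text.splitlines():
        # Simplification: drop everything after '#', so commented-out fields are ignored.
        line = raw_line.split("#", 1)[0].strip()
        match = _FIELD.match(line)
        if match:
            # Strip surrounding whitespace and double quotes from the value.
            settings[match.group(1)] = match.group(2).strip().strip('"')
    return settings


if __name__ == "__main__":
    snippet = (
        '#fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"\n'
        "from_detection_checkpoint: false\n"
    )
    print(read_checkpoint_settings(snippet))
```

With the snippet quoted above, only from_detection_checkpoint is reported, since the fine_tune_checkpoint line is commented out.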

hedeya1980 commented 6 years ago

Hi @lulu12132017 ,

Now, I get the following errors:

EDIT: (robieta) Moved full output to a separate file obj_detection_output3.txt

INFO:tensorflow:Error reported to Coordinator: assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [1]

...

InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [1] [[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All/_133, Loss/RPNLoss/assert_equal/Assert/Assert/data_0, Loss/RPNLoss/assert_equal/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x/_135, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/RPNLoss/ones_1/shape/_137)]] [[Node: FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_3/Conv2d_0b_1x1/BatchNorm/beta/read/_305 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2367_FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_3/Conv2d_0b_1x1/BatchNorm/beta/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

hedeya1980 commented 6 years ago

Hi @lulu12132017 & @robieta,

I really need your help to get a solution for this, because I need to use the tensorflow object detection API in my master's project.

robieta commented 6 years ago

I'm going to close this and refer you to the tensorflow StackOverflow, as this appears to be a configuration issue rather than a clear bug in the object detection code.

If you think we've misinterpreted a bug, please comment again with a clear explanation, as well as all of the information requested in the issue template. Thanks!

SarvMangal commented 6 years ago

Although robieta closed the issue, the solution isn't available anywhere. There are multiple reports of this error, with no indication of what the configuration problem is or how to actually solve it. Please help.

hedeya1980 commented 6 years ago

Hi @SarvMangal, I agree with you. We need an actual way of solving this. Even after I followed @robieta's advice and posted on Stack Overflow, I haven't received any replies yet. Here is my Stack Overflow post: https://stackoverflow.com/questions/50009709/assertion-failed-error-when-using-tensorflow-object-detection-api-to-fine-tune-t

SarvMangal commented 6 years ago

Isn't there any way of reopening this thread? Otherwise, I will open a new issue with all the required details.

Even if it is a configuration issue, the documentation is just not enough to help us solve the problem.


lulu12132017 commented 6 years ago

When you convert the MIO-TCD dataset to TFRecord, you should set the include_masks parameter like this: --include_masks=True. You can try that.


hedeya1980 commented 6 years ago

Hi @lulu12132017, thanks for your reply. However, could you please clarify the following:

Abduoit commented 6 years ago

I have the same issue. I created TFRecord files using create_pet_tf_record.py, and now I am trying to train on my dataset with mask_rcnn, but I am getting the same error. Are there any new suggestions, please?

Abduoit commented 6 years ago

@hedeya1980 I could not post my answer to your question on Stack Overflow.

I had this problem and solved it as follows:

The TFRecord files should be named pet_train/val.record. I changed this by setting faces_only from True to False.

Check the line here: https://github.com/tensorflow/models/blob/master/research/object_detection/dataset_tools/create_pet_tf_record.py#L49

Then I regenerated the TFRecord files with this command:

python object_detection/dataset_tools/create_pet_tf_record.py \
    --label_map_path=object_detection/data/two_label_map.pbtxt \
    --data_dir=`pwd` \
    --output_dir=`pwd` \
    --include_masks=True

This gave me two TFRecord files named pet_train/val.record, which I then used for training with mask_rcnn_inception_v2_coco.

Hope this helps
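For anyone who prefers to drive the conversion from Python rather than the shell, the same invocation can be assembled as an argument list. This is only a sketch: the function name `build_record_command` is hypothetical, the flags are the ones quoted in the comment above, and it assumes the script is run from the models/research directory.

```python
import os
import sys


def build_record_command(data_dir, output_dir, label_map_path, include_masks=True):
    """Return the argv list for create_pet_tf_record.py using the flags quoted above."""
    return [
        sys.executable,
        "object_detection/dataset_tools/create_pet_tf_record.py",
        "--label_map_path=" + label_map_path,
        "--data_dir=" + data_dir,
        "--output_dir=" + output_dir,
        # str(True) renders as "True", matching --include_masks=True in the comment above.
        "--include_masks=" + str(include_masks),
    ]


if __name__ == "__main__":
    cwd = os.getcwd()  # mirrors the `pwd` used in the shell command
    cmd = build_record_command(cwd, cwd, "object_detection/data/two_label_map.pbtxt")
    print(" ".join(cmd))
    # To actually run the conversion (assumption: executed from models/research):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the data directory contains spaces.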

Abduoit commented 6 years ago

I have this issue only when I use TFRecord files generated by create_pascal_tf_record.py. I don't have it when I use TFRecord files generated by create_pet_tf_record.py, as I mentioned earlier. Is there any update?

wxianfeng commented 6 years ago

When I set faces_only from True to False, it was solved.

What does faces_only mean?

erdag commented 6 years ago

I am still getting the error below. Has anybody figured this out yet?

NotFoundError (see above for traceback): Key Conv/biases/Momentum not found in checkpoint [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

leccyril commented 6 years ago

faces_only means boxes are placed only on faces, not on the whole body, and no segmentation is produced.