tensorflow / models

Models and examples built with TensorFlow

Cannot train the mask-rcnn models #3913

Closed sjwhhhi closed 4 years ago

sjwhhhi commented 6 years ago

I want to train a Mask R-CNN model on my own dataset. I used create_pascal_tf_record.py to convert it to TFRecord format. However, training fails with this error:

2018-04-09 13:52:34.408287: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****
2018-04-09 13:52:34.408300: W tensorflow/core/framework/op_kernel.cc:1198] Resource exhausted: OOM when allocating tensor with shape[4,160,56,67] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/sjw/models/object_detection/trainer.py", line 370, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 782, in train
    ignore_live_threads=ignore_live_threads)
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 998, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 826, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 387, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 295, in stop_on_exception
    yield
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 492, in run
    self.run_loop()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1028, in run_loop
    self._sv.global_step])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [1]
     [[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All/_2119, Loss/BoxClassifierLoss/assert_equal_1/Assert/Assert/data_0, Loss/BoxClassifierLoss/assert_equal_1/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x/_2121, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/BoxClassifierLoss/ones_1/shape/_129)]]

Caused by op u'Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert', defined at:
  File "train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/sjw/models/object_detection/trainer.py", line 246, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/home/sjw/models/slim/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/home/sjw/models/object_detection/trainer.py", line 181, in _create_losses
    losses_dict = detection_model.loss(prediction_dict, true_image_shapes)
  File "/home/sjw/models/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1580, in loss
    groundtruth_masks_list,
  File "/home/sjw/models/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1813, in _loss_box_classifier
    groundtruth_boxlists, groundtruth_masks_list)
  File "/home/sjw/models/object_detection/core/target_assigner.py", line 447, in batch_assign_targets
    anchors, gt_boxes, gt_class_targets, gt_weights)
  File "/home/sjw/models/object_detection/core/target_assigner.py", line 151, in assign
    groundtruth_boxes.get())[:1])
  File "/home/sjw/models/object_detection/utils/shape_utils.py", line 279, in assert_shape_equal
    return tf.assert_equal(shape_a, shape_b)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/check_ops.py", line 392, in assert_equal
    return control_flow_ops.Assert(condition, data, summarize=summarize)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 118, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 169, in Assert
    condition, data, summarize, name="Assert")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 48, in _assert
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [1]
     [[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All/_2119, Loss/BoxClassifierLoss/assert_equal_1/Assert/Assert/data_0, Loss/BoxClassifierLoss/assert_equal_1/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x/_2121, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/BoxClassifierLoss/ones_1/shape/_129)]]

My tensorflow-gpu version is 1.5, on Ubuntu 16. Could anyone help me? Thanks.

hedeya1980 commented 6 years ago

I'm facing the same error, but on Windows, and I really need help solving it.

priya-dwivedi commented 6 years ago

Same here. Training on Mask RCNN Inception on Ubuntu using Python 2.7 and Tensorflow 1.5. Get the same error.

SarvMangal commented 6 years ago

https://github.com/tensorflow/models/issues/3972 also describes the same problem, as the assertion is seen in the trace here too.

The error does not occur if fast_rcnn is used with the same configuration file, but we are trying to get Mask R-CNN working, not Fast R-CNN.

arpita-saha commented 6 years ago

@priya-dwivedi: I get the same error. I am following your blog post Custom_Mask_Rcnn to train the model and got stuck on this error. Were you able to fix it? If so, please guide me through it.

Abduoit commented 6 years ago

I had this problem and solved it as follows:

The TFRecord files should be named pet_train/val.record. I got that by changing faces_only from True to False.

Check the line here: https://github.com/tensorflow/models/blob/master/research/object_detection/dataset_tools/create_pet_tf_record.py#L49

Then I regenerated the TFRecord files with this command:

python object_detection/dataset_tools/create_pet_tf_record.py \
    --label_map_path=object_detection/data/two_label_map.pbtxt \
    --data_dir=`pwd` \
    --output_dir=`pwd` \
    --include_masks=True

That gave me two TFRecord files named pet_train/val.record, which I then used for training with mask_rcnn_inception_v2_coco.

Hope this helps

erdag commented 6 years ago

Getting same error, any update on this?

quxiaofeng commented 6 years ago

This error is caused by the data.

InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [1]

This line is trying to find the contour of the mask. An error here probably means the mask is not included or at least not found.

The root cause may lie in how the TFRecord was created.

You need to check create_tf_record.py manually, line by line, for errors. My guess would be the string constants.

flags = tf.app.flags
flags.DEFINE_string('data_dir', '', 'Root directory to raw pet dataset.')
flags.DEFINE_string('output_dir', '', 'Path to directory to output TFRecords.')
flags.DEFINE_string('label_map_path', 'data/pet_label_map.pbtxt',
                    'Path to label map proto')
flags.DEFINE_boolean('faces_only', False, 'If True, generates bounding boxes '
                     'for pet faces.  Otherwise generates bounding boxes (as '
                     'well as segmentations for full pet bodies).  Note that '
                     'in the latter case, the resulting files are much larger.')
flags.DEFINE_string('mask_type', 'png', 'How to represent instance '
                    'segmentation masks. Options are "png" or "numerical".')
FLAGS = flags.FLAGS

Then train with this fix applied to the config: https://github.com/tensorflow/models/pull/4462/commits/e45234e32dbc485f74567f6c0297edc9c084677c

This fix tells the trainer to read in PNG masks.

Please let me know if the fix works. PR #4462 is still pending.

SpiralBeing commented 6 years ago

@quxiaofeng Unfortunately I've run into another issue now. I changed faces_only from True to False and got the files pet_train/val.record. I also smoothed the images (data) to remove noise. I tried training with them and got something like this: [images: default, rrr, r3, 2]

I can also get something like this: [image: 2]

It's awful...

SpiralBeing commented 6 years ago

@Abduoit Could you please explain why several TFRecords are needed here (https://github.com/tensorflow/models/blob/master/research/object_detection/dataset_tools/create_pet_tf_record.py#L49): flags.DEFINE_integer('num_shards', 10, 'Number of TFRecord shards')? I used only one train.record and one val.record (my create_pet_tf_record.py doesn't include the line flags.DEFINE_integer('num_shards', 10, 'Number of TFRecord shards')). Is it crucial to have several records for train and for val (test)? P.S. faces_only = False

nonbackground_indices_x = np.any(mask_np == 1, axis=0)
nonbackground_indices_y = np.any(mask_np == 1, axis=1)

Please answer

quxiaofeng commented 6 years ago

@FreedoomFighter It is just an Out of Memory error. Your GPU does not have enough memory for this model.
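For a sense of scale, the tensor named in the OOM warning of the original report (shape [4,160,56,67], type float) is small on its own. A rough calculation, assuming 4 bytes per float32 element (`tensor_bytes` is a hypothetical helper, not a TensorFlow function):

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Return the memory needed for a dense tensor (float32 by default)."""
    return reduce(mul, shape, 1) * bytes_per_element


shape = [4, 160, 56, 67]  # shape reported in the OOM warning of the first post
size = tensor_bytes(shape)
print(size, "bytes, about", round(size / 2**20, 2), "MiB")
```

An allocation of under 10 MiB failing suggests the GPU memory was already almost full when this tensor was requested, so the usual things to try are a smaller batch size or smaller input images.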

SpiralBeing commented 6 years ago

@quxiaofeng Thank you. I wish you happiness

SpiralBeing commented 6 years ago

@quxiaofeng One last question, please. I've trained the model (during training I could see that both box and mask were being trained), but in the Jupyter notebook I can only see the box, without the mask. What is wrong? I did everything as I described above (two comments back).

quxiaofeng commented 6 years ago

If you run the evaluation correctly, you should see the masked result images.

Another possibility is that the output you are using does not include the mask. You could inspect the output tensor, or the corresponding module in the graph, to see exactly what data comes out.
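One way to run that check, sketched under the assumption that inference returns a dictionary keyed like the one in the Object Detection API demo notebook (`detection_boxes`, `detection_masks`, and so on). `missing_outputs` is a hypothetical helper, not part of the API:

```python
EXPECTED_KEYS = ("detection_boxes", "detection_scores",
                 "detection_classes", "detection_masks")


def missing_outputs(output_dict, expected=EXPECTED_KEYS):
    """Return the expected output names that the model did not produce."""
    return [key for key in expected if key not in output_dict]


# Stand-in for a real inference result that has boxes but no masks.
output_dict = {"detection_boxes": [], "detection_scores": [],
               "detection_classes": []}
print(missing_outputs(output_dict))
```

If `detection_masks` shows up as missing, the exported graph is not producing masks at all, and the visualization code is not the place to look.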

leccyril commented 6 years ago

Hi,

I have some issues. First, the mask is not displayed. I use create_pet_tf_record.py (with faces_only set to False), and in train-?????-of-00010.record I have image/object/mask in all TFRecords (I think that is good).

But when I launch training, I can only see the bounding boxes and not the mask.

I have mask/images (jpg), mask/annotations/trimaps (png), mask/annotations/xmls (xml), mask/annotation/trainval.txt, and mask/test_images.

When I launch detection or eval I don't see masks. Why? And there is no mask graph.

Where is the mask graph? In the train TensorBoard or the eval TensorBoard? Can we launch TensorBoard with both train and eval data in one instance?

Please, I really need help... 2 weeks of training with no mask data.

leccyril commented 6 years ago

I'm going crazy because I don't see the mask in evaluation... Do we have to launch a specific command to add the mask in train? Evaluation? Detection?

leccyril commented 6 years ago

@quxiaofeng I cannot see the mask in train or eval, with all configuration done as advised.

arpita-saha commented 6 years ago

@leccyril Check whether the mask information is actually being saved in the .record file. If it is, attach your create_tf_record file; I would like to look at it.
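A crude way to make that check without TensorFlow, assuming the file is an uncompressed TFRecord (each entry framed as an 8-byte little-endian length, a 4-byte CRC, the payload, and another 4-byte CRC) and that the mask feature key is `image/object/mask` as mentioned in this thread. The function names are hypothetical, and the CRCs are skipped rather than verified:

```python
import io
import struct

MASK_KEY = b"image/object/mask"  # feature key mentioned in this thread


def iter_records(stream):
    """Yield raw payloads from an uncompressed TFRecord byte stream."""
    while True:
        header = stream.read(12)  # 8-byte length + 4-byte CRC of the length
        if len(header) < 12:
            return
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        stream.read(4)  # trailing CRC of the payload, not verified here
        yield payload


def count_records_with_mask(stream, key=MASK_KEY):
    """Return (records whose bytes contain the key, total records)."""
    total = 0
    with_mask = 0
    for payload in iter_records(stream):
        total += 1
        if key in payload:
            with_mask += 1
    return with_mask, total


def fake_record(payload):
    """Frame a payload like a TFRecord entry, using dummy CRC bytes."""
    return struct.pack("<Q", len(payload)) + b"\0" * 4 + payload + b"\0" * 4


# In-memory demo; for a real file pass open(path, "rb") instead.
demo = io.BytesIO(fake_record(b"..image/object/mask..") +
                  fake_record(b"..image/encoded.."))
print(count_records_with_mask(demo))
```

This only shows that the key is present in a record. A key with an empty value would still match, so a positive result is necessary but not sufficient.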

leccyril commented 6 years ago

Hi, yes, I've attached the files; the mask object path is in there! num_shard is 10, so I attached only the first files.

I don't know what to do. I don't want to switch to another approach to make it work (Matterport or the COCO dataset)...

tests.zip

arpita-saha commented 6 years ago

Check whether your config file is correct. To get masks you have to use this. Also, I asked for the script that you used to create the .record files.

leccyril commented 6 years ago

OK, I will do this. I use exactly the same file: sauv.zip

It includes sample xml/png/jpg files.

I really want to thank you, because I am lost...

leccyril commented 6 years ago

I pulled the new layout of tensorflow/models, with eval and train in the legacy folder, and tried with the pet dataset sample... and the problem still occurs. No mask is displayed!

This is a great tool; I think I am just missing one configuration setting or parameter...

One detail: this is installed on Debian 9 with pip3, Python 3.5, and the CPU-only build.

arpita-saha commented 6 years ago

In your mask_rcnn_inception_v2.config file there is one line missing: 'number_of_stages: 3'. This line enables the mask processing. Check the link: link
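A quick way to check a pipeline config for the setting named in the comment above. This is a text-level sketch, not a real protobuf parser. The sample config is abbreviated and illustrative, and `number_of_stages` here is a hypothetical helper that merely shares the field's name:

```python
import re


def number_of_stages(config_text):
    """Return the number_of_stages value in a pipeline config, or None."""
    # Drop '#' comments so a commented-out setting is not counted.
    uncommented = "\n".join(line.split("#", 1)[0]
                            for line in config_text.splitlines())
    match = re.search(r"\bnumber_of_stages\s*:\s*(\d+)", uncommented)
    return int(match.group(1)) if match else None


# Abbreviated, illustrative config text; a real file has many more fields.
sample = """
model {
  faster_rcnn {
    num_classes: 1
    number_of_stages: 3
  }
}
"""
print(number_of_stages(sample))
print(number_of_stages("model { faster_rcnn { num_classes: 1 } }"))
```

A result of `None` means the line is absent, which is the situation described in the comment above.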

leccyril commented 6 years ago

I will try it now and tell you later today. Do you think this one setting alone can make the difference?

Do you know what the difference is between having only one record file (1 shard) and having 10 (10 shards)?

Thank you very much for your time.
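On the shard question: judging by the `num_shards` flag quoted earlier in this thread, sharding appears only to spread the same examples over several files, without changing what is stored for each example. A sketch of round-robin assignment, assuming a file-name pattern like the `train-?????-of-00010.record` mentioned above (both helper names are hypothetical):

```python
def shard_name(base, index, num_shards):
    """Build a shard file name such as train-00003-of-00010.record."""
    return "{}-{:05d}-of-{:05d}.record".format(base, index, num_shards)


def assign_shards(examples, base, num_shards):
    """Distribute examples round-robin; return {file name: [examples]}."""
    shards = {shard_name(base, i, num_shards): [] for i in range(num_shards)}
    for position, example in enumerate(examples):
        name = shard_name(base, position % num_shards, num_shards)
        shards[name].append(example)
    return shards


shards = assign_shards(["img{}".format(i) for i in range(25)], "train", 10)
for name in sorted(shards):
    print(name, len(shards[name]))
```

With `num_shards` set to 1 everything lands in a single file, which corresponds to the one-file setup described earlier. With more shards, the training config has to point at all of them rather than only the first.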

leccyril commented 6 years ago

Wonderful, I can see the mask in the first evaluation step. How did I miss this setting?

Hats off!

leccyril commented 6 years ago

The mask is green. I created my masks in yellow and as outlines only, because I only want to see the shape outline. Do you know how I can do that? Isn't it applied automatically when we add the masks we created before?

Thank you very much. I spent several days getting this to work...

leccyril commented 6 years ago

I don't see the mask loss graph in the eval TensorBoard. Is that normal? Thanks.

arpita-saha commented 6 years ago

No, it's not normal. You should get something in the mask loss graph.

leccyril commented 6 years ago

How can I mix train data and eval data in TensorBoard? It is strange: I see the bounding boxes completely filled by the mask, but not the PNG I specified as the mask, and in TensorBoard there is no mask_loss when I launch the eval.py script. Any idea? My files and configuration seem to be OK? Was only the stage_evaluation missing?

I think a problem remains, because in addition there is no mask loss... the bounding boxes are completely filled with green, which is not really a mask.

leccyril commented 6 years ago

OK, it works, I see the loss... but the mask covers the bounding box entirely.

gulingfengze commented 6 years ago

@leccyril You may need to refer to the mask format used by the Oxford-IIIT Pet dataset.
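One possible reason for a mask that fills the whole box, assuming the conversion script follows the Oxford-IIIT Pet trimap convention (1 = foreground, 2 = background, 3 = not classified) and treats every pixel that is not 2 as object: a PNG that encodes the mask with other values, for example 0 and 255, then contains no background at all. A small list-based sketch without NumPy (`remap_trimap` and `foreground_fraction` are hypothetical helpers):

```python
BACKGROUND = 2  # background value in Oxford-IIIT Pet trimaps (assumed here)


def remap_trimap(mask):
    """Map every non-background pixel to 1 and background to 0."""
    return [[0 if pixel == BACKGROUND else 1 for pixel in row] for row in mask]


def foreground_fraction(binary_mask):
    """Return the share of pixels marked as foreground."""
    pixels = [pixel for row in binary_mask for pixel in row]
    return sum(pixels) / len(pixels)


trimap_style = [[2, 2, 2, 2],
                [2, 1, 3, 2],
                [2, 1, 1, 2]]
zero_255_style = [[0, 0, 0, 0],
                  [0, 255, 255, 0],
                  [0, 255, 255, 0]]

print(foreground_fraction(remap_trimap(trimap_style)))    # part of the image
print(foreground_fraction(remap_trimap(zero_255_style)))  # the whole image
```

If the masks were drawn with a different encoding, either convert them to the trimap values before building the TFRecords or adjust the comparison in the conversion script accordingly.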

tensorflowbutler commented 4 years ago

Hi There, We are checking to see if you still need help on this, as this seems to be considerably old issue. Please update this issue with the latest information, code snippet to reproduce your issue and error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.

faridomarzadeh commented 4 years ago

Getting the same error:

INFO:tensorflow:Error reported to Coordinator: assertion failed: [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_3/x:0) = ] [1067 800] [y (Loss/BoxClassifierLoss/assert_equal_3/y:0) = ] [800 1067]
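In this later report the two sides of the assertion are [1067 800] and [800 1067], the same two numbers in opposite order. That pattern would be consistent with height and width being swapped somewhere between the image and its mask or annotation, although the log alone does not prove it. A minimal pre-flight check, assuming shapes are available as (height, width) pairs; `compare_shapes` is a hypothetical helper:

```python
def compare_shapes(image_hw, mask_hw):
    """Classify how a mask shape relates to its image shape."""
    if tuple(image_hw) == tuple(mask_hw):
        return "match"
    if tuple(image_hw) == tuple(reversed(tuple(mask_hw))):
        return "transposed"  # same numbers, height and width swapped
    return "mismatch"


print(compare_shapes((1067, 800), (800, 1067)))
print(compare_shapes((1067, 800), (1067, 800)))
```

Running a check like this over every image and mask pair before writing the TFRecords would surface rotated or transposed inputs early, instead of as an assertion failure during training.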