tryolabs / luminoth

Deep Learning toolkit for Computer Vision.
https://tryolabs.com
BSD 3-Clause "New" or "Revised" License
2.4k stars, 399 forks

Why do I always get an "Incompatible shapes: [0,4] vs. [5,4]" error? #151

Closed Phillylyu closed 6 years ago

Phillylyu commented 6 years ago

I am using Luminoth to train a model on ImageNet. The config is: type fasterrcnn, architecture resnet_v1_101. When I try to train, I always get the following error, and lumi train quits.

W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [0,4] vs. [5,4]
  [[Node: losses/RCNNLoss/sub_1 = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](losses/RCNNLoss/bbox_offset_cleaned/Gather, losses/RCNNLoss/bbox_offsets_target_labeled/Gather)]]
Traceback (most recent call last):
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [0,4] vs. [5,4]
  [[Node: losses/RCNNLoss/sub_1 = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](losses/RCNNLoss/bbox_offset_cleaned/Gather, losses/RCNNLoss/bbox_offsets_target_labeled/Gather)]]
  [[Node: Momentum/update/_13924 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14102_Momentum/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/tflm/bin/lumi", line 11, in <module>
    load_entry_point('luminoth', 'console_scripts', 'lumi')()
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/root/luminoth/luminoth/train.py", line 249, in train
    config, environment=environment
  File "/root/luminoth/luminoth/train.py", line 181, in run
    ], options=run_options)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 521, in run
    run_metadata=run_metadata)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 892, in run
    run_metadata=run_metadata)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 967, in run
    raise six.reraise(*original_exc_info)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 952, in run
    return self._sess.run(*args, **kwargs)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1024, in run
    run_metadata=run_metadata)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 827, in run
    return self._sess.run(*args, **kwargs)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [0,4] vs. [5,4]
  [[Node: losses/RCNNLoss/sub_1 = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](losses/RCNNLoss/bbox_offset_cleaned/Gather, losses/RCNNLoss/bbox_offsets_target_labeled/Gather)]]
  [[Node: Momentum/update/_13924 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14102_Momentum/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'losses/RCNNLoss/sub_1', defined at:
  File "/root/anaconda3/envs/tflm/bin/lumi", line 11, in <module>
    load_entry_point('luminoth', 'console_scripts', 'lumi')()
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/root/luminoth/luminoth/train.py", line 249, in train
    config, environment=environment
  File "/root/luminoth/luminoth/train.py", line 66, in run
    total_loss = model.loss(prediction_dict)
  File "/root/luminoth/luminoth/models/fasterrcnn/fasterrcnn.py", line 188, in loss
    prediction_dict['classification_prediction']
  File "/root/luminoth/luminoth/models/fasterrcnn/rcnn.py", line 370, in loss
    sigma=self._l1_sigma
  File "/root/luminoth/luminoth/utils/losses.py", line 22, in smooth_l1_loss
    diff = bbox_prediction - bbox_target
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
    return func(x, y, name=name)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4636, in _sub
    "Sub", x=x, y=y, name=name)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/root/anaconda3/envs/tflm/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Incompatible shapes: [0,4] vs. [5,4]
  [[Node: losses/RCNNLoss/sub_1 = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](losses/RCNNLoss/bbox_offset_cleaned/Gather, losses/RCNNLoss/bbox_offsets_target_labeled/Gather)]]
  [[Node: Momentum/update/_13924 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14102_Momentum/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
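
For readers unsure why the subtraction `bbox_prediction - bbox_target` in the trace fails: elementwise operations need shapes whose dimensions are either equal or 1, and neither [0,4] vs. [5,4] nor [15,4] vs. [16,4] meets that rule. The following is a minimal sketch of that compatibility rule in plain Python; the function name `broadcastable` is made up for illustration and is not part of TensorFlow or Luminoth.

```python
def broadcastable(shape_a, shape_b):
    """Return True if two shapes can be combined elementwise under broadcasting."""
    # Compare dimensions from the trailing axis backwards; a pair is
    # compatible only if the sizes are equal or one of them is 1.
    for dim_a, dim_b in zip(reversed(shape_a), reversed(shape_b)):
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    return True


# Shapes taken from the error messages in this thread.
print(broadcastable((0, 4), (5, 4)))    # False: 0 rows vs. 5 rows
print(broadcastable((15, 4), (16, 4)))  # False: 15 rows vs. 16 rows
print(broadcastable((5, 4), (5, 4)))    # True: identical shapes
```

In the first error one side of the subtraction has zero rows, which suggests that one of the two gathered tensors ended up empty for that image.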

vierja commented 6 years ago

Hi @Phillylyu, can you share the config file you are using?

Phillylyu commented 6 years ago

@vierja Thanks for your response. It may have been a config mistake. I had limited the number of classes to 2, and some pictures do not have annotations. I increased the number of classes, and now it is OK. Thank you.
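
A check along these lines can be run before training to spot images that end up with no usable annotations once the class list is restricted. This is only a sketch under assumptions: the in-memory `annotations` dictionary, the `label` key, and the function name `images_without_usable_boxes` are hypothetical and do not reflect Luminoth's dataset format.

```python
def images_without_usable_boxes(annotations, allowed_labels):
    """Return image names that have no box whose label is in allowed_labels."""
    allowed = set(allowed_labels)
    problem_images = []
    for image_name, boxes in sorted(annotations.items()):
        # Keep only the boxes whose label survives the class restriction.
        usable = [box for box in boxes if box["label"] in allowed]
        if not usable:
            problem_images.append(image_name)
    return problem_images


# Hypothetical annotations: image name -> list of labeled boxes.
example = {
    "a.jpg": [{"label": "cat", "box": (10, 10, 50, 50)}],
    "b.jpg": [],  # no annotations at all
    "c.jpg": [{"label": "bird", "box": (0, 0, 20, 20)}],  # label outside the subset
}
print(images_without_usable_boxes(example, ["cat", "dog"]))  # ['b.jpg', 'c.jpg']
```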

arun-kumark commented 6 years ago

Hi, I am facing the same issue. I checked the number of classes inside sample_config.yml. Here is the file:

train:
  # Directory in which model checkpoints & summaries (for Tensorboard) will be saved
  job_dir: jobs/
  debug: True

dataset:
  type: object_detection
  # From which directory to read the dataset
  dir: dataset/TFRecords

model:
  type: fasterrcnn
  network:
    # Total number of classes to predict
    num_classes: 9

  base_network:
    # Which type of pretrained network to use
    architecture: resnet_v1_101
    # Should we train the pretrained network
    trainable: True
    # Should we download weights if not available
    download: True

But even then I am always facing the same issue as below:

INFO:tensorflow:step: 36, file: b'205.jpg', train_loss: 99.34402465820312, in 0.34s
INFO:tensorflow:step: 37, file: b'3150.jpg', train_loss: 195.69007873535156, in 0.35s
INFO:tensorflow:step: 38, file: b'309.jpg', train_loss: 122.05986022949219, in 0.36s
INFO:tensorflow:step: 39, file: b'230.jpg', train_loss: 215.76116943359375, in 0.35s
Traceback (most recent call last):
  File "/home/arun/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/arun/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/arun/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [15,4] vs. [16,4]

Please help me find what else I need to change.

Kind Regards Arun

Phillylyu commented 6 years ago

I think you should check the pictures and annotations to make sure they match.
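
One way to check that boxes match their picture is to flag any box that has no area or extends past the image. The thread does not say what exactly was wrong with the annotation files, so this is an illustrative sketch only; the function name `invalid_boxes` and the `(xmin, ymin, xmax, ymax)` pixel layout are assumptions.

```python
def invalid_boxes(image_width, image_height, boxes):
    """Return the boxes that are degenerate or fall outside the image."""
    bad = []
    for box in boxes:
        xmin, ymin, xmax, ymax = box
        # The box must lie entirely within the image bounds.
        inside = 0 <= xmin and 0 <= ymin and xmax <= image_width and ymax <= image_height
        # The box must have positive width and height.
        has_area = xmin < xmax and ymin < ymax
        if not (inside and has_area):
            bad.append(box)
    return bad


# Hypothetical 100x80 image with one valid and two suspicious boxes.
boxes = [(10, 10, 50, 50), (60, 20, 40, 70), (0, 0, 120, 30)]
print(invalid_boxes(100, 80, boxes))  # [(60, 20, 40, 70), (0, 0, 120, 30)]
```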

arun-kumark commented 6 years ago

Thank you, Phillylyu. The problem was solved some time back. Yes, you are right: some of the annotation files had problems.