tryolabs / luminoth

Deep Learning toolkit for Computer Vision.
https://tryolabs.com
BSD 3-Clause "New" or "Revised" License
2.4k stars 400 forks

Incompatible shapes error #227

Open lexical opened 5 years ago

lexical commented 5 years ago

Hi, I am trying to train on OpenImages V4 with 600 classes. Training stopped with the following error. I wonder whether this error comes from Luminoth. Any suggestions for fixing it?

INFO:tensorflow:step: 1986, file: 0012c270e7a0d8e9, train_loss: 7.52168273926, in 15.08s
Traceback (most recent call last):
  File "/root/venv2/bin/lumi", line 11, in <module>
    sys.exit(cli())
  File "/root/venv2/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/root/venv2/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/root/venv2/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/venv2/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/venv2/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/root/venv2/local/lib/python2.7/site-packages/luminoth/train.py", line 307, in train
    config, environment=environment
  File "/root/venv2/local/lib/python2.7/site-packages/luminoth/train.py", line 239, in run
    ], options=run_options)  
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
    run_metadata=run_metadata)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1239, in run
    raise six.reraise(*original_exc_info)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
    return self._sess.run(*args, **kwargs)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1296, in run
    run_metadata=run_metadata)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
    return self._sess.run(*args, **kwargs)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 887, in run
    run_metadata_ptr)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1110, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1286, in _do_run
    run_metadata)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1308, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [3,4] vs. [9,4]
         [[{{node losses/RCNNLoss/sub_1}} = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](losses/RCNNLos
atherV2)]]

Caused by op u'losses/RCNNLoss/sub_1', defined at:
  File "/root/venv2/bin/lumi", line 11, in <module>
    sys.exit(cli())
  File "/root/venv2/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/root/venv2/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/root/venv2/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/venv2/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/venv2/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/root/venv2/local/lib/python2.7/site-packages/luminoth/train.py", line 307, in train
    config, environment=environment
  File "/root/venv2/local/lib/python2.7/site-packages/luminoth/train.py", line 67, in run
    total_loss = model.loss(prediction_dict)
  File "/root/venv2/local/lib/python2.7/site-packages/luminoth/models/fasterrcnn/fasterrcnn.py", line 192, in loss
    prediction_dict['classification_prediction']
  File "/root/venv2/local/lib/python2.7/site-packages/luminoth/models/fasterrcnn/rcnn.py", line 391, in loss
    sigma=self._l1_sigma
  File "/root/venv2/local/lib/python2.7/site-packages/luminoth/utils/losses.py", line 22, in smooth_l1_loss
    diff = bbox_prediction - bbox_target
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 862, in binary_op_wrapper
    return func(x, y, name=name)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 8318, in sub
    "Sub", x=x, y=y, name=name)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helpe
    op_def=op_def)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/root/venv2/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Incompatible shapes: [3,4] vs. [9,4]
         [[{{node losses/RCNNLoss/sub_1}} = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](losses/RCNNLos
atherV2)]]

(venv2) root@3e5d7b5d1a41:~#
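
For readers unfamiliar with the message: `smooth_l1_loss` subtracts two tensors element-wise, and that only works when their shapes can be broadcast together. The sketch below is plain Python 3 written for illustration, not code from Luminoth or TensorFlow. The helper names `broadcast_shape` and `try_broadcast` are hypothetical. It applies the usual broadcasting rule (compare dimensions from the last axis backwards; two sizes are compatible when they are equal or one of them is 1) to the shapes reported in this thread.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast result shape of shapes a and b, or raise ValueError."""
    result = []
    # Walk both shapes from the trailing axis backwards, padding the shorter with 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            # Neither size is 1 and they differ: element-wise ops cannot proceed.
            raise ValueError("Incompatible shapes: %s vs. %s" % (list(a), list(b)))
    return tuple(reversed(result))


def try_broadcast(a, b):
    """Return the broadcast shape, or the error text if the shapes do not match."""
    try:
        return broadcast_shape(a, b)
    except ValueError as exc:
        return str(exc)


# A single row broadcasts against nine rows.
print(try_broadcast((1, 4), (9, 4)))
# The shapes from the first report: 3 predictions vs. 9 targets.
print(try_broadcast((3, 4), (9, 4)))
# The shapes from the later report: 0 predictions vs. 25 targets.
print(try_broadcast((0, 4), (25, 4)))
```

In both reported cases the second axis (4 box coordinates) agrees and only the row count differs, which suggests that the predictions and the targets were filtered down to different numbers of boxes before the subtraction.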

sbochkar commented 5 years ago

Does the number of classes specified in the config match the number of classes in your dataset?
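
One way to check this is to count the distinct labels in the annotations and compare that count with the configured value. This is a minimal sketch in plain Python 3. The function name `check_num_classes`, the toy label list, and the configured value are all made up for illustration; how the labels and the configured class count are actually read depends on your dataset format and config file.

```python
def check_num_classes(configured_num_classes, labels):
    """Compare a configured class count with the distinct labels seen in a dataset."""
    distinct = sorted(set(labels))
    return {
        "configured": configured_num_classes,
        "found": len(distinct),
        "matches": configured_num_classes == len(distinct),
    }


# Toy data (hypothetical): one label per annotated bounding box.
labels = ["cat", "dog", "cat", "person", "dog"]

report = check_num_classes(3, labels)
print(report)
```

If `matches` is false, the config and the dataset disagree and should be reconciled before training.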

keng-yu commented 5 years ago

Yes, it matches; otherwise there would be a different error. I also tried the COCO 2017 train dataset, and the same InvalidArgumentError occurs there too.

kshitijagrwl commented 5 years ago

I get a similar error with train, evaluate, and predict on the COCO 2017 and COCO 2014 datasets. I have PyTorch 0.4 installed.

r$ lumi train -c train.yml
INFO:tensorflow:Training 279 vars from pretrained module; from "truncated_base_network/resnet_v1_101/block2/unit_1/bottleneck_v1/shortcut/weights:0" to "truncated_base_network/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/beta:0".
/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Starting training for <luminoth.models.fasterrcnn.fasterrcnn.FasterRCNN object at 0x7f0c69339b10>
WARNING:tensorflow:From /home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/util/tf_should_use.py:118: initialize_local_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.local_variables_initializer instead.
INFO:tensorflow:ImageVisHook was created with mode = "train"
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-12-04 04:23:59.919368: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2018-12-04 04:24:02.272545: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:81:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-12-04 04:24:02.647077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:9e:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-12-04 04:24:02.649202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1
2018-12-04 04:24:03.581040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-04 04:24:03.581113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0 1
2018-12-04 04:24:03.581133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N Y
2018-12-04 04:24:03.581143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1: Y N
2018-12-04 04:24:03.581729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10413 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1)
2018-12-04 04:24:03.810598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10413 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:9e:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /home/ml/.luminoth/resnet_v1_101.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Traceback (most recent call last):
  File "/home/ml/anaconda3/envs/plt-tf-py2/bin/lumi", line 11, in <module>
    sys.exit(cli())
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/luminoth/train.py", line 307, in train
    config, environment=environment
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/luminoth/train.py", line 239, in run
    ], options=run_options)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 567, in run
    run_metadata=run_metadata)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1043, in run
    run_metadata=run_metadata)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1134, in run
    raise six.reraise(*original_exc_info)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1119, in run
    return self._sess.run(*args, **kwargs)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1191, in run
    run_metadata=run_metadata)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 971, in run
    return self._sess.run(*args, **kwargs)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [0,4] vs. [25,4]
         [[Node: losses/RCNNLoss/sub_1 = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](losses/RCNNLoss/bbox_offset_cleaned/GatherV2, losses/RCNNLoss/bbox_offsets_target_labeled/GatherV2)]]
         [[Node: fasterrcnn/rcnn/rcnn_proposal_1/GatherV2_16/_3769 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11223_fasterrcnn/rcnn/rcnn_proposal_1/GatherV2_16", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'losses/RCNNLoss/sub_1', defined at:
  File "/home/ml/anaconda3/envs/plt-tf-py2/bin/lumi", line 11, in <module>
    sys.exit(cli())
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/luminoth/train.py", line 307, in train
    config, environment=environment
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/luminoth/train.py", line 67, in run
    total_loss = model.loss(prediction_dict)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/luminoth/models/fasterrcnn/fasterrcnn.py", line 192, in loss
    prediction_dict['classification_prediction']
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/luminoth/models/fasterrcnn/rcnn.py", line 391, in loss
    sigma=self._l1_sigma
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/luminoth/utils/losses.py", line 22, in smooth_l1_loss
    diff = bbox_prediction - bbox_target
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 979, in binary_op_wrapper
    return func(x, y, name=name)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 8009, in sub
    "Sub", x=x, y=y, name=name)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/home/ml/anaconda3/envs/plt-tf-py2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Incompatible shapes: [0,4] vs. [25,4]
         [[Node: losses/RCNNLoss/sub_1 = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](losses/RCNNLoss/bbox_offset_cleaned/GatherV2, losses/RCNNLoss/bbox_offsets_target_labeled/GatherV2)]]
         [[Node: fasterrcnn/rcnn/rcnn_proposal_1/GatherV2_16/_3769 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11223_fasterrcnn/rcnn/rcnn_proposal_1/GatherV2_16", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

meyerjo commented 5 years ago

I think this is fixed by #261.