Problem with multi gpu (Integer division by zero)

5730279821-TA commented 6 years ago

I am trying to train on multiple GPUs with GPU_COUNT = 2, but training fails with the error "Integer division by zero". A clone of the code from 4 months ago works, but the current (updated) code does not.

[Screenshot: error_multi_gpu]

waleedka commented 6 years ago

Looking into it.

kirk86 commented 6 years ago

@5730279821-TA can you check whether #423 solves your problem?

hongzhili commented 6 years ago

@kirk86 Got the same error and #423 doesn't solve the issue.

hongzhili commented 6 years ago

Update: the problem was solved by upgrading TF from 1.3 to 1.7.

kirk86 commented 6 years ago

@hongzhili Indeed, I was using TF 1.7 when I had that problem, so I can't really comment on other TF versions. The solution I proposed was also tested with TF 1.7.

waleedka commented 6 years ago

Anyone using Python 2.7 by any chance? @gustavz reports that this causes a division by zero error here https://github.com/matterport/Mask_RCNN/issues/460

YubinXie commented 6 years ago

I am on Python 3.6 with TF 1.4 and I get this error too. (When I updated to TF 1.7 on an AWS Deep Learning instance, the CUDA libraries could not be found.)
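Since the reports in this thread differ mainly by Python, TF and Keras version, here is a minimal sketch for collecting those versions without importing TensorFlow itself. The helper name installed_versions is hypothetical, and it assumes Python 3.8+ (for importlib.metadata) and that the packages were installed under the distribution names listed.

```python
import sys
from importlib import metadata


def installed_versions(packages=("tensorflow", "tensorflow-gpu", "keras")):
    """Return a dict of version strings; None means the package is not installed."""
    versions = {"python": "%d.%d.%d" % sys.version_info[:3]}
    for name in packages:
        try:
            # Reads the installed distribution metadata; does not import the package.
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(name, version or "not installed")
```

Pasting this output into a report makes it easier to compare environments where the error does and does not occur.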

kirk86 commented 6 years ago

@YubinXie try exposing the CUDA libraries to your system via LD_LIBRARY_PATH. This sometimes happens if you have multiple versions of CUDA installed on the same system.
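To see what LD_LIBRARY_PATH currently exposes, a rough sketch like the following lists the CUDA runtime libraries found in each directory on that path. The function name find_cuda_runtime is hypothetical, and the libcudart.so* file pattern assumes a Linux install; it only inspects file names and does not check which library TensorFlow actually loads.

```python
import glob
import os


def find_cuda_runtime(ld_library_path=None):
    """List libcudart.so* files in the directories named by LD_LIBRARY_PATH."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in ld_library_path.split(os.pathsep):
        if directory:  # skip empty entries such as a trailing ':'
            pattern = os.path.join(directory, "libcudart.so*")
            hits.extend(sorted(glob.glob(pattern)))
    return hits


if __name__ == "__main__":
    found = find_cuda_runtime()
    if found:
        for path in found:
            print(path)
    else:
        print("no libcudart.so* found on LD_LIBRARY_PATH")
```

If several CUDA versions show up, the first matching directory on the path is usually the one that wins, so the order of entries matters.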

waleedka commented 6 years ago

Regarding the fix that @kirk86 mentioned above, which requires patching Keras: I pushed a fix that solves the problem without patching Keras.

This doesn't address the division by zero error, which I can't replicate yet. If anyone can replicate it and track it, please let me know.

YubinXie commented 6 years ago

@kirk86 LD_LIBRARY_PATH does not help. It seems to come down to pip install tensorflow-gpu versus conda install tensorflow-gpu. Installing TensorFlow with conda solves this problem.

minizon commented 6 years ago

I got the same problem with keras 2.1, python 3.6, and tf-gpu 1.4

2018-09-15 19:44:55.464381: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero
	 [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
2018-09-15 19:44:57.194709: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero
	 [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
2018-09-15 19:44:57.273291: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero
	 [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
2018-09-15 19:44:57.369860: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero
	 [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
2018-09-15 19:44:57.372454: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero
	 [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
2018-09-15 19:44:57.372561: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero
	 [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
2018-09-15 19:44:57.372649: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero
	 [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
Traceback (most recent call last):
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Integer division by zero
	 [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
	 [[Node: training/SGD/gradients/tower_1/mask_rcnn/roi_align_mask/concat_grad/Gather_5/_4625 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_19668_training/SGD/gradients/tower_1/mask_rcnn/roi_align_mask/concat_grad/Gather_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]

...which was originally created as op 'mrcnn_bbox_loss_1/concat', defined at:
  File "train_test_net.py", line 210, in <module>
    model_dir=args.logdir)
  File "/mask_rcnn/model.py", line 96, in __init__
    self.keras_model = self.build(mode=mode, config=config)
  File "/mask_rcnn/model.py", line 329, in build
    model = ParallelModel(model, config.GPU_COUNT)
  File "/mask_rcnn/parallel_model.py", line 36, in __init__
    merged_outputs = self.make_parallel()
  File "/mask_rcnn/parallel_model.py", line 101, in make_parallel
    m = KL.Concatenate(axis=0, name=name)(outputs)
  File "/home/anaconda3/lib/python3.6/site-packages/keras/engine/base_layer.py", line 460, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/anaconda3/lib/python3.6/site-packages/keras/layers/merge.py", line 155, in call
    return self._merge_function(inputs)
  File "/home/anaconda3/lib/python3.6/site-packages/keras/layers/merge.py", line 357, in _merge_function
    return K.concatenate(inputs, axis=self.axis)
  File "/home/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1923, in concatenate
    return tf.concat([to_dense(x) for x in tensors], axis)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1099, in concat
    return gen_array_ops._concat_v2(values=values, axis=axis, name=name)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 706, in _concat_v2
    "ConcatV2", values=values, axis=axis, name=name)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
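One possible reading of the log above, not confirmed by anyone in this thread: the failing node is FloorMod(concat/axis, concat_grad/Rank), i.e. axis mod rank, and that divides by zero if the tensors being concatenated are scalars (rank 0), as a per-GPU loss output might be. The plain-Python sketch below only illustrates that arithmetic; the shapes and the normalize_axis helper are hypothetical stand-ins, not TensorFlow or Mask_RCNN code.

```python
def normalize_axis(axis, rank):
    """Mimic the `axis mod rank` step that the FloorMod node in the log performs."""
    return axis % rank


# Hypothetical shapes: a scalar loss per GPU tower versus a loss reshaped to length 1.
scalar_shape = ()
vector_shape = (1,)

try:
    normalize_axis(0, len(scalar_shape))  # rank 0 -> 0 % 0
    scalar_failed = False
except ZeroDivisionError:
    scalar_failed = True

# With a leading dimension the rank is 1, so axis 0 is valid.
vector_axis = normalize_axis(0, len(vector_shape))

print("scalar concat fails:", scalar_failed)
print("rank-1 concat axis:", vector_axis)
```

If this reading is right, merging scalar outputs by some route other than concatenation along axis 0 (or giving them a leading dimension first) would avoid the modulo by zero, but that is an assumption to verify against the actual graph.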

matterport / Mask_RCNN

Problem with multi gpu (Integer division by zero) #395