Open 5730279821-TA opened 6 years ago
Looking into it.
@5730279821-TA can you check whether #423 solves your problem?
@kirk86 Got the same error and #423 doesn't solve the issue.
update: the problem is solved by updating TF from 1.3 to 1.7
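For anyone comparing their setup against the versions reported in this thread, here is a minimal sketch of a version check. The helper names `version_tuple` and `is_at_least` are made up for illustration, and the 1.7 threshold is only what commenters here reported as working, not a documented requirement.

```python
import re

def version_tuple(version):
    """Turn '1.7.0' or '1.7.0-rc1' into a tuple of ints such as (1, 7, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)

def is_at_least(version, minimum):
    """Return True if `version` is greater than or equal to `minimum`."""
    have, want = version_tuple(version), version_tuple(minimum)
    # Pad the shorter tuple with zeros so that '1.7' and '1.7.0' compare equal.
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have >= want

# Typical use (requires TensorFlow to be installed):
#   import tensorflow as tf
#   if not is_at_least(tf.__version__, "1.7"):
#       print("TF", tf.__version__, "is older than the 1.7 reported to work here")
```

Comparing integer tuples rather than raw strings avoids treating "1.10" as older than "1.7".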
@hongzhili Indeed, I was using TF 1.7 when I had that problem, so I can't really comment on other TF versions. The solution I proposed was also tested with TF 1.7.
Anyone using Python 2.7 by any chance? @gustavz reports that this causes a division by zero error here https://github.com/matterport/Mask_RCNN/issues/460
I am working on Python 3.6 with TF 1.4 and I have this error too. (Once I updated to TF 1.7 on an AWS deep learning instance, the CUDA libraries could not be found.)
@YubinXie try exposing the CUDA libraries to your system via LD_LIBRARY_PATH. This sometimes happens if you have multiple versions of CUDA installed on the same system.
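To see what the LD_LIBRARY_PATH suggestion above amounts to on a given machine, a small sketch like the following lists which directories on that path contain a CUDA runtime library. The function name `cuda_runtime_dirs` is hypothetical, and the `libcudart.so*` filename pattern is an assumption about a typical Linux CUDA install.

```python
import glob
import os

def cuda_runtime_dirs(library_path):
    """Return the directories in `library_path` that contain a libcudart file."""
    hits = []
    for directory in library_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        # Assumed naming: libcudart.so, libcudart.so.9.0, and so on.
        if glob.glob(os.path.join(directory, "libcudart.so*")):
            hits.append(directory)
    return hits

found = cuda_runtime_dirs(os.environ.get("LD_LIBRARY_PATH", ""))
if found:
    print("CUDA runtime found in:", found)
else:
    print("No libcudart on LD_LIBRARY_PATH")
```

If more than one directory is reported, several CUDA versions are visible at once, which is the situation described above.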
Regarding the fix that @kirk86 mentioned above, which requires patching Keras: I pushed a fix that solves the problem without patching Keras.
This doesn't address the division by zero error, which I can't replicate yet. If anyone can replicate it and track it, please let me know.
@kirk86 LD_LIBRARY_PATH does not help. It seems to come down to the difference between pip install tensorflow-gpu and conda install tensorflow-gpu. Installing TensorFlow with conda solves this problem.
I got the same problem with keras 2.1, python 3.6, and tf-gpu 1.4
2018-09-15 19:44:55.464381: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
2018-09-15 19:44:57.194709: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
2018-09-15 19:44:57.273291: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
2018-09-15 19:44:57.369860: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
2018-09-15 19:44:57.372454: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
2018-09-15 19:44:57.372561: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
2018-09-15 19:44:57.372649: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Integer division by zero [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
Traceback (most recent call last):
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Integer division by zero
[[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
[[Node: training/SGD/gradients/tower_1/mask_rcnn/roi_align_mask/concat_grad/Gather_5/_4625 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_19668_training/SGD/gradients/tower_1/mask_rcnn/roi_align_mask/concat_grad/Gather_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
...which was originally created as op 'mrcnn_bbox_loss_1/concat', defined at:
File "train_test_net.py", line 210, in
I am trying to use multiple GPUs (GPU_COUNT = 2), but during training it shows the error "Integer division by zero". I cloned the code from 4 months ago and it works, but the current (updated) code does not.
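The log above names the failing node as a FloorMod of `mrcnn_bbox_loss_1/concat/axis` by `concat_grad/Rank`, so the error message suggests that the rank value evaluated to 0 at that point. The plain-Python sketch below only illustrates that arithmetic. The function name `normalize_axis` is hypothetical, and why the rank would be 0 in the multi-GPU case is not established in this thread.

```python
def normalize_axis(axis, rank):
    """Mimic the `axis mod rank` step named in the log (FloorMod of axis by Rank)."""
    return axis % rank

# With a non-zero rank, the modulo maps a negative axis onto a valid index.
print(normalize_axis(-1, 2))

# With rank 0, the same step cannot be computed at all.
try:
    normalize_axis(0, 0)
except ZeroDivisionError as err:
    print("rank 0 fails:", err)
```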