Closed zacharynevin closed 7 years ago
I solved this. I had to use tf.device('cpu:0') for the while_loop. I noticed that some of the tensors in the loop body were mapped to the GPU while others were mapped to the CPU, in which case those tensors wouldn't have access to each other's memory.
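A minimal sketch of that workaround, assuming TensorFlow 2.x with eager execution (the thread itself was on 1.x). The function name count_above_threshold and the sample scores are made up for illustration; the point is only that the loop condition, the loop body, and the tf.where call are all created inside one tf.device scope so they land on the same device.

```python
import tensorflow as tf

def count_above_threshold(scores, threshold):
    """Count scores above a threshold, row by row, with the whole loop pinned to the CPU."""
    # Everything created inside this scope is placed on the CPU,
    # so no tensor in the loop body ends up on a different device.
    with tf.device('/cpu:0'):
        batch = tf.shape(scores)[0]

        def cond(i, total):
            return i < batch

        def body(i, total):
            # tf.where returns the indices of the True entries, shape [k, 1]
            idx = tf.where(scores[i] > threshold)
            return i + 1, total + tf.shape(idx)[0]

        _, total = tf.while_loop(cond, body, (tf.constant(0), tf.constant(0)))
    return total

# Hypothetical input: 2 rows, 3 scores above 0.5 in total
scores = tf.constant([[0.9, 0.2, 0.7], [0.1, 0.3, 0.8]])
total = count_above_threshold(scores, 0.5)
print(int(total), total.device)
```

Printing the device attribute of the result is a quick way to confirm where the loop output actually lives.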
Hi, I got this error while running Faster R-CNN. Please help me resolve it. Thanks in advance.
2018-01-11 06:05:49.066479: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
2018-01-11 06:05:49.066923: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
[[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
2018-01-11 06:05:49.094692: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
[[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
2018-01-11 06:05:49.094692: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
[[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
2018-01-11 06:05:49.094700: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
[[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
2018-01-11 06:05:49.094705: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
[[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
2018-01-11 06:05:49.094712: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
[[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
Traceback (most recent call last):
File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
status, run_metadata)
File "/home/anaconda3/lib/python3.6/contextlib.py", line 89, in __exit__
next(self.gen)
File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
[[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./tools/trainval_net.py", line 140, in
@chiru83 Did you fix this error? I hit the same error when training Mask R-CNN.
2018-05-13 00:55:05.452042: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid configuration argument
2018-05-13 00:55:05.452913: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid configuration argument
[[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
2018-05-13 00:55:05.453128: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid configuration argument
[[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
2018-05-13 00:55:05.453190: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid configuration argument
[[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
2018-05-13 00:55:05.453238: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid configuration argument
[[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
Traceback (most recent call last):
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid configuration argument
[[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
[[Node: fpn_maskrcnn_head/PyramidROIAlign/Where/_779 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2167_fpn_maskrcnn_head/PyramidROIAlign/Where", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/chaoli/PycharmProjects/SuperCode/tensorpack-master/Tensorpack_Examples/Humanpose/test.py", line 153, in <module>
print(sess.run(mrcnn_loss, feed_dict=feed_datas))
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid configuration argument
[[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
[[Node: fpn_maskrcnn_head/PyramidROIAlign/Where/_779 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2167_fpn_maskrcnn_head/PyramidROIAlign/Where", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'fpn_maskrcnn_head/PyramidROIAlign/Where', defined at:
File "/home/chaoli/PycharmProjects/SuperCode/tensorpack-master/Tensorpack_Examples/Humanpose/test.py", line 113, in <module>
config.MASK_POOL_SIZE, config.NUM_CLASS, config.ANCHOR_STRIDES)
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorpack/tfutils/scope_utils.py", line 113, in wrapper
return func(*args, **kwargs)
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorpack/tfutils/scope_utils.py", line 52, in wrapper
return func(*args, **kwargs)
File "/home/chaoli/PycharmProjects/SuperCode/tensorpack-master/Tensorpack_Examples/Humanpose/model.py", line 617, in fpn_maskrcnn_head
roi_features = PyramidROIAlign(rois, fpn_features, pool_size, features_strides)
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorpack/tfutils/scope_utils.py", line 84, in wrapper
return func(*args, **kwargs)
File "/home/chaoli/PycharmProjects/SuperCode/tensorpack-master/Tensorpack_Examples/Humanpose/model.py", line 572, in PyramidROIAlign
index = tf.where(tf.equal(leves,level))[:,0]
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 2439, in where
return gen_array_ops.where(input=condition, name=name)
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5930, in where
"Where", input=input, name=name)
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InternalError (see above for traceback): WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid configuration argument
[[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
[[Node: fpn_maskrcnn_head/PyramidROIAlign/Where/_779 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2167_fpn_maskrcnn_head/PyramidROIAlign/Where", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Process finished with exit code 1
Not yet. My guess is that the errors are due to an older GPU, and the drivers may need to be updated, but I am not sure.
I have solved the same problem! I could not find an answer anywhere. The cause may be that some GPU code uses a different GPU ID from the one TensorFlow uses.
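The comment above does not say how the GPU IDs were aligned. One common way to make every CUDA library in a process see the same device, shown here only as an assumption about what was meant, is to set the CUDA_DEVICE_ORDER and CUDA_VISIBLE_DEVICES environment variables before anything initializes CUDA. The helper name pin_visible_gpu is hypothetical.

```python
import os

def pin_visible_gpu(gpu_id):
    """Restrict this process to a single GPU, numbered the way nvidia-smi numbers it."""
    # Order devices by PCI bus ID so CUDA indices match nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Only this GPU stays visible; inside the process it appears as device 0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return {k: os.environ[k] for k in ("CUDA_DEVICE_ORDER", "CUDA_VISIBLE_DEVICES")}

# Call this before importing tensorflow or any other CUDA-using library.
settings = pin_visible_gpu(0)
print(settings)
```

With only one device visible, custom GPU ops and TensorFlow cannot disagree about which physical card "GPU 0" refers to.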
Environment
I pulled the environment information from the tf_env_collect.sh script offered here: https://github.com/tensorflow/tensorflow/blob/master/tools/tf_env_collect.sh
Additionally, I am using Bitnami to run TensorFlow Serving: https://docs.bitnami.com/general/infrastructure/tensorflowserving/
I used the following command to compile Tensorflow Serving with GPU support:
Problem
I have a model that does the following:
1. In a tf.map_fn loop, convert the encoded PNG images to NHWC images. This part is fine.
2. Run tf.image.non_max_suppression in a tf.while_loop. This is where the problems appear.
When I get to the tf.while_loop, I start to get strange cub::DeviceReduce::Sum errors. This seems to happen specifically when I run tf.where operations.
These errors do not appear when I try to run the graph in Python with tensorflow-gpu support.
This is the error that appears:
Specifically, it says that there is an "invalid configuration argument". What I am asking is: what is the invalid configuration argument? Could it be a mistake in the way I compiled TensorFlow Serving (i.e. in the options I specified)?
See the appendix below for a description of the TensorFlow code for this part of the graph.
Appendix
For context, here is the tensorflow code corresponding to this part of the graph:
The code above is intended to do the following on each iteration of the tf.while_loop:
1. The bboxes tensor has the shape [None, 960, 4] (i.e. 960 max bboxes). The scores tensor has the shape [None, 960] (i.e. a score for each bbox).
2. Run tf.where on the scores tensor to produce a boolean mask with which I can filter down the bboxes and scores tensors. If I have, for example, only 5 elements in the scores tensor that are above the threshold, then I would expect the bboxes and scores tensors to now have the shapes [None, 5, 4] and [None, 5], respectively, after the tf.boolean_mask operation.
3. max_output_size chooses the maximum number of bounding boxes to extract using tf.image.non_max_suppression. I have the constant max_detectable_faces that caps this.
4. Filter bboxes based on the indices returned by tf.image.non_max_suppression.
5. The partitions tensor is a 1-D tensor where each element is the position in the original batch of each bounding box. This allows me to later use tf.image.crop_and_resize on the images.
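The steps above can be sketched for a single image roughly as follows. This is not the code from the issue; it is a minimal reconstruction assuming TensorFlow 2.x with eager execution, and the function name select_boxes, the score_threshold and iou_threshold defaults, and the sample boxes are all hypothetical.

```python
import tensorflow as tf

def select_boxes(bboxes, scores, batch_index, score_threshold=0.5,
                 max_detectable_faces=10, iou_threshold=0.5):
    """Filter one image's boxes by score, run NMS, and tag survivors with the batch index."""
    # Keep only boxes whose score is above the threshold.
    keep = scores > score_threshold
    bboxes = tf.boolean_mask(bboxes, keep)
    scores = tf.boolean_mask(scores, keep)
    # Cap the number of boxes NMS may return.
    max_output_size = tf.minimum(tf.shape(scores)[0], max_detectable_faces)
    indices = tf.image.non_max_suppression(
        bboxes, scores, max_output_size, iou_threshold=iou_threshold)
    # Keep only the boxes NMS selected.
    selected = tf.gather(bboxes, indices)
    # One entry per selected box, holding this image's position in the batch.
    partitions = tf.fill([tf.shape(indices)[0]], batch_index)
    return selected, partitions

# Boxes are [y1, x1, y2, x2]. The second box overlaps the first heavily,
# and the fourth falls below the score threshold.
bboxes = tf.constant([[0.0, 0.0, 1.0, 1.0],
                      [0.0, 0.0, 1.0, 0.9],
                      [2.0, 2.0, 3.0, 3.0],
                      [5.0, 5.0, 6.0, 6.0]])
scores = tf.constant([0.9, 0.8, 0.7, 0.1])
selected, partitions = select_boxes(bboxes, scores, batch_index=3)
print(selected.numpy(), partitions.numpy())
```

In a batched tf.while_loop, each iteration would call something like this for one image and concatenate the results, with partitions later serving as the box-to-image index for tf.image.crop_and_resize.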