tensorflow / serving

A flexible, high-performance serving system for machine learning models
https://www.tensorflow.org/serving
Apache License 2.0

Could not launch cub::DeviceReduce::Sum to count number of true indices #627

Closed zacharynevin closed 7 years ago

zacharynevin commented 7 years ago

Environment

I pulled the environment information from the tf_env_collect.sh script offered here: https://github.com/tensorflow/tensorflow/blob/master/tools/tf_env_collect.sh.

== cat /etc/issue ===============================================
Linux ip-172-31-64-152 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u3 (2017-08-15) x86_64 GNU/Linux
VERSION_ID="8"
VERSION="8 (jessie)"

== are we in docker =============================================
No

== compiler =====================================================
c++ (Debian 4.9.2-10) 4.9.2
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux ip-172-31-64-152 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u3 (2017-08-15) x86_64 GNU/Linux

== check pips ===================================================
numpy (1.13.1)
protobuf (3.4.0)
tensorflow (1.3.0)
tensorflow-tensorboard (0.1.5)

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.3.0
tf.GIT_VERSION = unknown
tf.COMPILER_VERSION = unknown
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Fri Oct 20 23:33:16 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   58C    P0    57W / 149W |  10961MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     18939    C   ...rflow-serving/bin/tensorflow_model_server 10957MiB |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
/usr/local/cuda-8.0/lib64/libcudart_static.a

Additionally, I am using Bitnami to run TensorFlow Serving: https://docs.bitnami.com/general/infrastructure/tensorflowserving/

I used the following command to compile TensorFlow Serving with GPU support:

bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k --jobs 6 --verbose_failures tensorflow_serving/model_servers:tensorflow_model_server

Problem

I have a model that does the following:

When execution reaches the tf.while_loop, I start getting strange cub::DeviceReduce::Sum errors. They seem to occur specifically when tf.where operations run.

These errors do not appear when I try to run the graph in Python with tensorflow-gpu support.

This is the error that appears:

WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices.  temp_storage_bytes: 1, status: invalid configuration argument
     [[Node: face_detector/bounding_boxes/nms_bounding_boxes/bbox_masked/Where = Where[_output_shapes=[[?,1]], _device="/job:localhost/replica:0/task:0/device:GPU:0"](face_detector/bounding_boxes/nms_bounding_boxes/bbox_masked/Reshape_1)]]
     [[Node: face_detector/bounding_boxes/nms_bounding_boxes/bbox_masked/Gather/_181 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_780_face_detector/bounding_boxes/nms_bounding_boxes/bbox_masked/Gather", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](^_cloopface_detector/bounding_boxes/nms_bounding_boxes/Gather_2/indices/_21)]]

Specifically, it says that there is an "invalid configuration argument". What I am asking is: what is the invalid configuration argument? Could it be a mistake in the way I compiled TensorFlow Serving (i.e. in the options I specified)?
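To see at a glance which node and device a message like the one above refers to, the relevant fields can be pulled out with a few regular expressions. This is only a sketch: summarize_where_error is a hypothetical helper, not part of TensorFlow, and the patterns assume the message layout shown in this report.

```python
import re

# Hypothetical helper, not a TensorFlow API. The patterns assume the
# message layout shown in this report.
_STATUS_RE = re.compile(r"temp_storage_bytes:\s*(\d+),\s*status:\s*([^\[\n]+)")
_NODE_RE = re.compile(r"\[\[Node:\s*(\S+)\s*=\s*(\w+)\[")
_DEVICE_RE = re.compile(r'_device="([^"]+)"')


def summarize_where_error(message):
    """Return the CUDA status, temp storage size, first node, op type and device."""
    status = _STATUS_RE.search(message)
    node = _NODE_RE.search(message)
    device = _DEVICE_RE.search(message)
    return {
        "temp_storage_bytes": int(status.group(1)) if status else None,
        "status": status.group(2).strip() if status else None,
        "node": node.group(1) if node else None,
        "op": node.group(2) if node else None,
        "device": device.group(1) if device else None,
    }


# First two lines of the error quoted in this report.
message = (
    "WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices.  "
    "temp_storage_bytes: 1, status: invalid configuration argument\n"
    "     [[Node: face_detector/bounding_boxes/nms_bounding_boxes/bbox_masked/Where = "
    'Where[_output_shapes=[[?,1]], _device="/job:localhost/replica:0/task:0/device:GPU:0"]'
    "(face_detector/bounding_boxes/nms_bounding_boxes/bbox_masked/Reshape_1)]]"
)
summary = summarize_where_error(message)
print(summary["node"], summary["device"], summary["status"])
```

The node name ends in bbox_masked/Where, which matches the name passed to tf.boolean_mask in the appendix code, and the device field shows that this Where op was placed on GPU:0.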

See the appendix below for a description of the TensorFlow code for this part of the graph.

Appendix

For context, here is the tensorflow code corresponding to this part of the graph:

bboxes_tfarr = tf.TensorArray(tf.float32, size=batch_size, infer_shape=False)
partition_tfarr = tf.TensorArray(tf.int32, size=batch_size, infer_shape=False)

def cond(i, acc, pacc):
    return tf.less(i, batch_size)

def body(i, bbox_acc, partition_acc):
    bbox   = tf.reshape(tf.gather(bboxes, [i]), [-1, 4], name='bboxes')
    scores = tf.reshape(tf.gather(confs, [i]), [-1], name='scores')

    cond = tf.greater(scores, threshold, name='cond')

    conf_mask = tf.where(
        cond,
        tf.ones_like(scores),
        tf.zeros_like(scores),
        name='conf_mask'
    )

    conf_mask = tf.cast(conf_mask, tf.bool, name='conf_mask_bool')

    scores = tf.boolean_mask(scores, conf_mask, name='scores_masked')
    bbox = tf.boolean_mask(bbox, conf_mask, name='bbox_masked')

    number_of_faces = tf.reshape(tf.gather(tf.shape(scores), 0), [], name='number_of_faces')
    max_output_size = tf.reduce_min(tf.stack([number_of_faces, max_detectable_faces], axis=0), name='max_output_size')
    bbox_inds = tf.image.non_max_suppression(bbox, scores, iou_threshold=0.5, max_output_size=max_output_size, name='nms')
    bbox = tf.clip_by_value(tf.gather(bbox, bbox_inds, axis=0), 0.0, 1.0, name='bboxes_clipped')

    partition = tf.multiply(tf.ones_like(bbox_inds), i)

    return (tf.add(i, 1), bbox_acc.write(i, bbox), partition_acc.write(i, partition))

_, bboxes, partitions = tf.while_loop(
    cond,
    body,
    (tf.constant(0), bboxes_tfarr, partition_tfarr),
    name='nms_bounding_boxes'
)

bboxes     = tf.reshape(bboxes.concat(), [-1, 4], name='bboxes_reshaped')
partitions = tf.reshape(partitions.concat(), [-1], name='partitions_reshaped')

The code above is intended to do the following on each iteration of the tf.while_loop:

zacharynevin commented 7 years ago

I solved this. I had to wrap the while_loop in tf.device('cpu:0'). I noticed that some of the tensors in the loop body were placed on the GPU while others were placed on the CPU, in which case those tensors wouldn't have access to each other's memory.
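A minimal sketch of that kind of device pinning, assuming TensorFlow is installed. The function name and sample values are illustrative and not taken from the original model; in the original graph the same tf.device scope would wrap the tf.while_loop call instead.

```python
import tensorflow as tf


def mask_above_threshold_on_cpu(scores, threshold):
    """Keep the scores greater than threshold, with every op pinned to the CPU."""
    # tf.boolean_mask builds a Where op internally. The device scope places it,
    # and the ops feeding it, on the CPU instead of the GPU.
    with tf.device('/cpu:0'):
        keep = tf.greater(scores, threshold)
        return tf.boolean_mask(scores, keep)


# Illustrative input only.
kept = mask_above_threshold_on_cpu(tf.constant([0.1, 0.9, 0.5]), 0.4)
print(kept.device)
```

Pinning the whole loop to the CPU trades GPU speed for consistent placement, so it is worth checking whether only the masking ops need to move.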

chiru83 commented 6 years ago

Hi, I got this error while running Faster R-CNN. Please help me resolve it. Thanks in advance.

2018-01-11 06:05:49.066479: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
2018-01-11 06:05:49.066923: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
     [[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
2018-01-11 06:05:49.094692: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
     [[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
2018-01-11 06:05:49.094692: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
     [[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
2018-01-11 06:05:49.094700: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
     [[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
2018-01-11 06:05:49.094705: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
     [[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
2018-01-11 06:05:49.094712: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
     [[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
Traceback (most recent call last):
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/home/anaconda3/lib/python3.6/contextlib.py", line 89, in __exit__
    next(self.gen)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
     [[Node: LOSS_default/Where = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./tools/trainval_net.py", line 140, in <module>
    max_iters=args.max_iters)
  File "/home/FasterRCNN/tf-faster-rcnn-master/tools/../lib/model/train_val.py", line 400, in train_net
    sw.train_model(sess, max_iters)
  File "/home/FasterRCNN/tf-faster-rcnn-master/tools/../lib/model/train_val.py", line 311, in train_model
    self.net.train_step(sess, blobs, train_op)
  File "/home/FasterRCNN/tf-faster-rcnn-master/tools/../lib/nets/network.py", line 465, in train_step
    feed_dict=feed_dict)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 2815, status: invalid device function
     [[Node: LOSS_default/Where = Where[_device="/job:localhost/replica:0/task:0/gpu:0"](LOSS_default/Not

Superlee506 commented 6 years ago

@chiru83 Did you fix this error? I hit the same error when training Mask R-CNN.

2018-05-13 00:55:05.452042: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices.  temp_storage_bytes: 1, status: invalid configuration argument
2018-05-13 00:55:05.452913: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices.  temp_storage_bytes: 1, status: invalid configuration argument
     [[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
2018-05-13 00:55:05.453128: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices.  temp_storage_bytes: 1, status: invalid configuration argument
     [[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
2018-05-13 00:55:05.453190: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices.  temp_storage_bytes: 1, status: invalid configuration argument
     [[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
2018-05-13 00:55:05.453238: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices.  temp_storage_bytes: 1, status: invalid configuration argument
     [[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
Traceback (most recent call last):
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices.  temp_storage_bytes: 1, status: invalid configuration argument
     [[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
     [[Node: fpn_maskrcnn_head/PyramidROIAlign/Where/_779 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2167_fpn_maskrcnn_head/PyramidROIAlign/Where", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/chaoli/PycharmProjects/SuperCode/tensorpack-master/Tensorpack_Examples/Humanpose/test.py", line 153, in <module>
    print(sess.run(mrcnn_loss, feed_dict=feed_datas))
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices.  temp_storage_bytes: 1, status: invalid configuration argument
     [[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
     [[Node: fpn_maskrcnn_head/PyramidROIAlign/Where/_779 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2167_fpn_maskrcnn_head/PyramidROIAlign/Where", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'fpn_maskrcnn_head/PyramidROIAlign/Where', defined at:
  File "/home/chaoli/PycharmProjects/SuperCode/tensorpack-master/Tensorpack_Examples/Humanpose/test.py", line 113, in <module>
    config.MASK_POOL_SIZE, config.NUM_CLASS, config.ANCHOR_STRIDES)
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorpack/tfutils/scope_utils.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorpack/tfutils/scope_utils.py", line 52, in wrapper
    return func(*args, **kwargs)
  File "/home/chaoli/PycharmProjects/SuperCode/tensorpack-master/Tensorpack_Examples/Humanpose/model.py", line 617, in fpn_maskrcnn_head
    roi_features = PyramidROIAlign(rois, fpn_features, pool_size, features_strides)
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorpack/tfutils/scope_utils.py", line 84, in wrapper
    return func(*args, **kwargs)
  File "/home/chaoli/PycharmProjects/SuperCode/tensorpack-master/Tensorpack_Examples/Humanpose/model.py", line 572, in PyramidROIAlign
    index = tf.where(tf.equal(leves,level))[:,0]
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 2439, in where
    return gen_array_ops.where(input=condition, name=name)
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5930, in where
    "Where", input=input, name=name)
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/chaoli/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices.  temp_storage_bytes: 1, status: invalid configuration argument
     [[Node: fpn_maskrcnn_head/PyramidROIAlign/Where = Where[_device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_maskrcnn_head/PyramidROIAlign/Equal/_777)]]
     [[Node: fpn_maskrcnn_head/PyramidROIAlign/Where/_779 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2167_fpn_maskrcnn_head/PyramidROIAlign/Where", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Process finished with exit code 1

chiru83 commented 6 years ago

Not yet. I suspect the errors are due to an older GPU. The drivers may need to be updated, but I'm not sure.

cuijianaaa commented 5 years ago

I have solved the same problem! I could not find an answer anywhere. It may be that some GPU code is using a different GPU id from the one TensorFlow uses.
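The comment does not say how the GPU ids were reconciled. One common way to make all CUDA code in a process see the same device is the CUDA_VISIBLE_DEVICES environment variable; the sketch below assumes that approach, and restrict_visible_gpus is a hypothetical helper name.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given GPU ids to CUDA libraries loaded later in this process."""
    value = ",".join(str(i) for i in gpu_ids)
    # This must be set before TensorFlow (or any other CUDA library)
    # initializes the GPU, so call it before those imports.
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Assumption for illustration: everything should run on physical GPU 0.
restrict_visible_gpus([0])
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

With a single visible device, TensorFlow and any custom CUDA ops both see it as device 0, which removes one possible source of mismatched ids.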