mks0601 / PoseFix_RELEASE

Official TensorFlow implementation of "PoseFix: Model-agnostic General Human Pose Refinement Network", CVPR 2019
MIT License

About training on COCO #8

Closed: BruceLeeeee closed this issue 5 years ago

BruceLeeeee commented 5 years ago

Hi, thanks for your work. I tried to train on the COCO dataset and changed only the dataset in the default config, but I encountered the following error:

07-01 15:14:24 Initialize saver ...
07-01 15:14:27 Initialize all variables ...
07-01 15:14:39 Initialized model weights from /root/lsh2/PoseFix_RELEASE/main/../data/imagenet_weights/resnet_v1_152.ckpt ...
07-01 15:14:55 Start training ...
2019-07-01 15:15:19.420659: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at scatter_nd_op.cc:119 : Invalid argument: indices[3] = [0, 0, 159, 3] does not index into shape [16,96,72,17]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[3] = [0, 0, 159, 3] does not index into shape [16,96,72,17]
     [[{{node tower_0/ScatterNd}} = ScatterNd[T=DT_FLOAT, Tindices=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/Cast, tower_0/concat_8, tower_0/ScatterNd/shape)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/lsh2/PoseFix_RELEASE/main/train.py", line 31, in <module>
    trainer.train()
  File "/root/lsh2/PoseFix_RELEASE/main/../lib/tfflat/base.py", line 449, in train
    [self.graph_ops[0], self.lr, *self.summary_dict.values()], feed_dict=feed_dict)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[3] = [0, 0, 159, 3] does not index into shape [16,96,72,17]
     [[node tower_0/ScatterNd (defined at /root/lsh2/PoseFix_RELEASE/main/model.py:108)  = ScatterNd[T=DT_FLOAT, Tindices=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/Cast, tower_0/concat_8, tower_0/ScatterNd/shape)]]

Caused by op 'tower_0/ScatterNd', defined at:
  File "/root/lsh2/PoseFix_RELEASE/main/train.py", line 30, in <module>
    trainer = Trainer(Model(), cfg)
  File "/root/lsh2/PoseFix_RELEASE/main/../lib/tfflat/base.py", line 195, in __init__
    super(Trainer, self).__init__(net, cfg, data_iter, log_name='train_logs.txt')
  File "/root/lsh2/PoseFix_RELEASE/main/../lib/tfflat/base.py", line 125, in __init__
    self.build_graph()
  File "/root/lsh2/PoseFix_RELEASE/main/../lib/tfflat/base.py", line 142, in build_graph
    self.graph_ops = self._make_graph()
  File "/root/lsh2/PoseFix_RELEASE/main/../lib/tfflat/base.py", line 382, in _make_graph
    self.net.make_network(is_train=True)
  File "/root/lsh2/PoseFix_RELEASE/main/model.py", line 156, in make_network
    self.render_onehot_heatmap(target_coord, cfg.output_shape),\
  File "/root/lsh2/PoseFix_RELEASE/main/model.py", line 108, in render_onehot_heatmap
    heatmap = tf.scatter_nd(indices, probs, (batch_size, *output_shape, cfg.num_kps))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 7077, in scatter_nd
    "ScatterNd", indices=indices, updates=updates, shape=shape, name=name)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[3] = [0, 0, 159, 3] does not index into shape [16,96,72,17]
     [[node tower_0/ScatterNd (defined at /root/lsh2/PoseFix_RELEASE/main/model.py:108)  = ScatterNd[T=DT_FLOAT, Tindices=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/Cast, tower_0/concat_8, tower_0/ScatterNd/shape)]]

Thanks for your time.

coordxyz commented 5 years ago

I have the same problem. Did you solve it, and if so, how? Thanks~

mks0601 commented 5 years ago

Can you describe this in more detail? What did you change from the original code?

mks0601 commented 5 years ago

Did you train the model on COCO and then try to test on another dataset?

BruceLeeeee commented 5 years ago

@bhyzhao @mks0601 I only changed Config.dataset to COCO. I think the error occurs because tf.scatter_nd in render_onehot_heatmap() receives out-of-bounds indices that are not checked beforehand. If I clip the values of target_coord, it works.
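
To make the diagnosis concrete, here is a minimal sketch in plain Python (not part of PoseFix) that checks the index from the error message against the tensor shape. The helper name `first_out_of_bounds_axis` is made up for illustration. The `[batch, y, x, joint]` layout is read off the `tf.scatter_nd(indices, probs, (batch_size, *output_shape, cfg.num_kps))` call in the traceback and the index construction in the function posted later in this thread.

```python
def first_out_of_bounds_axis(index, shape):
    """Return the first axis where `index` falls outside `shape`, or None."""
    for axis, (i, size) in enumerate(zip(index, shape)):
        if not 0 <= i < size:
            return axis
    return None


# Values copied from the error message in this thread.
index = [0, 0, 159, 3]      # [batch, y, x, joint]
shape = [16, 96, 72, 17]    # [batch_size, height, width, num_kps]

# Axis 2 is the x axis: x = 159, but the heatmap is only 72 columns wide.
print(first_out_of_bounds_axis(index, shape))
```

So the offending value is an x coordinate that lands far to the right of the heatmap, which is consistent with a target keypoint lying outside the cropped input.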

mks0601 commented 5 years ago

Which version of TF are you using? In my experience, and according to the docs (https://www.tensorflow.org/api_docs/python/tf/scatter_nd), out-of-bounds indices are ignored on GPU.
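
For reference, the tf.scatter_nd documentation describes two behaviours: on CPU an out-of-bounds index returns an error, and on GPU it is ignored. The log above shows the ScatterNd node placed on `device:CPU:0`, which would match the error path. The sketch below is a pure-Python stand-in for a 2-D scatter, not TensorFlow code; the function `scatter_add` and its `ignore_out_of_bounds` flag are hypothetical and exist only to contrast the two behaviours.

```python
def scatter_add(indices, updates, shape, ignore_out_of_bounds):
    """Sum `updates` into a zero grid of `shape` at the given (y, x) indices."""
    height, width = shape
    grid = [[0.0] * width for _ in range(height)]
    for (y, x), value in zip(indices, updates):
        inside = 0 <= y < height and 0 <= x < width
        if not inside:
            if ignore_out_of_bounds:
                continue  # documented GPU behaviour: the index is skipped
            # documented CPU behaviour: the op fails
            raise IndexError(
                f"index {[y, x]} does not index into shape {list(shape)}")
        grid[y][x] += value
    return grid


indices = [(0, 1), (0, 159)]   # the second index is outside a 2 x 3 grid
updates = [0.5, 0.5]

print(scatter_add(indices, updates, (2, 3), ignore_out_of_bounds=True))
try:
    scatter_add(indices, updates, (2, 3), ignore_out_of_bounds=False)
except IndexError as err:
    print("error:", err)
```

Code that relies on the lenient behaviour therefore works only while the op actually runs on GPU. Clipping or filtering the indices beforehand avoids depending on device placement.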

BruceLeeeee commented 5 years ago

I have tested tensorflow==1.12, tensorflow==1.14, and tensorflow-gpu==1.14, and all give the same error. According to the docs it should work, so I don't know why it fails.

mks0601 commented 5 years ago

I also used 1.12 when implementing PoseFix. That is weird. Can you tell me how you clipped the coordinates?

BruceLeeeee commented 5 years ago

I am not sure whether this is correct. I think the out-of-bounds points are invalid, so clipping them should not affect the loss, right?

```python
def render_onehot_heatmap(self, coord, output_shape):
    batch_size = tf.shape(coord)[0]

    x = tf.reshape(coord[:,:,0] / cfg.input_shape[1] * output_shape[1],[-1])
    y = tf.reshape(coord[:,:,1] / cfg.input_shape[0] * output_shape[0],[-1])
    x_floor = tf.floor(x)
    y_floor = tf.floor(y)

    x_floor = tf.clip_by_value(x_floor, 0, output_shape[1] - 2)  # fix out-of-bounds x
    y_floor = tf.clip_by_value(y_floor, 0, output_shape[0] - 2)  # fix out-of-bounds y

    indices_batch = tf.expand_dims(tf.to_float(\
            tf.reshape(
            tf.transpose(\
            tf.tile(\
            tf.expand_dims(tf.range(batch_size),0)\
            ,[cfg.num_kps,1])\
            ,[1,0])\
            ,[-1])),1)
    indices_batch = tf.concat([indices_batch, indices_batch, indices_batch, indices_batch], axis=0)
    indices_joint = tf.to_float(tf.expand_dims(tf.tile(tf.range(cfg.num_kps),[batch_size]),1))
    indices_joint = tf.concat([indices_joint, indices_joint, indices_joint, indices_joint], axis=0)

    indices_lt = tf.concat([tf.expand_dims(y_floor,1), tf.expand_dims(x_floor,1)], axis=1)
    indices_lb = tf.concat([tf.expand_dims(y_floor+1,1), tf.expand_dims(x_floor,1)], axis=1)
    indices_rt = tf.concat([tf.expand_dims(y_floor,1), tf.expand_dims(x_floor+1,1)], axis=1)
    indices_rb = tf.concat([tf.expand_dims(y_floor+1,1), tf.expand_dims(x_floor+1,1)], axis=1)

    indices = tf.concat([indices_lt, indices_lb, indices_rt, indices_rb], axis=0)
    indices = tf.cast(tf.concat([indices_batch, indices, indices_joint], axis=1),tf.int32)

    prob_lt = (1 - (x - x_floor)) * (1 - (y - y_floor))
    prob_lb = (1 - (x - x_floor)) * (y - y_floor)
    prob_rt = (x - x_floor) * (1 - (y - y_floor))
    prob_rb = (x - x_floor) * (y - y_floor)
    probs = tf.concat([prob_lt, prob_lb, prob_rt, prob_rb], axis=0)

    heatmap = tf.scatter_nd(indices, probs, (batch_size, *output_shape, cfg.num_kps))
    normalizer = tf.reshape(tf.reduce_sum(heatmap,axis=[1,2]),[batch_size,1,1,cfg.num_kps])
    normalizer = tf.where(tf.equal(normalizer,0),tf.ones_like(normalizer),normalizer)
    heatmap = heatmap / normalizer

    return heatmap
```
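
To see what the clipping does to a single keypoint, the following plain-Python sketch mirrors the floor, clip, and bilinear-weight steps of the function above for one point already scaled to heatmap coordinates. The helper `bilinear_corners` is hypothetical and only illustrates the arithmetic. The 72 x 96 (width x height) size comes from the shape in the error message.

```python
import math


def bilinear_corners(x, y, width, height):
    """Return four (row, col, weight) corners for one heatmap-space keypoint.

    The floor is clipped to [0, size - 2], as in the snippet above, so that
    floor + 1 is still a valid index.
    """
    x0 = min(max(math.floor(x), 0), width - 2)
    y0 = min(max(math.floor(y), 0), height - 2)
    dx, dy = x - x0, y - y0   # x and y themselves are NOT clipped
    return [
        (y0,     x0,     (1 - dx) * (1 - dy)),  # left-top
        (y0 + 1, x0,     (1 - dx) * dy),        # left-bottom
        (y0,     x0 + 1, dx * (1 - dy)),        # right-top
        (y0 + 1, x0 + 1, dx * dy),              # right-bottom
    ]


# An in-bounds keypoint: the weights are ordinary bilinear weights.
for corner in bilinear_corners(10.25, 20.5, width=72, height=96):
    print(corner)

# A keypoint at x = 159.5, far outside the 72-wide heatmap: every index is
# now valid, but the weights are no longer in [0, 1].
for corner in bilinear_corners(159.5, 0.0, width=72, height=96):
    print(corner)
```

The clipped indices always fall inside the heatmap, so the scatter no longer fails. Because only the floor is clipped, the weights of an out-of-bounds point are not meaningful probabilities, so the rendered heatmap for such a keypoint should be treated as invalid rather than as a corrected target.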

mks0601 commented 5 years ago

Yes, they would not affect the loss, because there is also target_valid.
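
As a rough illustration of that point, the sketch below shows how a per-keypoint validity flag removes a keypoint from a loss regardless of its target value. The actual PoseFix loss is not shown in this thread, so the L1 form, the averaging over valid keypoints, and the name `masked_l1_loss` are all assumptions. Only the idea of a `target_valid` mask comes from the discussion.

```python
def masked_l1_loss(pred, target, valid):
    """Mean absolute error over keypoints whose validity flag is 1."""
    total = sum(v * abs(p - t) for p, t, v in zip(pred, target, valid))
    count = sum(valid)
    return total / count if count else 0.0


pred = [1.0, 2.0, 3.0]
valid = [1, 1, 0]                 # the third keypoint is marked invalid

unclipped = [1.5, 2.0, 159.0]     # third target lies outside the heatmap
clipped = [1.5, 2.0, 70.0]        # same target after clipping

# Both calls give the same loss because the third keypoint is masked out.
print(masked_l1_loss(pred, unclipped, valid))
print(masked_l1_loss(pred, clipped, valid))
```

Under these assumptions, clipping changes only the targets that the mask already excludes.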