sydsim / personlab-tf

Implementation of PersonLab (https://arxiv.org/abs/1803.08225) using TF-slim

Failed to run training step #7

Open jrbasso opened 5 years ago

jrbasso commented 5 years ago

I downloaded the repo, went through all the steps of the setup notebook, and while executing the example notebook I get an error when running the training:

INFO:tensorflow:Error reported to Coordinator: indices[3,4,50,50,16] = [4, 51, 50, 16] does not index into param shape [5,51,51,17]

I tested on an AWS p2.xlarge and a p3.8xlarge, and both give the same error. I used the AWS Deep Learning AMI with Ubuntu.
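For reference, the message says that a coordinate of 51 was requested on an axis of size 51, whose valid range is 0..50. The sketch below is plain Python, not the repo's code: it checks the two coordinates quoted in the output against the reported param shape and shows clamping as one possible workaround. All helper names are hypothetical, and the idea that the upper neighbour of a bilinear lookup is what overflows is an assumption, since the body of util.py is not shown here.

```python
import math

PARAM_SHAPE = (5, 51, 51, 17)  # param shape reported in the error message


def out_of_range_axes(index, shape):
    """Return the axes on which `index` lies outside the valid range 0..size-1."""
    return [axis for axis, (i, size) in enumerate(zip(index, shape))
            if not 0 <= i < size]


def clamp_index(index, shape):
    """Clamp every coordinate into 0..size-1 (one possible workaround)."""
    return [min(max(i, 0), size - 1) for i, size in zip(index, shape)]


def bilinear_neighbours(pos, size, clamp=True):
    """Integer neighbours (lower, upper) of a fractional position on an axis.

    Assumption: a bilinear lookup reads floor(pos) and floor(pos) + 1.
    Without clamping, the upper neighbour can equal `size`, one past the end.
    """
    lo = math.floor(pos)
    hi = lo + 1
    if clamp:
        lo = min(max(lo, 0), size - 1)
        hi = min(max(hi, 0), size - 1)
    return lo, hi


# The two coordinates quoted in the error messages.
for reported in ([4, 51, 50, 16], [4, 49, 51, 5]):
    print(reported, "bad axes:", out_of_range_axes(reported, PARAM_SHAPE),
          "clamped:", clamp_index(reported, PARAM_SHAPE))

# A sampling position just inside the last cell overflows unless clamped.
print(bilinear_neighbours(50.4, 51, clamp=False))  # upper neighbour is 51
print(bilinear_neighbours(50.4, 51))               # upper neighbour is 50
```

Clamping only removes the symptom. Whether it is the right fix depends on how the offsets are supposed to behave at the image border.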

Full output:

loading annotations into memory...
Done (t=15.25s)
creating index...
index created!
loading annotations into memory...
Done (t=7.31s)
creating index...
index created!
WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py:737: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
INFO:tensorflow:Restoring parameters from logs/sample/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path logs/sample/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Error reported to Coordinator: indices[3,4,50,50,16] = [4, 51, 50, 16] does not index into param shape [5,51,51,17]
     [[node GatherNd_5 (defined at /home/ubuntu/personlab-tf/personlab/util.py:33)  = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Conv_7/BiasAdd, stack_48)]]

Caused by op 'GatherNd_5', defined at:
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 499, in start
    self.io_loop.start()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 132, in start
    self.asyncio_loop.run_forever()
  File "/home/ubuntu/anaconda3/lib/python3.6/asyncio/base_events.py", line 422, in run_forever
    self._run_once()
  File "/home/ubuntu/anaconda3/lib/python3.6/asyncio/base_events.py", line 1432, in _run_once
    handle._run()
  File "/home/ubuntu/anaconda3/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 122, in _handle_events
    handler_func(fileobj, events)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2662, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2785, in _run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2907, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2961, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-05888e252ab7>", line 8, in <module>
    train(mobilenet_v2_model, gen.loader, pm_check_path, log_dir)
  File "/home/ubuntu/personlab-tf/personlab/model.py", line 25, in train
    output, init_func = model_func(tensors['image'], checkpoint_path=checkpoint_path, is_training=True)
  File "/home/ubuntu/personlab-tf/personlab/models/mobilenet_v2.py", line 16, in mobilenet_v2_model
    res = model_base(model_output, inner_h, inner_w)
  File "/home/ubuntu/personlab-tf/personlab/models/model_base.py", line 36, in model_base
    lo_y = gather_bilinear(lo_y, lo_p, (inner_h, inner_w)) + lo_y
  File "/home/ubuntu/personlab-tf/personlab/util.py", line 33, in gather_bilinear
    r = tf.gather_nd(params, idx)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3240, in gather_nd
    "GatherNd", params=params, indices=indices, name=name)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[3,4,50,50,16] = [4, 51, 50, 16] does not index into param shape [5,51,51,17]
     [[node GatherNd_5 (defined at /home/ubuntu/personlab-tf/personlab/util.py:33)  = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Conv_7/BiasAdd, stack_48)]]
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[3,4,50,50,16] = [4, 51, 50, 16] does not index into param shape [5,51,51,17]
     [[{{node GatherNd_5}} = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Conv_7/BiasAdd, stack_48)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 495, in run
    self.run_loop()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 1034, in run_loop
    self._sv.global_step])
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[3,4,50,50,16] = [4, 51, 50, 16] does not index into param shape [5,51,51,17]
     [[node GatherNd_5 (defined at /home/ubuntu/personlab-tf/personlab/util.py:33)  = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Conv_7/BiasAdd, stack_48)]]

Caused by op 'GatherNd_5', defined at:
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 499, in start
    self.io_loop.start()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 132, in start
    self.asyncio_loop.run_forever()
  File "/home/ubuntu/anaconda3/lib/python3.6/asyncio/base_events.py", line 422, in run_forever
    self._run_once()
  File "/home/ubuntu/anaconda3/lib/python3.6/asyncio/base_events.py", line 1432, in _run_once
    handle._run()
  File "/home/ubuntu/anaconda3/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 122, in _handle_events
    handler_func(fileobj, events)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2662, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2785, in _run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2907, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2961, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-05888e252ab7>", line 8, in <module>
    train(mobilenet_v2_model, gen.loader, pm_check_path, log_dir)
  File "/home/ubuntu/personlab-tf/personlab/model.py", line 25, in train
    output, init_func = model_func(tensors['image'], checkpoint_path=checkpoint_path, is_training=True)
  File "/home/ubuntu/personlab-tf/personlab/models/mobilenet_v2.py", line 16, in mobilenet_v2_model
    res = model_base(model_output, inner_h, inner_w)
  File "/home/ubuntu/personlab-tf/personlab/models/model_base.py", line 36, in model_base
    lo_y = gather_bilinear(lo_y, lo_p, (inner_h, inner_w)) + lo_y
  File "/home/ubuntu/personlab-tf/personlab/util.py", line 33, in gather_bilinear
    r = tf.gather_nd(params, idx)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3240, in gather_nd
    "GatherNd", params=params, indices=indices, name=name)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[3,4,50,50,16] = [4, 51, 50, 16] does not index into param shape [5,51,51,17]
     [[node GatherNd_5 (defined at /home/ubuntu/personlab-tf/personlab/util.py:33)  = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Conv_7/BiasAdd, stack_48)]]

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1333     try:
-> 1334       return fn(*args)
   1335     except errors.OpError as e:

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1318       return self._call_tf_sessionrun(
-> 1319           options, feed_dict, fetch_list, target_list, run_metadata)
   1320 

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1406         self._session, options, feed_dict, fetch_list, target_list,
-> 1407         run_metadata)
   1408 

InvalidArgumentError: indices[1,4,50,50,36] = [4, 49, 51, 5] does not index into param shape [5,51,51,17]
     [[{{node GatherNd_1}} = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Conv_3/BiasAdd, stack_8)]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py in managed_session(self, master, config, start_standard_services, close_summary_writer)
    993           start_standard_services=start_standard_services)
--> 994       yield sess
    995     except Exception as e:

~/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py in train(train_op, logdir, train_step_fn, train_step_kwargs, log_every_n_steps, graph, master, is_chief, global_step, number_of_steps, init_op, init_feed_dict, local_init_op, init_fn, ready_op, summary_op, save_summaries_secs, summary_writer, startup_delay_steps, saver, save_interval_secs, sync_optimizer, session_config, session_wrapper, trace_every_n_steps, ignore_live_threads)
    769             total_loss, should_stop = train_step_fn(
--> 770                 sess, train_op, global_step, train_step_kwargs)
    771             if should_stop:

~/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py in train_step(sess, train_op, global_step, train_step_kwargs)
    486                                         options=trace_run_options,
--> 487                                         run_metadata=run_metadata)
    488   time_elapsed = time.time() - start_time

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    928       result = self._run(None, fetches, feed_dict, options_ptr,
--> 929                          run_metadata_ptr)
    930       if run_metadata:

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1151       results = self._do_run(handle, final_targets, final_fetches,
-> 1152                              feed_dict_tensor, options, run_metadata)
   1153     else:

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1327       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1328                            run_metadata)
   1329     else:

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1347       message = error_interpolation.interpolate(message, self._graph)
-> 1348       raise type(e)(node_def, op, message)
   1349 

InvalidArgumentError: indices[1,4,50,50,36] = [4, 49, 51, 5] does not index into param shape [5,51,51,17]
     [[node GatherNd_1 (defined at /home/ubuntu/personlab-tf/personlab/util.py:33)  = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Conv_3/BiasAdd, stack_8)]]

Caused by op 'GatherNd_1', defined at:
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 499, in start
    self.io_loop.start()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 132, in start
    self.asyncio_loop.run_forever()
  File "/home/ubuntu/anaconda3/lib/python3.6/asyncio/base_events.py", line 422, in run_forever
    self._run_once()
  File "/home/ubuntu/anaconda3/lib/python3.6/asyncio/base_events.py", line 1432, in _run_once
    handle._run()
  File "/home/ubuntu/anaconda3/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 122, in _handle_events
    handler_func(fileobj, events)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2662, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2785, in _run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2907, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2961, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-05888e252ab7>", line 8, in <module>
    train(mobilenet_v2_model, gen.loader, pm_check_path, log_dir)
  File "/home/ubuntu/personlab-tf/personlab/model.py", line 25, in train
    output, init_func = model_func(tensors['image'], checkpoint_path=checkpoint_path, is_training=True)
  File "/home/ubuntu/personlab-tf/personlab/models/mobilenet_v2.py", line 16, in mobilenet_v2_model
    res = model_base(model_output, inner_h, inner_w)
  File "/home/ubuntu/personlab-tf/personlab/models/model_base.py", line 30, in model_base
    mo_y = gather_bilinear(so_y, mo_p, (inner_h, inner_w)) + mo_y
  File "/home/ubuntu/personlab-tf/personlab/util.py", line 33, in gather_bilinear
    r = tf.gather_nd(params, idx)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3240, in gather_nd
    "GatherNd", params=params, indices=indices, name=name)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[1,4,50,50,36] = [4, 49, 51, 5] does not index into param shape [5,51,51,17]
     [[node GatherNd_1 (defined at /home/ubuntu/personlab-tf/personlab/util.py:33)  = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Conv_3/BiasAdd, stack_8)]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-2-05888e252ab7> in <module>()
      6 log_dir = 'logs/sample/'
      7 
----> 8 train(mobilenet_v2_model, gen.loader, pm_check_path, log_dir)

~/personlab-tf/personlab/model.py in train(model_func, data_generator, checkpoint_path, log_dir)
     77                                    log_every_n_steps=100,
     78                                    save_summaries_secs=300,
---> 79                                    session_config=sess_config,
     80                                   )
     81 

~/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py in train(train_op, logdir, train_step_fn, train_step_kwargs, log_every_n_steps, graph, master, is_chief, global_step, number_of_steps, init_op, init_feed_dict, local_init_op, init_fn, ready_op, summary_op, save_summaries_secs, summary_writer, startup_delay_steps, saver, save_interval_secs, sync_optimizer, session_config, session_wrapper, trace_every_n_steps, ignore_live_threads)
    783               threads,
    784               close_summary_writer=True,
--> 785               ignore_live_threads=ignore_live_threads)
    786 
    787     except errors.AbortedError:

~/anaconda3/lib/python3.6/contextlib.py in __exit__(self, type, value, traceback)
     97                 value = type()
     98             try:
---> 99                 self.gen.throw(type, value, traceback)
    100             except StopIteration as exc:
    101                 # Suppress StopIteration *unless* it's the same exception that

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py in managed_session(self, master, config, start_standard_services, close_summary_writer)
   1002         # threads which are not checking for `should_stop()`.  They
   1003         # will be stopped when we close the session further down.
-> 1004         self.stop(close_summary_writer=close_summary_writer)
   1005       finally:
   1006         # Close the session to finish up all pending calls.  We do not care

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py in stop(self, threads, close_summary_writer, ignore_live_threads)
    830           threads,
    831           stop_grace_period_secs=self._stop_grace_secs,
--> 832           ignore_live_threads=ignore_live_threads)
    833     finally:
    834       # Close the writer last, in case one of the running threads was using it.

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py in join(self, threads, stop_grace_period_secs, ignore_live_threads)
    387       self._registered_threads = set()
    388       if self._exc_info_to_raise:
--> 389         six.reraise(*self._exc_info_to_raise)
    390       elif stragglers:
    391         if ignore_live_threads:

~/anaconda3/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py in stop_on_exception(self)
    295     """
    296     try:
--> 297       yield
    298     except:  # pylint: disable=bare-except
    299       self.request_stop(ex=sys.exc_info())

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py in run(self)
    493         while not self._coord.wait_for_stop(next_timer_time - time.time()):
    494           next_timer_time += self._timer_interval_secs
--> 495           self.run_loop()
    496       self.stop_loop()
    497 

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py in run_loop(self)
   1032     if self._sv.global_step is not None:
   1033       summary_strs, global_step = self._sess.run([self._sv.summary_op,
-> 1034                                                   self._sv.global_step])
   1035     else:
   1036       summary_strs = self._sess.run(self._sv.summary_op)

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    927     try:
    928       result = self._run(None, fetches, feed_dict, options_ptr,
--> 929                          run_metadata_ptr)
    930       if run_metadata:
    931         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1150     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1151       results = self._do_run(handle, final_targets, final_fetches,
-> 1152                              feed_dict_tensor, options, run_metadata)
   1153     else:
   1154       results = []

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1326     if handle is None:
   1327       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1328                            run_metadata)
   1329     else:
   1330       return self._do_call(_prun_fn, handle, feeds, fetches)

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1346           pass
   1347       message = error_interpolation.interpolate(message, self._graph)
-> 1348       raise type(e)(node_def, op, message)
   1349 
   1350   def _extend_graph(self):

InvalidArgumentError: indices[3,4,50,50,16] = [4, 51, 50, 16] does not index into param shape [5,51,51,17]
     [[node GatherNd_5 (defined at /home/ubuntu/personlab-tf/personlab/util.py:33)  = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Conv_7/BiasAdd, stack_48)]]

Caused by op 'GatherNd_5', defined at:
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 499, in start
    self.io_loop.start()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 132, in start
    self.asyncio_loop.run_forever()
  File "/home/ubuntu/anaconda3/lib/python3.6/asyncio/base_events.py", line 422, in run_forever
    self._run_once()
  File "/home/ubuntu/anaconda3/lib/python3.6/asyncio/base_events.py", line 1432, in _run_once
    handle._run()
  File "/home/ubuntu/anaconda3/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 122, in _handle_events
    handler_func(fileobj, events)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2662, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2785, in _run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2907, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2961, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-05888e252ab7>", line 8, in <module>
    train(mobilenet_v2_model, gen.loader, pm_check_path, log_dir)
  File "/home/ubuntu/personlab-tf/personlab/model.py", line 25, in train
    output, init_func = model_func(tensors['image'], checkpoint_path=checkpoint_path, is_training=True)
  File "/home/ubuntu/personlab-tf/personlab/models/mobilenet_v2.py", line 16, in mobilenet_v2_model
    res = model_base(model_output, inner_h, inner_w)
  File "/home/ubuntu/personlab-tf/personlab/models/model_base.py", line 36, in model_base
    lo_y = gather_bilinear(lo_y, lo_p, (inner_h, inner_w)) + lo_y
  File "/home/ubuntu/personlab-tf/personlab/util.py", line 33, in gather_bilinear
    r = tf.gather_nd(params, idx)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3240, in gather_nd
    "GatherNd", params=params, indices=indices, name=name)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[3,4,50,50,16] = [4, 51, 50, 16] does not index into param shape [5,51,51,17]
     [[node GatherNd_5 (defined at /home/ubuntu/personlab-tf/personlab/util.py:33)  = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Conv_7/BiasAdd, stack_48)]]
alexryan commented 5 years ago

I have this error as well.

INFO:tensorflow:Error reported to Coordinator: flat indices[1728695, :] = [2, 25, 51, 6] does not index into param (shape: [5,51,51,17]).
     [[Node: GatherNd_1 = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Conv_3/BiasAdd, stack_8)]]

Caused by op 'GatherNd_1', defined at:
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 505, in start
    self.io_loop.start()
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 132, in start
    self.asyncio_loop.run_forever()
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/asyncio/base_events.py", line 422, in run_forever
    self._run_once()
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/asyncio/base_events.py", line 1434, in _run_once
    handle._run()
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tornado/gen.py", line 1233, in inner
    self.run()
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 357, in process_one
    yield gen.maybe_future(dispatch(*args))
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 267, in dispatch_shell
    yield gen.maybe_future(handler(stream, idents, msg))
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 534, in execute_request
    user_expressions, allow_stdin,
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 294, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 536, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2819, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2845, in _run_cell
    return runner(coro)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/IPython/core/async_helpers.py", line 67, in _pseudo_sync_runner
    coro.send(None)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3020, in run_cell_async
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3191, in run_ast_nodes
    if (yield from self.run_code(code, result)):
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 8, in <module>
    train(mobilenet_v2_model, gen.loader, pm_check_path, log_dir)
  File "/home/ubuntu/personlab-tf/personlab/model.py", line 25, in train
    output, init_func = model_func(tensors['image'], checkpoint_path=checkpoint_path, is_training=True)
  File "/home/ubuntu/personlab-tf/personlab/models/mobilenet_v2.py", line 16, in mobilenet_v2_model
    res = model_base(model_output, inner_h, inner_w)
  File "/home/ubuntu/personlab-tf/personlab/models/model_base.py", line 30, in model_base
    mo_y = gather_bilinear(so_y, mo_p, (inner_h, inner_w)) + mo_y
  File "/home/ubuntu/personlab-tf/personlab/util.py", line 33, in gather_bilinear
    r = tf.gather_nd(params, idx)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3052, in gather_nd
    "GatherNd", params=params, indices=indices, name=name)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
    op_def=op_def)
  File "/home/ubuntu/miniconda3/envs/personlab-tf/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): flat indices[1728695, :] = [2, 25, 51, 6] does not index into param (shape: [5,51,51,17]).
     [[Node: GatherNd_1 = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Conv_3/BiasAdd, stack_8)]]

sydsim commented 5 years ago

@jrbasso @alexryan Sorry for the late reply. It seems there is a bug that lets the offset vectors point outside the feature-map boundary. The error is raised only in a CPU environment; in a GPU environment the out-of-bounds index is ignored (https://github.com/tensorflow/tensorflow/issues/15091). I'll try to fix it as soon as possible. If you find a fix, please send a pull request.
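
To make the failure concrete, here is a small plain-Python sketch (not the repo's code) that checks the indices from the two error messages above against the reported param shape and shows what clamping them to the valid range would give. The helper names `out_of_bounds_axes` and `clamp_index` are made up for this illustration. In the graph itself the analogous step would presumably be something like `tf.clip_by_value` on the index tensor before the `tf.gather_nd` call in `gather_bilinear`, but that is an assumption and not a tested fix.

```python
def out_of_bounds_axes(index, shape):
    """Return the axes on which `index` falls outside `shape`."""
    return [axis for axis, (i, size) in enumerate(zip(index, shape))
            if not 0 <= i < size]


def clamp_index(index, shape):
    """Clamp each coordinate into the valid range [0, size - 1]."""
    return [min(max(i, 0), size - 1) for i, size in zip(index, shape)]


# Values copied from the error messages in this thread.
shape = [5, 51, 51, 17]
first_bad_index = [4, 51, 50, 16]   # 51 on axis 1, valid range is 0..50
second_bad_index = [2, 25, 51, 6]   # 51 on axis 2, valid range is 0..50

for idx in (first_bad_index, second_bad_index):
    print(idx, "out of bounds on axes", out_of_bounds_axes(idx, shape),
          "-> clamped", clamp_index(idx, shape))
```

In both reports a spatial coordinate is exactly one past the last valid position, which is consistent with an offset pushing the sample point just over the edge of the 51x51 map.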

jrbasso commented 5 years ago

@sydsim I actually ran into this issue in a GPU environment. Do you have any idea what I can try to mitigate it? Or any clue that I can research to try to fix it? Thanks.