localminimum / R-net

A Tensorflow Implementation of R-net: Machine reading comprehension with self matching networks
MIT License

Training failed after 8% with error "InvalidArgumentError (see above for traceback): indices[32,70] = 91604 is not in [0, 91604)" #29

Closed brojokm closed 6 years ago

brojokm commented 6 years ago

Training failed after 8%:

Dev_loss: 3.72519350052 Dev_Exact_match: 0.1 Dev_F1_score: 0.178666666667
8%|██▏ | 350/4129 [2:30:47<27:08:09, 25.85s/b]
Traceback (most recent call last):
  File "model.py", line 293, in main()
  File "model.py", line 269, in main
    index, dev_loss = sess.run([model.output_index, model.mean_loss], feed_dict = feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1140, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[32,70] = 91604 is not in [0, 91604)
  [[Node: passage_embeddings/embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@word_embeddings"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](word_embeddings/read, _arg_batch_0_0)]]

Caused by op u'passage_embeddings/embedding_lookup', defined at:
  File "model.py", line 293, in main()
  File "model.py", line 243, in main
    model = Model(is_training = True); print("Built model")
  File "model.py", line 70, in __init__
    self.encode_ids()
  File "model.py", line 95, in encode_ids
    scope = "passage_embeddings")
  File "/home/R-net-NW/layers.py", line 48, in encoding
    word_encoding = tf.nn.embedding_lookup(word_embeddings, word)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/embedding_ops.py", line 327, in embedding_lookup
    transform_fn=None)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/embedding_ops.py", line 151, in _embedding_lookup_and_transform
    result = _clip(_gather(params[0], ids, name=name), ids, max_norm)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/embedding_ops.py", line 55, in _gather
    return array_ops.gather(params, ids, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 2698, in gather
    params, indices, validate_indices=validate_indices, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2672, in gather
    validate_indices=validate_indices, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): indices[32,70] = 91604 is not in [0, 91604) [[Node: passage_embeddings/embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@word_embeddings"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](word_embeddings/read, _arg_batch_0_0)]]

ghost commented 6 years ago

As mentioned in #28, try increasing the dictionary size to 91605 in hyperparameters.py, delete "data/train" and "data/dev", and run python process.py -p True again to reprocess the data.
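
The error reports id 91604 against the valid range [0, 91604), which suggests the embedding table is one row short of what the data needs. A minimal sketch of that check in plain Python follows; `required_vocab_size` and the sample ids are hypothetical illustrations, not code from this repository.

```python
def required_vocab_size(id_batches):
    """Return the smallest embedding-table size that covers every id.

    A table with N rows accepts ids 0..N-1, so N must be max_id + 1.
    """
    max_id = max(max(row) for row in id_batches)
    return max_id + 1


# Hypothetical token ids; 91604 is the id reported in the traceback.
sample_ids = [[12, 7, 91604], [3, 0, 45]]
vocab_size = 91604  # size implied by the range [0, 91604) in the error

needed = required_vocab_size(sample_ids)
if needed > vocab_size:
    print("Embedding table too small: need {} rows, have {}".format(needed, vocab_size))
```

Under these assumptions the table needs 91605 rows, which matches the size suggested above.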

brojokm commented 6 years ago

@minsangkim142 Now I am getting the following error:

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [91605,300] rhs shape= [91604,300] [[Node: save/Assign_270 = Assign[T=DT_FLOAT, _class=["loc:@word_embeddings"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](word_embeddings, save/RestoreV2:270)]]

brojokm commented 6 years ago

Here are the details:

Traceback (most recent call last):
  File "model.py", line 293, in main()
  File "model.py", line 258, in main
    with sv.managed_session(config = config) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 275, in prepare_session
    config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 207, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1775, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1140, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [91605,300] rhs shape= [91604,300]
  [[Node: save/Assign_270 = Assign[T=DT_FLOAT, _class=["loc:@word_embeddings"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](word_embeddings, save/RestoreV2:270)]]

Caused by op u'save/Assign_270', defined at:
  File "model.py", line 293, in main()
  File "model.py", line 257, in main
    init_op = model.init_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 318, in __init__
    self._init_saver(saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 466, in _init_saver
    saver = saver_mod.Saver()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1311, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1320, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1357, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 809, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 470, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 162, in restore
    self.op.get_shape().is_fully_defined())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/state_ops.py", line 281, in assign
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 61, in assign
    use_locking=use_locking, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [91605,300] rhs shape= [91604,300] [[Node: save/Assign_270 = Assign[T=DT_FLOAT, _class=["loc:@word_embeddings"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](word_embeddings, save/RestoreV2:270)]]

ghost commented 6 years ago

Also remove the "train/train" directory and try again. The error is most likely caused by the previously saved model.
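
The "Assign requires shapes of both tensors to match" message says the graph now declares word_embeddings as [91605, 300] while the saved checkpoint still holds [91604, 300]. A rough sketch of that comparison is below; `find_shape_mismatches` and both dictionaries are hypothetical stand-ins, and in a real run the shapes would have to be read from the graph and the checkpoint.

```python
def find_shape_mismatches(graph_shapes, checkpoint_shapes):
    """Return {name: (graph_shape, saved_shape)} for variables that differ."""
    mismatches = {}
    for name, graph_shape in graph_shapes.items():
        saved_shape = checkpoint_shapes.get(name)
        if saved_shape is not None and tuple(saved_shape) != tuple(graph_shape):
            mismatches[name] = (tuple(graph_shape), tuple(saved_shape))
    return mismatches


# Hypothetical shapes taken from the error message (lhs = graph, rhs = checkpoint).
graph_shapes = {"word_embeddings": (91605, 300)}
checkpoint_shapes = {"word_embeddings": (91604, 300)}

mismatches = find_shape_mismatches(graph_shapes, checkpoint_shapes)
for name, (lhs, rhs) in mismatches.items():
    print("{}: graph {} vs checkpoint {}".format(name, lhs, rhs))
```

Any variable reported here cannot be restored from the old checkpoint, which is why deleting the saved model directory clears the error.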

brojokm commented 6 years ago

I ran it again after removing the directory.

ghost commented 6 years ago

I'll look into this problem when I get back to my computer.

brojokm commented 6 years ago

I am sorry. I deleted the folder and created a new one, and now it is working. Let's see if everything works fine. Kindly do not close the issue until tomorrow.

brojokm commented 6 years ago

Thanks for your support, @minsangkim142.

marc88 commented 5 years ago

I get a similar error when running an embedding layer defined as:

Embedding(23624, 50, input_length=5, trainable=False)

InvalidArgumentError (see above for traceback): indices[6,4] = 23624 is not in [0, 23624) [[Node: embedding_1/embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@embedding_1/embeddings"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast)]]

Each data point here is a number (an index). Upon checking indices[6,4], I found the following:

print(ar_train_data[6,4])
5088

ar_train_data is an array of shape (162896, 5) where each value is in [0, 23624). Training stops towards the end of the first epoch with the error above. I am puzzled: 5088 is nowhere near out of range for [0, 23624). Can anyone suggest what the issue could be? Let me know if additional code snippets are needed for clarity. Any help will be much appreciated.

Keras version: 2.2.4, TensorFlow version: 1.5.0

Regards
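
One thing worth checking for the error above: the position in indices[6,4] refers to the batch tensor passed to the lookup, not to the full training array, so ar_train_data[6,4] need not be the offending value. A minimal sketch that scans the whole array for ids outside the valid range follows; `find_out_of_range` and `train_data` are hypothetical, with `train_data` standing in for ar_train_data.

```python
def find_out_of_range(data, input_dim):
    """Return (row, col, value) for every id outside [0, input_dim)."""
    bad = []
    for r, row in enumerate(data):
        for c, value in enumerate(row):
            if not 0 <= value < input_dim:
                bad.append((r, c, value))
    return bad


# Hypothetical stand-in for ar_train_data; one id equals input_dim itself.
train_data = [[5, 17, 5088, 2, 9], [1, 23624, 4, 8, 0]]

offenders = find_out_of_range(train_data, 23624)
print(offenders)
```

If the scan finds anything, the first argument of the Embedding layer likely needs to be at least the largest id plus one. The same check may be worth running on any validation data, since that also passes through the layer.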

VN04 commented 5 years ago

@marc88 Did you find a solution to the error you described? I am getting the same error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[8,0] = 8 is not in [0, 8) [[{{node embedding_2/embedding_lookup}} = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_2/embeddings/read, _arg_input_4_0_1, embedding_2/embedding_lookup/axis)]]

ghost commented 5 years ago

Also, same issue here: tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[19,1] = 800 is not in [0, 500) [[Node: embedding_1/embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@embedding_1/embeddings"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast)]]

omar16100 commented 5 years ago

Similar error here:

InvalidArgumentError: indices[0,0] = 117397 is not in [0, 76616) [[Node: Restaurant-Embedding_1/embedding_lookup = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@training/Adam/Assign_2"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Restaurant-Embedding_1/embeddings/read, Restaurant-Embedding_1/Cast, training/Adam/gradients/Restaurant-Embedding_1/embedding_lookup_grad/concat/axis)]]

Somabhadra commented 5 years ago

I faced a similar issue. I used a different vocabulary size and it started working. Before that, I was using GloVe and got the error, so I did not use the pretrained model. Now my code is working fine. Basically, it is related to the word embedding.
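
One way this can happen with pretrained vectors, assuming word ids start at 1 with 0 reserved for padding (as the Keras Tokenizer does), is sizing the table as len(word_index) instead of len(word_index) + 1. The sketch below builds a matrix with the extra row; `build_embedding_matrix`, `word_index`, and `vectors` are hypothetical names and toy data, not taken from any code in this thread.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Build a row-per-id matrix; row 0 stays zero for the padding id."""
    num_rows = len(word_index) + 1  # ids run from 1 to len(word_index)
    matrix = [[0.0] * dim for _ in range(num_rows)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:  # words without a pretrained vector stay zero
            matrix[idx] = list(vec)
    return matrix


# Toy data: ids start at 1, and "sat" has no pretrained vector.
word_index = {"the": 1, "cat": 2, "sat": 3}
vectors = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}

matrix = build_embedding_matrix(word_index, vectors, dim=2)
print(len(matrix))
```

With this sizing, the largest id in word_index is always a valid row, so the lookup cannot fail on the highest-numbered word.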