strubell / LISA

Linguistically-Informed Self-Attention implemented in TensorFlow
Apache License 2.0

Invalid argument error during training #2

Open acDante opened 5 years ago

acDante commented 5 years ago

Hello Ms. Strubell :-) I am trying to train and evaluate your LISA model on the CoNLL-2005 dataset. I followed the recipe in https://github.com/strubell/preprocess-conll05 for preprocessing the CoNLL-2005 data and adapted the data paths in the configuration file accordingly. When I run training, the TensorFlow initialization steps seem to work normally, but right after "filling up the shuffle buffer" I immediately get the error below. Do you have any idea what might cause this error? Also, do you have any pretrained models for the CoNLL-2005 dataset?

2018-10-18 23:39:20.446629: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:135] Shuffle buffer filled.

Traceback (most recent call last):
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
    return fn(*args)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[862] = 5199 is not in [0, 1968)
  [[Node: LISA/Nadam/update_LISA/word_type_embeddings/embeddings/GatherV2_3 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@LISA/Nadam/update_LISA/word_type_embeddings/embeddings/ScatterAdd"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](LISA/Nadam/update_LISA/word_type_embeddings/embeddings/add_1, LISA/Nadam/update_LISA/word_type_embeddings/embeddings/Unique, LISA/Nadam/update_LISA/word_type_embeddings/embeddings/GatherV2_3/axis)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/train.py", line 143, in <module>
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 451, in train_and_evaluate
    return executor.run()
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 590, in run
    return self.run_local()
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 691, in run_local
    saving_listeners=saving_listeners)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 376, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1145, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1173, in _train_model_default
    saving_listeners)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1451, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 583, in run
    run_metadata=run_metadata)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1059, in run
    run_metadata=run_metadata)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1150, in run
    raise six.reraise(*original_exc_info)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1135, in run
    return self._sess.run(*args, **kwargs)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1207, in run
    run_metadata=run_metadata)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 987, in run
    return self._sess.run(*args, **kwargs)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
    run_metadata_ptr)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
    feed_dict_tensor, options, run_metadata)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
    run_metadata)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[862] = 5199 is not in [0, 1968) [[Node: LISA/Nadam/update_LISA/word_type_embeddings/embeddings/GatherV2_3 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@LISA/Nadam/update_LISA/word_type_embeddings/embeddings/ScatterAdd"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](LISA/Nadam/update_LISA/word_type_embeddings/embeddings/add_1, LISA/Nadam/update_LISA/word_type_embeddings/embeddings/Unique, LISA/Nadam/update_LISA/word_type_embeddings/embeddings/GatherV2_3/axis)]]

Caused by op 'LISA/Nadam/update_LISA/word_type_embeddings/embeddings/GatherV2_3', defined at:
  File "src/train.py", line 143, in <module>
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 451, in train_and_evaluate
    return executor.run()
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 590, in run
    return self.run_local()
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 691, in run_local
    saving_listeners=saving_listeners)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 376, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1145, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1170, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1133, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/Users/xiaotang/Documents/SRL/LISA/src/model.py", line 294, in model_fn
    train_op = optimizer.apply_gradients(zip(gradients, variables), global_step=tf.train.get_global_step())
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/contrib/optimizer_v2/optimizer_v2.py", line 866, in apply_gradients
    self._distributed_apply, filtered, global_step=global_step, name=name)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 1053, in merge_call
    return self._merge_call(merge_fn, *args, **kwargs)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 1060, in _merge_call
    return merge_fn(self._distribution_strategy, *args, **kwargs)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/contrib/optimizer_v2/optimizer_v2.py", line 964, in _distributed_apply
    var, update, grad)))
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 868, in update
    return self._update(var, fn, *args, **kwargs)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 1144, in _update
    return fn(var, *args, **kwargs)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/contrib/optimizer_v2/optimizer_v2.py", line 958, in update
    return processor.update_op(self, g, state)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/contrib/optimizer_v2/optimizer_v2.py", line 81, in update_op
    return optimizer._apply_sparse_duplicate_indices(g, self._v, *args)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/contrib/optimizer_v2/optimizer_v2.py", line 1204, in _apply_sparse_duplicate_indices
    return self._apply_sparse(gradient_no_duplicate_indices, var, state)
  File "/Users/xiaotang/Documents/SRL/LISA/src/lazy_adam_v2.py", line 228, in _apply_sparse
    state)
  File "/Users/xiaotang/Documents/SRL/LISA/src/lazy_adam_v2.py", line 212, in _apply_sparse_shared
    m_bar_slice = array_ops.gather(m_bar, indices)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2659, in gather
    return gen_array_ops.gather_v2(params, indices, axis, name=name)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3142, in gather_v2
    "GatherV2", params=params, indices=indices, axis=axis, name=name)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
    op_def=op_def)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[862] = 5199 is not in [0, 1968) [[Node: LISA/Nadam/update_LISA/word_type_embeddings/embeddings/GatherV2_3 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@LISA/Nadam/update_LISA/word_type_embeddings/embeddings/ScatterAdd"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](LISA/Nadam/update_LISA/word_type_embeddings/embeddings/add_1, LISA/Nadam/update_LISA/word_type_embeddings/embeddings/Unique, LISA/Nadam/update_LISA/word_type_embeddings/embeddings/GatherV2_3/axis)]]
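
For reference, here is a minimal standalone snippet (not LISA code; the shapes are only illustrative) that triggers the same class of error, in case it helps narrow things down:

import tensorflow as tf

# Hypothetical illustration only: tf.gather raises this InvalidArgumentError
# when an index falls outside the first dimension of the parameter tensor,
# e.g. an id larger than the embedding table.
params = tf.random_normal([1968, 100])   # table with 1968 rows
indices = tf.constant([5, 42, 5199])     # 5199 is out of range

with tf.Session() as sess:
    # On CPU this fails with: indices[2] = 5199 is not in [0, 1968)
    sess.run(tf.gather(params, indices))

(As far as I know, in TF 1.x an out-of-range gather raises on CPU but silently writes zeros on GPU, which may be why the same run behaves differently on GPU.)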

strubell commented 5 years ago

This looks like an optimizer bug. I just tested the master branch w/ tensorflow v1.9 and 1.10 on gpu and can't replicate. What version of tensorflow are you using, and are you running on gpu or cpu?
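
For example, a quick way to check both (assuming a standard TF 1.x install):

import tensorflow as tf

print(tf.__version__)               # e.g. 1.9.0 or 1.10.1
print(tf.test.is_gpu_available())   # False means TensorFlow only sees the CPU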

acDante commented 5 years ago

Thanks for the help, Ms. Strubell :-) My environment is Python 3.6.6 and TensorFlow 1.10.1, and the error occurred when I ran on CPU. I also suspect the format of my processed CoNLL-2005 data is incorrect, since it looks a bit different from the examples in your repository (I set the path to the WSJ test set as /treebank2/combined/wsj, and I did not find a valid path to the Brown test set...)

strubell commented 5 years ago

It could be a cpu-specific issue, or it could be the data formatting (or both). Is it possible for you to try running on a gpu?

I don't think this specific error is caused by the data format, but that could also be a separate issue. Can you paste a few example lines of your pre-processed data? It should look exactly like the example in the data preprocessing repo here: https://github.com/strubell/preprocess-conll05#further-pre-processing-eg-for-lisa

acDante commented 5 years ago

My training dataset looks correct to me:

conll05 141 0 They PRP PRP 2 nsubj - - - - O B-A0
conll05 141 1 attached VBD VBD 0 root 01 attach - - O B-V
conll05 141 2 a DT DT 5 det - - - - O B-A1
conll05 141 3 second JJ JJ 5 amod - - - - O I-A1
conll05 141 4 gene NN NN 2 dobj - - - - O I-A1
conll05 141 5 , , , 5 punct - - - - O I-A1
conll05 141 6 for IN IN 5 prep - - - - O I-A1
conll05 141 7 herbicide NN NN 9 nn - - - - O I-A1
conll05 141 8 resistance NN NN 7 pobj - - - - O I-A1
conll05 141 9 , , , 5 punct - - - - O I-A1
conll05 141 10 to TO TO 2 prep - - - - O B-A1
conll05 141 11 the DT DT 14 det - - - - O I-A1
conll05 141 12 pollen-inhibiting JJ JJ 14 amod - - - - O I-A1
conll05 141 13 gene NN NN 11 pobj - - - - O I-A1
conll05 141 14 . . . 2 punct _ - - - - O O

conll05 142 0 Both DT DT 2 det - - - - O B-A1 O O O O
conll05 142 1 genes NNS NNS 5 nsubjpass - - - - O I-A1 O O O O
conll05 142 2 are VBP VBP 5 auxpass - - - - O O O O O O
conll05 142 3 then RB RB 5 advmod - - - - O B-AM-TMP O O O O
conll05 142 4 inserted VBN VBN 0 root 01 insert - - O B-V O O O O
conll05 142 5 into IN IN 5 prep - - - - O B-A2 O O O O
conll05 142 6 a DT DT 10 det - - - - O I-A2 B-A1 B-A1 B-A1 B-A0
conll05 142 7 few JJ JJ 10 amod - - - - O I-A2 I-A1 I-A1 I-A1 I-A0
conll05 142 8 greenhouse NN NN 10 nn - - - - O I-A2 I-A1 I-A1 I-A1 I-A0
conll05 142 9 plants NNS NNS 6 pobj - - - - O I-A2 I-A1 I-A1 I-A1 I-A0
conll05 142 10 , , , 10 punct - - - - O I-A2 O O O O
conll05 142 11 which WDT WDT 15 nsubjpass - - - - O I-A2 B-R-A1 B-C-A1 B-R-A1 B-R-A0
conll05 142 12 are VBP VBP 15 auxpass - - - - O I-A2 O O O O
conll05 142 13 then RB RB 15 advmod - - - - O I-A2 B-AM-TMP B-AM-TMP O O
conll05 142 14 pollinated VBN VBN 10 rcmod 01 pollinate - - O I-A2 B-V O O O
conll05 142 15 and CC CC 15 cc - - - - O I-A2 O O O O
conll05 142 16 allowed VBN VBN 15 conj 01 allow - - O I-A2 O B-V O O
conll05 142 17 to TO TO 19 aux - - - - O I-A2 O B-C-A1 O O
conll05 142 18 mature VB VB 17 xcomp 01 mature - - O I-A2 O I-C-A1 B-V O
conll05 142 19 and CC CC 19 cc - - - - O I-A2 O I-C-A1 O O
conll05 142 20 produce VB VB 19 conj 01 produce - - O I-A2 O I-C-A1 O B-V
conll05 142 21 seed NN NN 21 dobj - - - - O I-A2 O I-C-A1 O B-A1
conll05 142 22 . . . 5 punct _ - - - - O O O O O O

But my WSJ test set only contains these columns:

conll05 7 0 DT - -
conll05 7 1 NN - -
conll05 7 2 VBZ - -
conll05 7 3 RB - -
conll05 7 4 VBN - -
conll05 7 5 . - -

conll05 8 0 `` - -
conll05 8 1 DT - -
conll05 8 2 NN - -
conll05 8 3 NN - -
conll05 8 4 VBD - -
conll05 8 5 JJ - -
conll05 8 6 . - -

Is this data format correct? I will also try running the experiments on GPU later to check :-) Thanks for the advice!

Impavidity commented 5 years ago

@acDante Did you fix this problem? I got exactly the same error as you.

strubell commented 5 years ago

You need to add some dummy parse/srl labels to the test set, as right now the code expects to evaluate with respect to gold labels.
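
Something along these lines should work as a rough sketch; it assumes whitespace-separated columns, and the file names, column count, and dummy value are placeholders you would adapt to your own files and data config:

# Rough sketch (not LISA code): pad test-set rows with dummy parse/SRL fields
# so every row has as many columns as the training data.
NUM_TRAIN_FIELDS = 14  # placeholder: count the columns in your training file

with open("wsj-test.txt") as fin, open("wsj-test.dummy.txt", "w") as fout:
    for line in fin:
        fields = line.split()
        if not fields:                       # keep blank lines between sentences
            fout.write("\n")
            continue
        fields += ["-"] * (NUM_TRAIN_FIELDS - len(fields))
        fout.write(" ".join(fields) + "\n")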

Impavidity commented 5 years ago

I think my test data is in the same format as the training and dev data (with parse and SRL info).

strubell commented 5 years ago

Did you try w/ tf 1.9 or 1.10 on gpu?

Impavidity commented 5 years ago

Yeah. It gives me a segmentation fault, with no error message at all.

strubell commented 5 years ago

Is there any output before the segfault, and are you sure that your tensorflow installation otherwise works?
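
For example, something as small as this (assuming TF 1.x) should run without crashing if the installation itself is fine:

import tensorflow as tf

# Tiny sanity check: if even this segfaults, the problem is the
# TensorFlow/CUDA setup rather than LISA.
print(tf.__version__)
with tf.Session() as sess:
    print(sess.run(tf.matmul(tf.ones([2, 2]), tf.ones([2, 2]))))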

Impavidity commented 5 years ago
2019-01-17 20:24:31.656029: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:135] Shuffle buffer filled.
bin/train.sh: line 20: 18297 Segmentation fault      (core dumped) python3 src/train.py --train_files $train_files --dev_files $dev_files --transition_stats $transition_stats --data_config $data_config --model_configs $model_configs --task_configs $task_configs --layer_configs $layer_configs --attention_configs $attention_configs $params
acDante commented 5 years ago

Hi @Impavidity, I managed to fix this error by downgrading my TensorFlow to 1.9.0. I guess it results from an incompatibility between the TensorFlow version and CUDA. What are your CUDA and cuDNN versions?

Impavidity commented 5 years ago

I tried both 1.9.0 and 1.10.0, with CUDA 9.0 and cuDNN 7.

Impavidity commented 5 years ago

I think you are right. There might be some incompatibility issue here. @acDante @strubell Thank you all.