guillaumegenthial / sequence_tagging

Named Entity Recognition (LSTM + CRF) - Tensorflow
https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html
Apache License 2.0

Unknow key is not allowed. Check that your vocab (tags?) is correct #63

Closed mrgonext closed 6 years ago

mrgonext commented 6 years ago

Hi. First, thank you for sharing the code and instructions. I'm trying to run it with CoNLL-2003, which I downloaded from https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003, and I changed the config:

```
filename_dev = "data/coNLL/eng/eng.testa"
filename_test = "data/coNLL/eng/eng.testb"
filename_train = "data/coNLL/eng/eng.train"

#filename_dev = filename_test = filename_train = "data/test.txt" # test
```

Building the data looks fine:

```
python build_data.py
Building vocab...
```

Then I ran training, but got this exception:

```
python train.py
WARNING:tensorflow:From /SourceCode/keras/sequence_tagging/env/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
From /SourceCode/keras/sequence_tagging/env/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
/SourceCode/keras/sequence_tagging/env/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Initializing tf session
2018-07-14 11:34:02.373966: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Epoch 1 out of 15
Traceback (most recent call last):
  File "train.py", line 26, in <module>
    main()
  File "train.py", line 23, in main
    model.train(train, dev)
  File "/SourceCode/keras/sequence_tagging/model/base_model.py", line 121, in train
    score = self.run_epoch(train, dev, epoch)
  File "/SourceCode/keras/sequence_tagging/model/ner_model.py", line 278, in run_epoch
    nbatches = (len(train) + batch_size - 1) // batch_size
  File "/SourceCode/keras/sequence_tagging/model/data_utils.py", line 88, in __len__
    for _ in self:
  File "/SourceCode/keras/sequence_tagging/model/data_utils.py", line 79, in __iter__
    tag = self.processing_tag(tag)
  File "/SourceCode/keras/sequence_tagging/model/data_utils.py", line 274, in f
    raise Exception("Unknow key is not allowed. Check that "\
Exception: Unknow key is not allowed. Check that your vocab (tags?) is correct
```

When I open tags.txt, its contents look strange:

data-line-number="30666"></td>
data-line-number="424"></td>
data-line-number="30258"></td>
data-line-number="28747"></td>
data-line-number="46862"></td>
data-line-number="50137"></td>
data-line-number="26256"></td>

I've attached tags.txt here: tags.txt

Could you help me figure out how to resolve this issue? I'm not sure whether I'm missing something.

Thank you.

My environment details: macOS High Sierra 10.13.1, Python 3.6, TensorFlow 1.7.0

joelthe1 commented 6 years ago

I had the same issue, and in my case it turned out the data set was tab-separated. The data processor in the code expects a single space to be the field separator. If this applies to you, open model/data_utils.py and, in the overridden method `__iter__`, change the line `ls = line.split(' ')` to `ls = line.split('\t')`.
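As a possible alternative to hard-coding one separator, here is a minimal sketch that accepts both space-separated and tab-separated files. The helper names `split_fields` and `parse_line` are hypothetical and not part of the repository; the sketch only assumes that the word is the first field and the tag is the last field on each line.

```python
def split_fields(line):
    """Split one CoNLL-style line into fields on any run of spaces or tabs.

    Hypothetical helper: str.split() with no argument splits on any
    whitespace, so the same code handles both separators.
    """
    return line.strip().split()


def parse_line(line):
    """Return (word, tag), assuming the word is first and the tag is last."""
    fields = split_fields(line)
    return fields[0], fields[-1]


if __name__ == "__main__":
    print(parse_line("EU NNP I-NP I-ORG\n"))   # space-separated row
    print(parse_line("EU\tNNP\tI-NP\tI-ORG\n"))  # tab-separated row
```

Calling `split()` without an argument also collapses repeated separators, which `split(' ')` does not, so this would only be appropriate if no field is expected to be empty.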

mrgonext commented 6 years ago

Thank you for your help. I've figured out the issue: the eng.testa file was bad because I had downloaded it the wrong way. I downloaded it again and it worked. Anyway, thank you.
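For anyone hitting the same error, a quick sanity check on the downloaded data files before running build_data.py may help. The sketch below is an assumption based on the tags.txt contents shown earlier in this thread (HTML fragments such as `data-line-number=` and `</td>`), which suggest that a web page was saved instead of the raw file. The function name `find_suspicious_lines` and the pattern are hypothetical, not part of the repository.

```python
import re

# Hypothetical pattern: a few HTML tags plus the attribute seen in tags.txt.
HTML_HINT = re.compile(
    r"<\s*/?\s*(html|td|tr|div|span|table)\b|data-line-number=",
    re.IGNORECASE,
)


def find_suspicious_lines(lines, max_report=5):
    """Return (line_number, text) pairs that look like HTML, not CoNLL rows."""
    suspicious = []
    for number, line in enumerate(lines, start=1):
        if HTML_HINT.search(line):
            suspicious.append((number, line.rstrip("\n")))
            if len(suspicious) >= max_report:
                break
    return suspicious


if __name__ == "__main__":
    import sys

    # Usage: python check_data.py data/coNLL/eng/eng.testa
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, text in find_suspicious_lines(handle):
            print(f"line {number}: {text}")
```

If the check prints anything, re-downloading the raw file rather than the rendered page is probably the first thing to try.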