guillaumegenthial / tf_ner

Simple and Efficient Tensorflow implementations of NER models with tf.estimator and tf.data
Apache License 2.0
923 stars, 275 forks

0 precision 0 recall for some custom tags #55

Closed gui-li closed 5 years ago

gui-li commented 5 years ago

I tried to run my own dataset with your lstm_crf code, but some of the tags got 0 precision and 0 recall, specifically the phone_number and email tags. I was wondering what is happening. Is it a problem with the embeddings? I noticed that phone numbers and emails are not in the glove.840B.300d.txt embedding vectors. What should I do to train these two tags effectively so that they no longer score 0? Thank you in advance for your help.

gui-li commented 5 years ago

When I trained the lstm_crf model with only the phone_number and email tags, the recall and precision were not 0. But I want to predict all labels together so that I can compare the results with other models.

gui-li commented 5 years ago

Yes, it's a problem with the word embeddings. Some of the out-of-vocabulary words are mapped to an all-zero vector, and the network will likely ignore them when training of the word embeddings is set to False. However, there are some solutions:

  1. You can choose one of the other models that use character embeddings. However, all of the provided models with character embeddings concatenate them with the word embeddings, so the zero-vector problem still exists. It is only relieved by the character embeddings.
  2. You can train your own embeddings on your own corpus. However, it's a nightmare if you don't have enough time or computational resources.
  3. You can use an out-of-the-box OOV tool. These tools produce embeddings for out-of-vocabulary words using their own methods. One of them uses the context around an OOV word to build its embedding.
  4. Like the first method, but different: you can build your own vocabulary in any form. Characters and n-grams are both good choices. The OOV problem then no longer exists, as long as you don't insist on using word embeddings. That's my solution for this issue. If anything I said is wrong, or you have another solution, please point it out.
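For anyone landing here later, the fourth option can be sketched roughly as follows. This is only a minimal illustration, assuming character trigrams as the vocabulary units. The names `char_ngrams`, `build_vocab`, and `encode` are hypothetical helpers and not part of tf_ner, and the tokens are made-up examples.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # reserved ids 0 and 1 (an assumption of this sketch)


def char_ngrams(token, n=3):
    """Split a token into character n-grams, with '<' and '>' as boundary markers."""
    marked = "<" + token + ">"
    if len(marked) <= n:
        return [marked]
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def build_vocab(sentences, n=3, min_count=1):
    """Map every n-gram seen at least min_count times to an integer id."""
    counts = Counter(
        gram for sent in sentences for tok in sent for gram in char_ngrams(tok, n)
    )
    vocab = {PAD: 0, UNK: 1}
    for gram, count in sorted(counts.items()):
        if count >= min_count:
            vocab[gram] = len(vocab)
    return vocab


def encode(token, vocab, n=3):
    """Turn a token into n-gram ids; unseen n-grams fall back to the UNK id."""
    return [vocab.get(gram, vocab[UNK]) for gram in char_ngrams(token, n)]


# Made-up training tokens: one phone number, one email address.
train = [["call", "555-0134", "now"], ["mail", "bob@example.com"]]
vocab = build_vocab(train)

# An unseen phone number still shares most of its trigrams with the seen one,
# so it is not reduced to a single unknown or all-zero word vector.
ids = encode("555-0199", vocab)
```

The ids would then feed a trainable embedding layer whose outputs are pooled per token (for example by summing or with a small LSTM), so unseen phone numbers and emails still receive a non-zero representation.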