mcavdar / NeuroNER

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.
http://neuroner.com

Unknown words - embedding #2

Closed mcavdar closed 6 years ago

mcavdar commented 6 years ago

Big problem: roughly a quarter of the words are unknown (1211 of 4593, about 26%):

number_of_unknown_tokens: 1211
len(token_count['train']): 4593

With Lionel's embedding it is worse (1593 of 4593, about 35%):

number_of_unknown_tokens: 1593
len(token_count['train']): 4593

The big French word embedding contains 39393 words: http://embeddings.org/frWiki_no_phrase_no_postag_700_cbow_cut100.bin
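For reference, the unknown-token figures above can be reproduced with a check along these lines. This is only a minimal sketch: `count_unknown`, `token_counts`, and `embedding_vocab` are hypothetical names with toy data, and NeuroNER's own counting may normalize tokens differently (for example by lowercasing) before the lookup.

```python
def count_unknown(token_counts, embedding_vocab):
    """Return (unknown, total, rate) over the distinct tokens in token_counts."""
    # A token is "unknown" when it has no entry in the embedding vocabulary.
    unknown = [tok for tok in token_counts if tok not in embedding_vocab]
    total = len(token_counts)
    rate = len(unknown) / total if total else 0.0
    return len(unknown), total, rate


# Hypothetical toy data, not the real corpus.
token_counts = {"le": 12, "patient": 5, "Refludan": 1, "dose": 3}
embedding_vocab = {"le", "patient", "dose"}

n_unk, n_total, rate = count_unknown(token_counts, embedding_vocab)
print(f"{n_unk}/{n_total} unknown ({rate:.1%})")  # 1/4 unknown (25.0%)
```

The count here is over distinct tokens, matching `len(token_count['train'])` above, not over token occurrences.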

mcavdar commented 6 years ago

Handling unknown words in language modeling tasks using LSTM

How to Train Good Word Embeddings for Biomedical NLP

mcavdar commented 6 years ago

Some statistics: number of unknown words, counted only in the annotated files (train + test + dev).

With the big French word embedding: total word count (from ann files): 6890, unknown word count: 6071

With Lionel's word embedding: total word count (from ann files): 6890, unknown word count: 6079

Script used: wordembed-fre.txt

mcavdar commented 6 years ago

Word embedding resources:

http://fauconnier.github.io/

https://github.com/Kyubyong/wordvectors

https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

mcavdar commented 6 years ago

Result with wordembeddingsEM.txt (the last one):

total word count (from ann files): 6890, unknown word count: 6105

I trained a new, small word embedding using only PDFs from MedlinePlus [vocab size: 10732, words in training file: 309237]:

total word count (from ann files): 6890, unknown word count: 5794

mcavdar commented 6 years ago

I made a mistake, sorry. I forgot to call split after reading each word, which is why the results were so bad. New results:

With wordembeddingcorpus.txt: total word count (from ann files): 6890, unknown word count: 2794

With wordembeddingsEM.txt: total word count (from ann files): 6890, unknown word count: 2927

With medlineplus+wikifr.txt: total word count (from ann files): 6890, unknown word count: 241

So 241 is good enough. Most of them are brand names or proper names such as 'Refludan' or 'Tasmar'.
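To put the corrected counts in proportion, the unknown-word rates can be worked out directly from the numbers reported above. The snippet is just arithmetic on those figures; the variable names are illustrative.

```python
total = 6890  # total word count from the ann files, as reported above

# Unknown word counts reported for each embedding file.
unknown_counts = {
    "wordembeddingcorpus.txt": 2794,
    "wordembeddingsEM.txt": 2927,
    "medlineplus+wikifr.txt": 241,
}

rates = {name: unk / total for name, unk in unknown_counts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
# wordembeddingcorpus.txt: 40.6%
# wordembeddingsEM.txt: 42.5%
# medlineplus+wikifr.txt: 3.5%
```

So the combined medlineplus + wikifr embedding leaves about 3.5% of the annotated words without a vector.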

New count script: wordembed-fre.txt
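The attached script is not reproduced here, so the following is only a guess at the kind of mistake described: if each line of a text-format embedding file (`word v1 v2 ...`) is used as the vocabulary key without splitting it, almost no token can match. `load_vocab` and its `split_fields` flag are hypothetical and exist only to show the difference.

```python
def load_vocab(lines, split_fields=True):
    """Collect the vocabulary from text-format embedding lines: 'word v1 v2 ...'."""
    vocab = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Without the split, the whole line (word plus vector) becomes the key,
        # so a lookup by the word alone never succeeds.
        vocab.add(line.split()[0] if split_fields else line)
    return vocab


# Hypothetical toy embedding lines and tokens.
lines = ["patient 0.12 -0.40 0.33", "dose 0.05 0.91 -0.27"]
tokens = ["patient", "dose", "Tasmar"]

buggy = load_vocab(lines, split_fields=False)
fixed = load_vocab(lines)

unknown_buggy = sum(tok not in buggy for tok in tokens)  # 3: every token is "unknown"
unknown_fixed = sum(tok not in fixed for tok in tokens)  # 1: only 'Tasmar'
print(unknown_buggy, unknown_fixed)
```

A slip of this kind would explain why the first counts (around 6000 of 6890) were so much higher than the corrected ones.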

mcavdar commented 6 years ago

I trained a word embedding on wikifr. Here is the NeuroNER F1 plot: f1_conll_vs_epoch_for_all_classes.pdf

Old NeuroNER F1 plot with the small word embedding: f1_conll_vs_epoch_for_all_classes.pdf

The F1 score still needs to improve, but the unknown-words problem is solved!