Closed by mcavdar 6 years ago
Some statistics: number of unknown words, counted only in the annotated files (train + test + dev):
With the big French word embedding: total word count (from ann files): 6890, unknown word count: 6071
With Lionel's word embedding: total word count (from ann files): 6890, unknown word count: 6079
Script used: wordembed-fre.txt
Result for wordembeddingsEM.txt (the last one):
total word count (from ann files): 6890, unknown word count: 6105
I trained a new word embedding (a small one) using only PDFs from MedlinePlus. [Vocab size: 10732, words in train file: 309237]
total word count (from ann files): 6890, unknown word count: 5794
I made a mistake, sorry. I forgot to apply split after reading each word, which is why the results were so bad. New results:
with wordembeddingcorpus.txt: total word count (from ann files): 6890, unknown word count: 2794
with wordembeddingsEM.txt: total word count (from ann files): 6890, unknown word count: 2927
with medlineplus+wikifr.txt: total word count (from ann files): 6890, unknown word count: 241
So 241 unknown words (about 3.5% of 6890) is good enough. Most of them are brand names or proper names such as 'Refludan' or 'Tasmar'.
New count script: wordembed-fre.txt
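For readers without access to the attached script, here is a minimal sketch of how such a count could be done. It is not the wordembed-fre.txt script itself. It assumes the embedding is a text file with one word per line followed by its vector, that the ann files are brat-style (tab-separated, with the covered text in the third field of `T` lines), and that matching is case-insensitive. The function names `load_vocab` and `count_unknown` are hypothetical.

```python
def load_vocab(embedding_lines):
    """Collect the first whitespace-separated field of each embedding line."""
    vocab = set()
    for line in embedding_lines:
        parts = line.split()
        # Skip blank lines and a possible "<count> <dim>" header line (assumption).
        if not parts or (len(parts) == 2 and all(p.isdigit() for p in parts)):
            continue
        vocab.add(parts[0].lower())
    return vocab


def count_unknown(ann_lines, vocab):
    """Return (total, unknown) token counts over brat-style annotation lines."""
    total = unknown = 0
    for line in ann_lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: T1<TAB>Type start end<TAB>covered text
        if len(fields) < 3 or not fields[0].startswith("T"):
            continue
        # The split step: without it, a multi-word span such as "le patient"
        # is looked up as one string and is always counted as unknown.
        for token in fields[2].split():
            total += 1
            if token.lower() not in vocab:
                unknown += 1
    return total, unknown


# Tiny made-up example, not data from this thread.
embedding_lines = ["le 0.1 0.2", "patient 0.3 0.4", "prend 0.5 0.6"]
ann_lines = [
    "T1\tDrug 0 8\tRefludan",
    "T2\tPerson 9 19\tle patient",
    "R1\tRel Arg1:T1 Arg2:T2",
]
total, unknown = count_unknown(ann_lines, load_vocab(embedding_lines))
print(f"total word count: {total} unk word count: {unknown}")
```

In this toy input only the brand name 'Refludan' is missing from the vocabulary, which mirrors the kind of leftover unknown words reported above.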
I trained a word embedding with wikifr. Here is the NeuroNER F1 plot: f1_conll_vs_epoch_for_all_classes.pdf
Old NeuroNER F1 plot with the small word embedding: f1_conll_vs_epoch_for_all_classes.pdf
The F1 score still needs to improve, but the unknown words problem is solved!
Big problem. Almost 25% of the words are unknown:
with Lionel's embedding:
The big French word embedding contains 39393 words: http://embeddings.org/frWiki_no_phrase_no_postag_700_cbow_cut100.bin