murthyrudra / NeuralNER

Implementation of Multilingual Neural NER
GNU General Public License v3.0

Difference in evaluation scores #9

Closed samarohith closed 2 years ago

samarohith commented 4 years ago

I ran the code for Spanish and after 20 epochs it reported a test accuracy of 90.8% and an F-score of 55%. Then I downloaded the annotated test file (after 19 epochs) and evaluated it using sklearn's f1-score. This time I got an f1-score of 32%, even though the accuracy is the same (90.8%). Why is there a difference in f1-scores between them?

murthyrudra commented 4 years ago

Hi, the conlleval script calculates the F-score at the phrase level, not at the word level. This means that for a named entity phrase, say _A B_ with gold tags B-Loc I-Loc, if the model tags it as B-Loc B-Loc, the whole phrase is counted as incorrect. But sklearn's f1-score is computed at the word level: in the same example, sklearn considers one word tagged correctly and the other tagged incorrectly.

You can get a word-level F-score from the conlleval script by using the -r option. This should give the same score as sklearn.
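To make the difference concrete, here is a minimal sketch (the tag sequences are made up for illustration) of the word-level score sklearn computes for the example above:

```python
from sklearn.metrics import f1_score

# Gold and predicted tags for a two-word Loc entity (illustrative example).
gold = ["B-Loc", "I-Loc"]
pred = ["B-Loc", "B-Loc"]

# Word-level (micro-averaged) F1: one of the two tokens is correct -> 0.5.
print(f1_score(gold, pred, average="micro"))

# Phrase-level evaluation (conlleval's default) counts the Loc entity
# spanning both words as wrong, so its F-score for this pair is 0.
```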

samarohith commented 4 years ago

Ok, understood. But in that case, shouldn't sklearn's f-score be higher than conlleval's?

murthyrudra commented 4 years ago

Yes, even I am surprised. Also, could you post the hyper-parameters you are using? The F-score is too low. You are using the CoNLL 2002 Spanish NER dataset, right?

samarohith commented 4 years ago

Yes, I am using the same dataset, with the CoNLL 2003 English dataset as the assisting language. Hyperparameters:
num_epochs = 20
batch_size = 1
hidden_size = 300
num_filters = 15
min_filter_width = 1
max_filter_width = 9
learning_rate = 0.4
momentum = 0.01 * learning_rate
decay_rate = 0.1
gamma = 0.0
beta = 0.1
schedule = 1
use_gpu = 1
ner_tag_field_l1 = 1
ner_tag_field_l2 = 3

murthyrudra commented 4 years ago

Hi, are you using any pre-trained embeddings?

samarohith commented 4 years ago

Yes, I am using the spectral embeddings mentioned in the README.

murthyrudra commented 4 years ago

The hyper-parameters are reasonable, but the F-score is pretty low. Are you running the NeuralNERYang model or the NeuralNERAllShared version?

samarohith commented 4 years ago

Sorry, my bad. It wasn't the NeuralNERAllShared version; I was actually experimenting with a different architecture.

samarohith commented 4 years ago

Can I run the same experiments using fastText embeddings? I tried the Spanish fastText embeddings, but it produces an error:

File "\NeuralNERYang\utilsLocal.py", line 19, in load_embeddings vocabulary, wv = zip(*[line.strip().split(' ', 1) for line in f_in])

ValueError: not enough values to unpack (expected 2, got 1)

murthyrudra commented 4 years ago

Hi, yes, you can use any pre-trained embeddings. The function expects every word to be on its own line along with its embedding, with space as the delimiter. In one of the lines, splitting on space is returning only one token.
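For reference, each line of the embeddings file is expected to look like this (word first, then the space-separated vector values; the numbers here are made up):

```
perro 0.118 -0.024 0.335 0.071 ...
```

An empty line, or a line containing only a word with no vector, makes `split(' ', 1)` return a single element, which is what triggers the unpacking error above.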

samarohith commented 4 years ago

So what do you suggest I do? Shall I delete the few lines which cause the error?

murthyrudra commented 4 years ago

Hi, please remove such lines from the embeddings file.
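A minimal sketch of that clean-up (the file names are placeholders; adjust them to your embeddings):

```python
# Keep only lines that contain both a word and at least one vector value.
with open("wiki.es.vec", encoding="utf-8") as fin, \
        open("wiki.es.filtered.vec", "w", encoding="utf-8") as fout:
    for line in fin:
        parts = line.rstrip("\n").split(" ", 1)
        if len(parts) == 2 and parts[1].strip():
            fout.write(line)
```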

samarohith commented 4 years ago

I tried the same for the Telugu fastText embeddings. I corrected the errors, but now the problem is with np.loadtxt: because there are too many words in the vocabulary, it uses up too much memory and my PC freezes. Can you suggest a better way?

samarohith commented 4 years ago

Also, I am not sure which f-score metric I should use, micro or macro?

murthyrudra commented 4 years ago

Hi, regarding np.loadtxt: if you look at my word embedding loading code, load_embeddings() in utilsLocal.py, it reads the file line by line and populates the embedding matrix. This gives you more flexibility while loading the embeddings. You could also restrict the number of words in the vocabulary in my code (not yet implemented, but it could be done).
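A minimal sketch of that idea, assuming the same word-per-line format (the function name and the max_words cap are mine, not part of the repo):

```python
import numpy as np

def load_embeddings_capped(path, max_words=200000):
    """Read embeddings line by line, keeping at most max_words words,
    so the full fastText vocabulary never has to sit in memory at once."""
    vocabulary, vectors, dim = [], [], None
    with open(path, encoding="utf-8") as f_in:
        for line in f_in:
            parts = line.rstrip().split(" ", 1)
            if len(parts) != 2:
                continue  # skip empty or malformed lines
            word, vec = parts
            values = vec.split()
            if dim is None and len(values) <= 1:
                continue  # skip the fastText header line ("num_words dim")
            if dim is None:
                dim = len(values)
            if len(values) != dim:
                continue  # skip lines whose vector length does not match
            vocabulary.append(word)
            vectors.append(np.asarray(values, dtype=np.float32))
            if len(vocabulary) >= max_words:
                break
    return vocabulary, np.vstack(vectors)
```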

And regarding the f-score metric, I would suggest using the conlleval.py script to calculate the F-score; it is the standard metric everyone uses to report results. Regarding micro vs macro, the answer in this post, Micro-Average vs Macro-Average, explains it better. The short answer is that the choice depends on the distribution of examples across the different classes in your test data.
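As a toy illustration of that dependence (the labels below are made up; sklearn may warn that precision for the missed class is ill-defined and treat it as 0):

```python
from sklearn.metrics import f1_score

# One frequent class ("O") and one rare class ("LOC") that the model misses entirely.
y_true = ["O", "O", "O", "O", "LOC"]
y_pred = ["O", "O", "O", "O", "O"]

print(f1_score(y_true, y_pred, average="micro"))  # 0.8, dominated by the frequent class
print(f1_score(y_true, y_pred, average="macro"))  # ~0.44, the missed rare class pulls it down
```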

samarohith commented 4 years ago

I have a dataset which has 9 different tags and doesn't follow the BIO scheme, so I cannot use the conlleval script for it. What would you suggest I do?

murthyrudra commented 4 years ago

You can pass the -r argument when running the conlleval script. The -r argument ignores the BIO scheme and calculates the F-score at the word level instead of the phrase level.
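For example, assuming the annotated output file uses the usual conlleval format (one token per line, with the gold tag and the predicted tag in the last two columns) and that the script reads from standard input like the original Perl conlleval, the call would look something like this (the file name is just a placeholder):

```
python conlleval.py -r < predictions_test.txt
```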