Closed: samarohith closed this issue 2 years ago
Hi, the conlleval script calculates the F-score at the phrase level, not at the word level. This means that for a named-entity phrase, say `A/B-Loc B/I-Loc`, a model output of `A/B-Loc B/B-Loc` is counted as entirely incorrect. sklearn's f1-score, however, works at the word level: in this example, sklearn counts one word as tagged correctly and the other as tagged incorrectly.
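To make the difference concrete, here is a small pure-Python sketch scoring that example both ways. The helper names are illustrative (not the repo's or conlleval's actual code), and the word-level score shown is the micro-averaged one:

```python
def extract_chunks(tags):
    """Collect (type, start, end) spans from a BIO tag sequence."""
    chunks, start, ctype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # trailing "O" flushes the last chunk
        boundary = tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != ctype)
        if boundary and start is not None:
            chunks.append((ctype, start, i - 1))
            start, ctype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ctype = i, tag[2:]
    return chunks

def f_score(n_correct, n_pred, n_gold):
    p = n_correct / n_pred if n_pred else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ["B-Loc", "I-Loc"]   # one two-word Loc phrase
pred = ["B-Loc", "B-Loc"]   # model splits it into two one-word phrases

# Phrase level (conlleval style): a chunk counts only if type and span match exactly.
gold_chunks, pred_chunks = extract_chunks(gold), extract_chunks(pred)
phrase_f1 = f_score(len(set(gold_chunks) & set(pred_chunks)),
                    len(pred_chunks), len(gold_chunks))

# Word level (sklearn style, micro): each tag is scored on its own.
word_f1 = f_score(sum(g == p for g, p in zip(gold, pred)), len(pred), len(gold))

print(phrase_f1, word_f1)  # 0.0 vs 0.5
```

The phrase-level score is 0 because neither predicted one-word chunk matches the gold two-word chunk, while the word-level score credits the first tag.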
You can get a word-level F-score from the conlleval script by passing the -r option. This should give the same score as sklearn.
Ok, understood. But in that case, shouldn't sklearn's F-score be higher than conlleval's?
Yes, I am surprised as well. Also, could you post the hyper-parameters you are using? The F-score is too low. You are using the CoNLL 2002 Spanish NER dataset, right?
Yes, I am using the same dataset, with the CoNLL 2003 English dataset as the assisting language. Hyperparameters:
num_epochs = 20
batch_size = 1
hidden_size = 300
num_filters = 15
min_filter_width = 1
max_filter_width = 9
learning_rate = 0.4
momentum = 0.01 * learning_rate
decay_rate = 0.1
gamma = 0.0
beta = 0.1
schedule = 1
use_gpu = 1
ner_tag_field_l1 = 1
ner_tag_field_l2 = 3
Hi, are you using any pre-trained embeddings?
Yeah, I am using the spectral embeddings mentioned in the README.
The hyper-parameters are reasonable, but the F-score is pretty low. Are you running the NeuralNERYang model or the NeuralNERAllShared version?
Sorry, my bad. It wasn't the NeuralNERAllShared version; I was actually experimenting with a different architecture.
Can I run the same experiments using fastText embeddings? I tried the Spanish fastText embeddings, but it produces an error:
File "\NeuralNERYang\utilsLocal.py", line 19, in load_embeddings
    vocabulary, wv = zip(*[line.strip().split(' ', 1) for line in f_in])
ValueError: not enough values to unpack (expected 2, got 1)
Hi, yes, you can use any pre-trained embeddings. The function expects every word to be on its own line together with its embedding, with space as the delimiter. In one of the lines, splitting on space yields only one token.
So what do you suggest I do? Shall I delete the few lines that cause the error?
Hi, please remove such lines from the embeddings file.
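If it helps, here is an illustrative cleanup sketch that drops such malformed lines before loading. The function name and file paths are made up, not part of the repo:

```python
def clean_embeddings(in_path, out_path):
    """Copy an embeddings file, dropping lines that don't split into 'word vector...'."""
    kept, dropped = 0, 0
    with open(in_path, encoding="utf-8") as f_in, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            parts = line.rstrip("\n").split(" ", 1)
            # A valid line has a word plus at least one vector component.
            if len(parts) == 2 and parts[1].strip():
                f_out.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Note that fastText `.vec` files also start with a "count dim" header line; that line splits into two tokens, so it would survive this filter and may need to be removed separately.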
I tried the same for the fastText embeddings for Telugu. I fixed those errors, but now the problem is with np.loadtxt: because there are too many words in the vocabulary, it uses up too much memory and my PC freezes. Can you suggest a better way?
Also, I have a doubt about which F-score metric I should use: micro or macro?
Hi, regarding np.loadtxt: if you look at my word-embedding loading code, load_embeddings() in utilsLocal.py, it reads the file line by line and populates the embedding matrix. This gives you more flexibility while loading the embeddings. A restriction on the number of words in the vocabulary could also be specified in my code (not yet implemented, but it could be done).
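As a rough illustration of what such a restriction could look like, here is a line-by-line loader sketch. The function name and the `max_vocab` parameter are hypothetical, not the repo's actual load_embeddings():

```python
def load_embeddings_capped(path, max_vocab=None):
    """Stream an embeddings file; `max_vocab` caps the vocabulary size."""
    vocab, vectors, dim = [], [], None
    with open(path, encoding="utf-8") as f_in:
        for line in f_in:
            if max_vocab is not None and len(vocab) >= max_vocab:
                break                          # stop early instead of loading everything
            parts = line.rstrip().split(" ")
            if len(parts) < 2:                 # malformed line, skip it
                continue
            word, values = parts[0], parts[1:]
            try:
                vec = [float(v) for v in values]
            except ValueError:                 # non-numeric vector entry, skip it
                continue
            if dim is None and len(vec) > 1:   # first real vector fixes the dimension
                dim = len(vec)
            if dim is None or len(vec) != dim: # also skips a fastText "count dim" header
                continue
            vocab.append(word)
            vectors.append(vec)
    return vocab, vectors
```

Because it stops reading once the cap is reached, memory stays bounded even for a multi-million-word `.vec` file.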
And regarding the F-score metric, I would suggest you use the conlleval.py script to calculate the F-score; this is the standard metric everyone uses to report results. Regarding micro vs. macro, the answer in the post "Micro-Average vs Macro-Average" explains it well. In summary, the choice depends on the distribution of examples across the different classes in your test data.
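To illustrate that class-distribution dependence on a toy imbalanced example (a pure-Python word-level sketch, not the conlleval metric):

```python
def per_class_f1(gold, pred):
    """F1 for each label, treating tags as plain word-level classes."""
    scores = {}
    for lab in sorted(set(gold) | set(pred)):
        tp = sum(g == p == lab for g, p in zip(gold, pred))
        fp = sum(p == lab and g != lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[lab] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

# Toy imbalanced data: the model never predicts the rare LOC class.
gold = ["O", "O", "O", "O", "LOC"]
pred = ["O", "O", "O", "O", "O"]

scores = per_class_f1(gold, pred)
macro_f1 = sum(scores.values()) / len(scores)   # every class weighted equally
# For single-label tagging, micro-F1 equals plain accuracy.
micro_f1 = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(round(macro_f1, 3), round(micro_f1, 3))   # macro ~0.444, micro 0.8
```

Macro-F1 punishes the missed rare class heavily, while micro-F1 is dominated by the frequent O tag; that is why the choice matters for skewed NER label distributions.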
I have a dataset with 9 different tags that doesn't follow the BIO scheme, so I cannot use the conlleval script on it. What would you suggest I do?
You can pass the -r argument while running the conlleval script. The -r argument ignores the BIO scheme and calculates the F-score at the word level instead of the phrase level.
I ran the code for Spanish and after 20 epochs it reported a test accuracy of 90.8% and an F-score of 55%. I then downloaded the annotated test file (after 19 epochs) and evaluated it with sklearn's f1-score, but this time I got an F-score of 32%, even though the accuracy is the same (90.8%). Why is there a difference between the two F-scores?