neuralmind-ai / portuguese-bert

Portuguese pre-trained BERT models

Evaluation inconsistency #25

Closed alcidesmig closed 3 years ago

alcidesmig commented 3 years ago

Hi, all!

I'm using your code from ./ner_evaluation to evaluate my NER model, which was trained with your code. I'm having trouble with a mismatch between the size of the dataset at input time and after the evaluation. Exploring the dataset, I can see that I have 75 B-TAG tokens (using the BIO scheme), where TAG is one of my tags, and 1 I-TAG. After the evaluation, I got:

              precision    recall  f1-score   support
 TAG             0.7547    0.7692    0.7619        61

The support is 61 != 75. I printed out the confusion matrix, and the sum of the rows for TAG (B-TAG + I-TAG) doesn't match the expected count (62 != 75).
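For reference, this is roughly how I got the 75 count (a minimal sketch; the file path and the CoNLL-style "token tag" line format are assumptions about my data, not necessarily the repo's loader):

from collections import Counter

# Count BIO tags in a CoNLL-style file: one "token tag" pair per line,
# blank lines separating sentences. Path and format are assumptions.
def count_tags(path):
    counts = Counter()
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            if parts:
                counts[parts[-1]] += 1
    return counts

print(count_tags('dataset.txt')['B-TAG'])  # -> 75 in my data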

Does anyone know what it could be? I'm investigating the tokenizer, but I don't think that's it.

Thank you for your attention! :)

alcidesmig commented 3 years ago

I'm still investigating... in preprocessing.read_examples, if I count the occurrences of TAG like this:

elif scheme == 'BIO':
    # BIO scheme: the first token of an entity gets B-, the rest get I-
    for token_index in range(start_token, end_token + 1):
        if token_index == start_token:
            tag = 'B-' + entity_type
            if entity_type == TAG:
                cont += 1  # count every B-TAG assignment
        else:
            tag = 'I-' + entity_type
        set_label(token_index, tag)

The final value of cont is correct (75), but when I recount over the reconstructed data by putting this right above the return:

distrib = {}
for example in examples:
    for label in example.labels:
        if label not in distrib:
            distrib[label] = 0
        distrib[label] += 1

I got distrib[TAG] = 61, where TAG here stands for 'B-TAG'.
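(The same count can be written more compactly with collections.Counter, assuming examples and .labels as above:)

from collections import Counter

# Equivalent count over the reconstructed examples
distrib = Counter(label for example in examples for label in example.labels)
print(distrib['B-TAG'])  # -> 61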

alcidesmig commented 3 years ago

I discovered it was an error in my dataset (two or more identical entities annotated on the same tokens). Thank you for your attention!
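For anyone hitting the same mismatch, here is a minimal sketch of the failure mode (the tokens, spans, and labeling loop below are illustrative, not the repo's actual read_examples code): when two identical annotations cover the same tokens, the later label assignment overwrites the earlier one, so only one B-TAG survives in the reconstructed labels even though the counter was incremented twice.

tokens = ['Maria', 'foi', 'a', 'Lisboa']
labels = ['O'] * len(tokens)

# Two annotations for the SAME span (the dataset error): both cover token 3.
entities = [('LOC', 3, 3), ('LOC', 3, 3)]

cont = 0
for entity_type, start, end in entities:
    for i in range(start, end + 1):
        if i == start:
            labels[i] = 'B-' + entity_type
            cont += 1  # counted twice while labeling...
        else:
            labels[i] = 'I-' + entity_type

print(cont)                   # -> 2
print(labels.count('B-LOC'))  # -> 1 (...but only one label survives)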