neuralmind-ai / portuguese-bert

Portuguese pre-trained BERT models

Error training for instances with only numbers. #46

Open romualdoalan opened 11 months ago

romualdoalan commented 11 months ago

I found an error in the code related to an output-length issue in the get_example_output function in postprocessing.py. The specific error is an AssertionError raised when the code verifies whether the length of the output (complete_output) matches the number of document tokens, for an example that contains only numbers.

The assertion fails only with instances like this one:

{
    "doc_id": "TEST-205",
    "doc_text": "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960",
    "entities": [
      {
        "entity_id": 0,
        "text": "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960",
        "label": "NUMEROS_OUTROS",
        "start_offset": 0,
        "end_offset": 54
      }
    ]
  }
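For reference, a plain whitespace split of this doc_text yields 11 tokens, which matches the length of doc_tokens that the failing assertion expects (a minimal illustration, not the repo's actual tokenizer, which also splits at punctuation; for this all-digit text the result is the same):

```python
# Whitespace tokenization of the failing doc_text.
doc_text = "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960"
tokens = doc_text.split()
print(len(tokens))  # 11, the expected length of complete_output
```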

Maybe you can give me some insight. Thank you.

The error:

File "D:\Anonimização\NER\postprocessing.py", line 157, in get_example_output
    assert len(complete_output) == len(self.examples[example_ix].doc_tokens), \
AssertionError: Length mismatch for example 169: [ 0  0  0  3  4  4  4  4  9 10 10 10 10 10  9 10 10 10 10 10] !=
             11 in example 169:

doc_id: TEST-205
orig_text:3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960
doc_tokens: [Token(text='3123', offset=0, index=0, tail=' ', tag=None), Token(text='0346', offset=5, index=1, tail=' ', tag=None), Token(text='2154', offset=10, index=2, tail=' ', tag=None), Token(text='8600', offset=15, index=3, tail=' ', tag=None), Token(text='0186', offset=20, index=4, tail=' ', tag=None), Token(text='5500', offset=25, index=5, tail=' ', tag=None), Token(text='1000', offset=30, index=6, tail=' ', tag=None), Token(text='0001', offset=35, index=7, tail=' ', tag=None), Token(text='6015', offset=40, index=8, tail=' ', tag=None), Token(text='3585', offset=45, index=9, tail=' ', tag=None), Token(text='0960', offset=50, index=10, tail='', tag=None)]

labels: ['B-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS']

tags: [NETag(doc_id='HAREM-205', entity_id=0, text='3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960', type='NUMEROS_OUTROS', start_position=0, end_position=10)]

[array([ 0,  0,  0,  3,  4,  4,  4,  4,  9, 10, 10, 10, 10, 10,  9, 10, 10, 10, 10, 10])]

@fabiocapsouza @rodrigonogueira4

fabiocapsouza commented 11 months ago

Hi @romualdoalan, could you provide a script or a notebook to reproduce this error, please? Also, did you make any modifications to the postprocessing functions? I ask because the error message you posted mentions line 157 of postprocessing.py, but the original file has only 146 lines.

romualdoalan commented 11 months ago

Hello @fabiocapsouza, I cloned a fresh copy of the repo on my machine to reproduce these errors, and the behavior changed.

When I train with all of my data, I have problems with this type of instance:

{
  "doc_id": "TEST-645",
  "doc_text": "Whatsapp11987654321",
  "entities": [
    {
      "entity_id": 0,
      "text": "11987654321",
      "label": "TELEFONE_CELULAR",
      "start_offset": 8,
      "end_offset": 19
    }
  ]
}

The logged error that I receive is:

Traceback (most recent call last):
  File "G:\Projetos\ner\portuguese-bert\ner_evaluation\run_bert_harem.py", line 129, in <module>
    main(load_and_cache_examples,
  File "G:\Projetos\ner\portuguese-bert\ner_evaluation\trainer.py", line 686, in main
    train_dataset, train_examples, train_features = load_and_cache_examples_fn(
  File "G:\Projetos\ner\portuguese-bert\ner_evaluation\run_bert_harem.py", line 63, in load_and_cache_examples
    examples = read_examples(
  File "G:\Projetos\ner\portuguese-bert\ner_evaluation\preprocessing.py", line 133, in read_examples
    assert entity_text.strip() == reconstructed_text, \
AssertionError: Entity text and reconstructed text are not equal: 11987654321 != Whatsapp11987654321

I commented out this assert and voilà! It works, but then it breaks on instances like the first one mentioned in this issue:

 {
    "doc_id": "HAREM-359",
    "doc_text": "35230143876569000128550010000000111202621491",
    "entities": [
      {
        "entity_id": 0,
        "text": "35230143876569000128550010000000111202621491",
        "label": "NUMEROS_OUTROS",
        "start_offset": 0,
        "end_offset": 44
      }
    ]
  },

and I get this error:

Traceback (most recent call last):
  File "G:\Projetos\ner\portuguese-bert\ner_evaluation\run_bert_harem.py", line 129, in <module>
    main(load_and_cache_examples,
  File "G:\Projetos\ner\portuguese-bert\ner_evaluation\trainer.py", line 737, in main
    train(
  File "G:\Projetos\ner\portuguese-bert\ner_evaluation\trainer.py", line 286, in train
    trn_epoch_metrics = evaluate(
  File "G:\Projetos\ner\portuguese-bert\ner_evaluation\trainer.py", line 427, in evaluate
    y_pred = output_composer.get_outputs()
  File "G:\Projetos\ner\portuguese-bert\ner_evaluation\postprocessing.py", line 143, in get_outputs      
    example_output = self.get_example_output(example_ix)
  File "G:\Projetos\ner\portuguese-bert\ner_evaluation\postprocessing.py", line 129, in get_example_output
    complete_output = concatenate(example_partial_outputs)
  File "G:\Projetos\ner\portuguese-bert\ner_evaluation\postprocessing.py", line 63, in concatenate       
    if isinstance(list_tensors[0], np.ndarray):
IndexError: list index out of range

I couldn't put together a notebook for you, but I attached the files ("dataset-traina.json" and "classes-total.txt"), which I run with the command below in the same repository.

python run_bert_harem.py ^
    --bert_model models/distilbert-base-multilingual-cased ^
    --labels_file data/classes-total.txt ^
    --do_train ^
    --train_file data/dataset-traina.json ^
    --valid_file data/dataset-traina.json ^
    --num_train_epochs 1 ^
    --per_gpu_train_batch_size 2 ^
    --gradient_accumulation_steps 8 ^
    --do_eval ^
    --eval_file data/dataset-traina.json ^
    --output_dir ner_models/distilbert_1_epoch

(--bert_model can be any other model; distilbert was used here just for testing.)

When I remove the second example, the training works. 🥵

Any idea what it could be? I tried changing several things, but I don't know why this happens even in the default repo.

I appreciate your time. Thank you!

data-example.zip

fabiocapsouza commented 11 months ago

Hi @romualdoalan, this entity

{
  "doc_id": "TEST-645",
  "doc_text": "Whatsapp11987654321",
  "entities": [
    {
      "entity_id": 0,
      "text": "11987654321",
      "label": "TELEFONE_CELULAR",
      "start_offset": 8,
      "end_offset": 19
    }
  ]
}

probably fails because this implementation of NER is not able to handle entities that span less than one whole "word". The reason is that the tokenization process first splits the text into "words" at whitespace and punctuation, and then applies WordPiece tokenization to each word. In this example there is only one word, Whatsapp11987654321, and it will be tokenized into several wordpieces, e.g. What ##s ##app ##11 ##987 ##65 ##4321 (just an example). By design, BERT predicts entity labels only for the first wordpiece of each word (in this case What) and ignores all inner wordpieces (those starting with ##). So the assertions are probably catching this issue.
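The alignment constraint described above can be sketched as follows. This is a simplified illustration, not the repo's actual splitter: it treats runs of word characters and single punctuation marks as "words" and checks whether an entity's offsets fall on word boundaries, which is why an entity starting mid-word (like "11987654321" inside "Whatsapp11987654321") cannot be represented:

```python
import re

def split_words(text):
    """Split text into (word, start_offset) pairs at whitespace and punctuation."""
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def entity_aligns(text, start, end):
    """An entity is representable only if start and end fall on word boundaries."""
    boundaries = set()
    for word, off in split_words(text):
        boundaries.add(off)
        boundaries.add(off + len(word))
    return start in boundaries and end in boundaries

print(entity_aligns("Whatsapp11987654321", 8, 19))  # False: entity starts mid-word
print(entity_aligns("Ligue 11987654321", 6, 17))    # True: entity is a whole word
```

Note that `\w` matches both letters and digits, so "Whatsapp11987654321" is a single word; inserting a space before the phone number would make the entity align.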

For the second problem with the long number, it could be an edge case caused by the entity comprising the whole doc_text. Just guessing, but you could try adding some more text to doc_text to see if the problem goes away.
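If that guess is right, a quick check like the following (a hedged sketch; field names follow the JSON examples in this thread) could flag the suspect examples so they can be filtered out or padded with extra context before training:

```python
def spans_whole_doc(example):
    """True if any entity covers the entire doc_text of the example."""
    text = example["doc_text"]
    return any(e["start_offset"] == 0 and e["end_offset"] >= len(text)
               for e in example["entities"])

ex = {"doc_id": "HAREM-359",
      "doc_text": "35230143876569000128550010000000111202621491",
      "entities": [{"entity_id": 0,
                    "text": "35230143876569000128550010000000111202621491",
                    "label": "NUMEROS_OUTROS",
                    "start_offset": 0, "end_offset": 44}]}
print(spans_whole_doc(ex))  # True: the entity covers the whole document
```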

romualdoalan commented 11 months ago

I had to move on despite these difficulties. I will keep looking for solutions and post them here if I find any. Something was broken, but I managed to continue without these samples. Thank you very much for your time!