urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
https://arxiv.org/abs/2311.08526
Apache License 2.0

Train on data without entities #139

Open AnnaKholkina opened 3 months ago

AnnaKholkina commented 3 months ago

Hi. I want to fine-tune a model on data in which some examples contain no entities (to reduce false positives). I tried adding examples like this to the dataset: {'tokenized_text': ['In', 'this', 'year', '.'], 'ner': []}, and I get this error:

Traceback (most recent call last):
  File "/home/jovyan/work/dev/ner/gliner/GLiNER/examples/finetuning/finetune-balanced-remove-short-orgs-empty-ner.py", line 59, in <module>
    trainer.train(num_epochs=25)
  File "/home/jovyan/work/dev/ner/gliner/GLiNER/examples/finetuning/trainer.py", line 213, in train
    total_loss = self.model(batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/gliner/model.py", line 141, in forward
    logits_label = scores.view(-1, num_classes)
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous

I also tried this format: {'tokenized_text': ['In', 'this', 'year', '.'], 'ner': [[]]}, which gives a different error:

Traceback (most recent call last):
  File "/home/jovyan/work/dev/ner/gliner/GLiNER/examples/finetuning/finetune-balanced-remove-short-orgs-empty-ner.py", line 59, in <module>
    trainer.train(num_epochs=25)
  File "/home/jovyan/work/dev/ner/gliner/GLiNER/examples/finetuning/trainer.py", line 208, in train
    for batch_idx, batch in progress_bar:
  File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 464, in __iter__
    next_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 678, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.10/dist-packages/gliner/modules/data.py", line 83, in <lambda>
    return DataLoader(data, collate_fn=lambda x: self.collate_fn(x, entity_types), **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/gliner/modules/data.py", line 67, in collate_fn
    class_to_ids, id_to_classes = self.batch_generate_class_mappings(batch_list)
  File "/usr/local/lib/python3.10/dist-packages/gliner/modules/data.py", line 42, in batch_generate_class_mappings
    negs = self.get_negatives(batch_list, 100)
  File "/usr/local/lib/python3.10/dist-packages/gliner/modules/data.py", line 34, in get_negatives
    types = set([el[-1] for el in b['ner']])
  File "/usr/local/lib/python3.10/dist-packages/gliner/modules/data.py", line 34, in <listcomp>
    types = set([el[-1] for el in b['ner']])
IndexError: list index out of range

Is there any way to fix this?

urchade commented 3 months ago

You cannot train the model without any entity types. The model needs entity types to compute the matching scores.
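To see why both formats fail, here is a minimal sketch in plain Python. It only mirrors the single line shown in the second traceback (`set([el[-1] for el in b['ner']])`); the helper name `collect_types` is hypothetical and the rest of the library's batching logic is not reproduced.

```python
def collect_types(batch):
    """Gather entity types as the traceback line does: the last element of each span."""
    types = set()
    for example in batch:
        types.update(span[-1] for span in example["ner"])
    return types


# The two formats tried in the question.
empty_ner = [{"tokenized_text": ["In", "this", "year", "."], "ner": []}]
nested_empty = [{"tokenized_text": ["In", "this", "year", "."], "ner": [[]]}]

# 'ner': [] yields no types at all, so there would be zero classes to score,
# which matches the [-1, 0] shape in the first error.
print(len(collect_types(empty_ner)))

# 'ner': [[]] contains one empty span, and indexing [-1] on an empty list fails,
# which matches the IndexError in the second error.
try:
    collect_types(nested_empty)
except IndexError as exc:
    print("IndexError:", exc)
```

So `'ner': [[]]` is never a valid way to say "no entities", while `'ner': []` is fine as long as entity types come from somewhere else.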

You can predefine the list of labels under the key "label" when the list of named entities is empty:

{'tokenized_text': ['In', 'this', 'year', '.'], 'ner': [], 'label': ["person", "org"]}
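For a whole dataset, this can be applied as a preprocessing step. The sketch below is an illustration, not part of the library: `add_fallback_labels` is a hypothetical helper, and it assumes each span stores its entity type as the last element (as the traceback suggests), with `[start, end, type]` used here only as example data.

```python
def add_fallback_labels(dataset, fallback_labels=None):
    """Return a copy of the dataset where entity-free examples carry a 'label' list.

    If fallback_labels is None, the labels are taken from the entity types
    that occur elsewhere in the dataset (assumed to be the last item of each span).
    """
    if fallback_labels is None:
        fallback_labels = sorted(
            {span[-1] for example in dataset for span in example["ner"]}
        )
    if not fallback_labels:
        raise ValueError("No entity types available; pass fallback_labels explicitly.")

    patched = []
    for example in dataset:
        example = dict(example)  # shallow copy, leaves the input untouched
        if not example["ner"] and "label" not in example:
            example["label"] = list(fallback_labels)
        patched.append(example)
    return patched


# Hypothetical data: one annotated example and one without entities.
data = [
    {"tokenized_text": ["Anna", "works", "at", "Acme", "."],
     "ner": [[0, 0, "person"], [3, 3, "org"]]},
    {"tokenized_text": ["In", "this", "year", "."], "ner": []},
]

patched = add_fallback_labels(data)
print(patched[1])
```

Examples that already have entities are left unchanged; only the empty ones receive the "label" key.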