urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
https://arxiv.org/abs/2311.08526
Apache License 2.0
1.48k stars · 127 forks

"IndexError: too many indices for tensor of dimension 1" error when text is empty #63

Closed · tom-ph closed this 7 months ago

tom-ph commented 7 months ago

Hi, thank you for the great work on this model; it is really appreciated. We are using GLiNER version 0.1.7 with the urchade/gliner_multi-v2.1 model. However, when the text is empty or contains only whitespace, the model fails with the following error:

IndexError: too many indices for tensor of dimension 1

This is especially a problem when running the batch_predict_entities method. We can work around the issue by filtering out texts without content before feeding them to the model (see the sketch below), but it would definitely be easier not to have to do that.
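
For reference, here is a minimal sketch of that filtering workaround (the labels and texts below are only illustrative, not the ones from our pipeline):

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")
labels = ["person", "organization", "location"]  # placeholder labels

texts = ["Barack Obama visited Paris.", "", "   ", "Apple hired a new CEO."]

# Keep only texts with actual content, remembering their original positions.
kept = [(i, t) for i, t in enumerate(texts) if t.strip()]
predictions = model.batch_predict_entities([t for _, t in kept], labels=labels)

# Empty or whitespace-only texts simply get an empty entity list.
entities = [[] for _ in texts]
for (i, _), preds in zip(kept, predictions):
    entities[i] = preds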

Thank you!

Below is the stack trace:

IndexError: too many indices for tensor of dimension 1
File <command-3871930245476660>, line 7
----> 7 df["entities"] = basic_cleaner.ner.batch_predict_entities(df[TARGET_COL].tolist(), labels=labels)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gliner/model.py:304, in GLiNER.batch_predict_entities(self, texts, labels, flat_ner, threshold)
    301     all_end_token_idx_to_text_idx.append(end_token_idx_to_text_idx)
    303 input_x = [{"tokenized_text": tk, "ner": None} for tk in all_tokens]
--> 304 x = self.collate_fn(input_x, labels)
    305 outputs = self.predict(x, flat_ner=flat_ner, threshold=threshold)
    307 all_entities = []
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gliner/modules/base.py:115, in InstructBase.collate_fn(self, batch_list, entity_types)
    113     class_to_ids = {k: v for v, k in enumerate(entity_types, start=1)}
    114     id_to_classes = {k: v for v, k in class_to_ids.items()}
--> 115     batch = [
    116         self.preprocess_spans(b["tokenized_text"], b["ner"], class_to_ids) for b in batch_list
    117     ]
    119 span_idx = pad_sequence(
    120     [b["span_idx"] for b in batch], batch_first=True, padding_value=0
    121 )
    123 span_label = pad_sequence(
    124     [el["span_label"] for el in batch], batch_first=True, padding_value=-1
    125 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gliner/modules/base.py:116, in <listcomp>(.0)
    113     class_to_ids = {k: v for v, k in enumerate(entity_types, start=1)}
    114     id_to_classes = {k: v for v, k in class_to_ids.items()}
    115     batch = [
--> 116         self.preprocess_spans(b["tokenized_text"], b["ner"], class_to_ids) for b in batch_list
    117     ]
    119 span_idx = pad_sequence(
    120     [b["span_idx"] for b in batch], batch_first=True, padding_value=0
    121 )
    123 span_label = pad_sequence(
    124     [el["span_label"] for el in batch], batch_first=True, padding_value=-1
    125 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gliner/modules/base.py:45, in InstructBase.preprocess_spans(self, tokens, ner, classes_to_id)
     42 spans_idx = torch.LongTensor(spans_idx)
     44 # mask for valid spans
---> 45 valid_span_mask = spans_idx[:, 1] > length - 1
     47 # mask invalid positions
     48 span_label = span_label.masked_fill(valid_span_mask, -1)
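
As far as we can tell from the traceback (this is only our reading of it, not confirmed), an empty text produces no candidate spans, so spans_idx becomes an empty 1-D tensor and the 2-D indexing in preprocess_spans fails:

import torch

spans_idx = torch.LongTensor([])  # no spans for an empty text -> 1-D tensor of shape (0,)
spans_idx[:, 1]                   # IndexError: too many indices for tensor of dimension 1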
urchade commented 7 months ago

Sorry for the late reply,

I will fix this in the next release

urchade commented 7 months ago

Fixed now

Definelymes commented 7 months ago

I tried the new version (0.1.9) with the fix you added, but when every text in a batch is empty or whitespace-only, there is still an error:

RuntimeError: shape '[16, 1, 12, 768]' is invalid for input of size 12288

The batch of texts that I've passed is:

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
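
A minimal reproduction sketch (the label list here is just a placeholder, since I did not include mine above):

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")
texts = [""] * 16  # a batch where every text is empty
model.batch_predict_entities(texts, labels=["person"])
# raises the RuntimeError shown above (the exact shape depends on the labels used)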

I know it's a corner case, but I just wanted to let you know about the issue.

urchade commented 7 months ago

Thank you. It should be ok now