urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
https://arxiv.org/abs/2311.08526
Apache License 2.0

What is the maximum size of text input that can be processed by the model? #76

Closed ErwanColombel92 closed 7 months ago

ErwanColombel92 commented 7 months ago

Hi, as the title says, I was wondering what the limit is in terms of characters (or tokens?). I never got any warning while passing in large portions of text, but I can see that not everything is taken into account...

Thanks !

urchade commented 7 months ago

It is limited to 384 words (around 512 subtokens)

urchade commented 7 months ago

You have to chunk your text, as the pretrained model I used (DeBERTa) has a limited context length
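A minimal sketch of the chunking the maintainer suggests: split a long text into word windows that fit under the 384-word limit, with a small overlap so entities that straddle a boundary are not silently dropped. The window and overlap sizes here are illustrative assumptions, not values prescribed by GLiNER.

```python
def chunk_words(text, max_words=384, overlap=32):
    """Split `text` into overlapping windows of at most `max_words` words."""
    words = text.split()
    step = max_words - overlap  # advance by window size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # last window already covers the end of the text
    return chunks

# Each chunk can then be fed to the model separately, e.g. (sketch, not run here):
#   model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")
#   entities = []
#   for chunk in chunk_words(long_text):
#       entities.extend(model.predict_entities(chunk, labels))
```

Note that overlapping windows can produce duplicate entities near chunk boundaries, so results may need de-duplication by span and label.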

ErwanColombel92 commented 7 months ago

Ok, thanks! It would be great if it raised a warning instead of silently truncating the text, if you want to improve the function!

Have a great day

urchade commented 7 months ago

Thanks for the suggestion. I have added a warning in the newer version!

adantart commented 7 months ago

> It is limited to 384 words (around 512 subtokens) deberta

1) I'd like to help, if possible, to pretrain a model with a higher limit ... is that possible?

2) Also, I'd like to publish one specific to the legal sector and Spanish. Tell me how to do it (I have all the necessary datasets) and I can publish it! I've seen the fine-tuning notebook ... but I'd like to train from a bigger model than "urchade/gliner_multi-v2.1", since I think that one is like "medium", right?