iPieter / RobBERT

A Dutch RoBERTa-based language model
https://pieter.ai/robbert/
MIT License

Maximum sequence length tokenizer #28

Closed · danieldk closed this 2 years ago

danieldk commented 2 years ago

It would be nice if model_max_length could be set in the tokenizer configuration. If it is not set, the maximum input length for the transformer model defaults to VERY_LARGE_INTEGER (1e30):

https://huggingface.co/transformers/main_classes/tokenizer.html#pretrainedtokenizer

This then leads to an exception in the embedding lookup, because the model will attempt to index the position embeddings with positions >= 512.
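A minimal sketch of the failure mode and a user-side workaround, assuming the RobBERT checkpoint is loaded from the Hugging Face hub under the name `pdelobelle/robbert-v2-dutch-base` (not stated in this issue):

```python
from transformers import AutoModel, AutoTokenizer

# Without model_max_length in tokenizer_config.json, tokenizer.model_max_length
# falls back to VERY_LARGE_INTEGER (1e30), so nothing is truncated by default
# and inputs longer than 512 tokens overflow the position embedding table.
tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = AutoModel.from_pretrained("pdelobelle/robbert-v2-dutch-base")

long_text = "woord " * 1000  # far longer than the model's 512-position limit

# Workaround on the caller's side: cap the length explicitly when tokenizing.
inputs = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")
outputs = model(**inputs)  # no indexing error, since all positions stay < 512
```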

iPieter commented 2 years ago

We fixed this in #e28720. Sorry for the delay.
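For reference, a sketch of what the behaviour looks like once `model_max_length` is present in the tokenizer configuration, plus a manual override for older local copies that still lack the setting (the checkpoint name is again an assumption, not taken from this issue):

```python
from transformers import AutoTokenizer

# With "model_max_length": 512 in tokenizer_config.json, truncation works
# out of the box when requested.
tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
print(tokenizer.model_max_length)  # expected: 512

# Equivalent manual override for a checkpoint without the setting.
tokenizer = AutoTokenizer.from_pretrained(
    "pdelobelle/robbert-v2-dutch-base", model_max_length=512
)
ids = tokenizer("woord " * 1000, truncation=True)["input_ids"]
assert len(ids) <= 512
```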