tamas-visy / cs4nlp-plmrb


Check maximum sentence length #14

Closed by tamas-visy 1 month ago

tamas-visy commented 5 months ago

When we tokenize text input for the PLMs, overly long sentences seem to be truncated:

https://github.com/tamas-visy/cs4nlp-plmrb/blob/5f22c5de952018fc075a57c98f76eff03ae5683c/src/models/language_model.py#L76

We should check whether any of our sentences are close to or above the limit we use.

(Currently, that limit is 512, presumably counted in tokens.)
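
A minimal sketch of such a check, assuming the limit is counted in tokens. The function name `find_long_sentences`, the `margin` parameter, and the `tokenize` callable are hypothetical and not taken from the repository; `tokenize` should be whatever turns one sentence into its token sequence with truncation disabled.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def find_long_sentences(
    sentences: Iterable[str],
    tokenize: Callable[[str], Sequence],  # hypothetical hook: sentence -> tokens, no truncation
    max_length: int = 512,                # the limit mentioned above, assumed to be in tokens
    margin: float = 0.9,                  # assumption: "close to" means >= 90% of the limit
) -> List[Tuple[int, int, bool]]:
    """Return (index, token_count, exceeds_limit) for sentences near or over the limit."""
    threshold = int(max_length * margin)
    flagged = []
    for index, sentence in enumerate(sentences):
        count = len(tokenize(sentence))
        if count >= threshold:
            # The third field is True only when the sentence would actually be truncated.
            flagged.append((index, count, count > max_length))
    return flagged


if __name__ == "__main__":
    # Whitespace splitting stands in for the real tokenizer here.
    demo = ["a short sentence", " ".join(["word"] * 600)]
    for index, count, exceeds in find_long_sentences(demo, str.split):
        print(index, count, "over limit" if exceeds else "near limit")
```

If the models use Hugging Face tokenizers (an assumption), something like `lambda s: tokenizer(s, truncation=False)["input_ids"]` could be passed as `tokenize`, so that special tokens are included in the count.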