Open fengyunflya opened 3 months ago
To clarify, are you saying that you want there to be an option so that if the tokenizer must truncate the input it would just discard it entirely? Wouldn't it be better to handle this at the pre-processing stage before you tokenize the data?
For example, I have sentences that may exceed the max length, but I won't know until I encode them. If I filter them at the pre-processing stage, I would have to tokenize a lot of data twice, since I still need to batch-tokenize it afterwards, and that wastes time. If the tokenizer method had a parameter that simply discarded any sentence whose encoding exceeds max_length, each sentence would only need to be encoded once.
cc @ArthurZucker
Hey! This has not been requested much; I would recommend doing this manually, for example in your data collator: encode everything first, discard what's too long, then pad!
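A minimal sketch of that collator logic. The `toy_tokenize` function below is a hypothetical stand-in for a real tokenizer call (with a Hugging Face tokenizer you would encode with `truncation=False` and pad with `tokenizer.pad`); the encode-once, discard, then pad steps are what matter:

```python
PAD_ID = 0  # assumed pad token id for this sketch

def toy_tokenize(sentence):
    # Stand-in tokenizer: one deterministic "token id" per word.
    return [sum(map(ord, w)) % 1000 + 1 for w in sentence.split()]

def encode_and_discard(sentences, max_length):
    # Encode everything once, without truncation, to see true lengths.
    encoded = [toy_tokenize(s) for s in sentences]
    # Discard sequences longer than max_length instead of truncating.
    kept = [ids for ids in encoded if len(ids) <= max_length]
    # Pad the survivors to a uniform length.
    return [ids + [PAD_ID] * (max_length - len(ids)) for ids in kept]

batch = encode_and_discard(
    ["short sentence", "this one is definitely far too long to keep"],
    max_length=5,
)
# Only the first sentence survives, padded to length 5.
```

Each sentence is tokenized exactly once, which addresses the double-encoding concern above.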
Feature request
When using the tokenizer, it truncates data to max_length, but there is no option to simply discard the data instead.
Motivation
Sometimes we want the sentence to be kept complete rather than truncated.
Your contribution
No