huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Tokenizer: discard data that exceeds max_length #31627

Open fengyunflya opened 3 months ago

fengyunflya commented 3 months ago

Feature request

When using the tokenizer, it truncates data to max_length, but there is no option to simply discard the data instead.

Motivation

Sometimes we want the sentence to remain complete rather than be truncated.

Your contribution

No

seanswyi commented 3 months ago

To clarify, are you saying that you want there to be an option so that if the tokenizer must truncate the input it would just discard it entirely? Wouldn't it be better to handle this at the pre-processing stage before you tokenize the data?
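A minimal sketch of what that pre-processing filter could look like, assuming a Hugging Face tokenizer and a plain list of strings (`texts`, `max_length`, and `kept` are illustrative names, not tokenizer parameters):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
max_length = 128  # illustrative limit

texts = ["a short sentence", "a much longer sentence that might not fit"]

# Encode without truncation to see each sentence's true token length,
# then keep only the texts whose encoding fits within max_length.
encodings = tokenizer(texts, truncation=False)
kept = [
    text
    for text, ids in zip(texts, encodings["input_ids"])
    if len(ids) <= max_length
]
```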

fengyunflya commented 3 months ago

> To clarify, are you saying that you want there to be an option so that if the tokenizer must truncate the input it would just discard it entirely? Wouldn't it be better to handle this at the pre-processing stage before you tokenize the data?

For example, I have a sentence that may exceed the max length, but I have to encode it to find that out. If I pre-process the data separately, and I have a lot of it, that wastes time, because I would have to batch tokenize everything again later. If there were a parameter in the tokenizer method that simply discarded a sentence whenever its encoding exceeds max_length, each sentence would only need to be encoded once.

amyeroberts commented 3 months ago

cc @ArthurZucker

ArthurZucker commented 1 month ago

Hey! This has not been requested much. I would recommend doing this manually, for example in your data collator: first encode everything, discard what's too long, then pad!
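A minimal sketch of that approach, assuming the examples arrive at the collator already tokenized (without truncation) and carry their `input_ids`; the function name `collate_discard_long` and the `max_length` value are illustrative, not an existing transformers API:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
max_length = 128  # illustrative limit

def collate_discard_long(batch):
    """Drop examples whose encoding exceeds max_length, then pad the rest."""
    kept = [ex for ex in batch if len(ex["input_ids"]) <= max_length]
    if not kept:
        # The whole batch was too long; the caller decides how to handle this case.
        return None
    # tokenizer.pad dynamically pads to the longest remaining sequence in the batch.
    return tokenizer.pad(kept, padding=True, return_tensors="pt")
```

The collator can then be passed to a `torch.utils.data.DataLoader` via `collate_fn=collate_discard_long`, so each example is encoded only once and over-length ones are dropped at batching time.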