huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Fix truncation length assertion #1382

Closed boyleconnor closed 7 months ago

boyleconnor commented 8 months ago

Fixes #1326

I haven't seen any explanation from @Narsil of why the line in question was changed, so I went ahead and made this PR reverting it.

I've tested my version of this locally and can confirm that two slightly different invalid stride values no longer fail with different errors at different places:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('bert-base-cased')
tokenizer.enable_truncation(max_length=10, stride=9)  # This line still fails (as it should)
print(tokenizer.encode("This piece of text is at least ten tokens long. In fact, it is likely many more than that."))

tokenizer = Tokenizer.from_pretrained('bert-base-cased')
tokenizer.enable_truncation(max_length=10, stride=8)  # Now this line (correctly) fails too
print(tokenizer.encode("This piece of text is at least ten tokens long. In fact, it is likely many more than that."))
```
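For anyone following along, the invariant being checked is that the stride must leave room for new content in each overflowing chunk once special tokens are accounted for (`bert-base-cased` adds 2, `[CLS]` and `[SEP]`, so with `max_length=10` any `stride >= 8` is invalid). Here is a minimal pure-Python sketch of that rule; the function name and error message are illustrative, not the library's actual internals:

```python
# Hypothetical sketch of the stride assertion this PR restores.
# `num_special_tokens` is the count of tokens the model adds per sequence
# (e.g. [CLS] and [SEP] for BERT, hence the default of 2).
def validate_truncation(max_length: int, stride: int, num_special_tokens: int = 2) -> None:
    # Each overflowing chunk keeps `stride` tokens of overlap with the
    # previous chunk. If the overlap fills all the non-special slots,
    # no new tokens fit and chunking cannot make progress.
    effective = max_length - num_special_tokens
    if stride >= effective:
        raise ValueError(
            f"stride ({stride}) must be strictly less than max_length "
            f"minus the number of special tokens ({effective})"
        )

validate_truncation(max_length=10, stride=7)      # ok: 7 < 10 - 2
try:
    validate_truncation(max_length=10, stride=8)  # rejected: 8 >= 10 - 2
except ValueError as e:
    print(e)
```

With this check applied at `enable_truncation` time, both `stride=9` and `stride=8` are rejected up front instead of one of them surfacing later as a different error inside `encode`.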
HuggingFaceDocBuilderDev commented 8 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

ArthurZucker commented 6 months ago

Sorry @boyleconnor, I'll see whether this should be merged or not; I am not sure either 😉 The original behavior was not really breaking, but the fix is breaking in a way

boyleconnor commented 6 months ago

@ArthurZucker I'm not sure what you mean, would you mind elaborating?