huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Fix truncation length assertion #1382

Closed boyleconnor closed 7 months ago

boyleconnor commented 8 months ago

Fixes #1326

I haven't seen any explanation from @Narsil of why the line in question was changed, so I went ahead and made this PR reverting it.

I've tested my version of this locally and can confirm that two slightly different invalid stride values no longer fail with different errors at different places:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('bert-base-cased')
tokenizer.enable_truncation(max_length=10, stride=9)  # This line still fails (as it should)
print(tokenizer.encode("This piece of text is at least ten tokens long. In fact, it is likely many more than that."))

tokenizer = Tokenizer.from_pretrained('bert-base-cased')
tokenizer.enable_truncation(max_length=10, stride=8)  # Now this line (correctly) fails too
print(tokenizer.encode("This piece of text is at least ten tokens long. In fact, it is likely many more than that."))
```
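For anyone following along, the invariant being checked is that the stride must leave room for new content in each overflowing chunk once special tokens are accounted for (`bert-base-cased` adds 2, `[CLS]` and `[SEP]`, so with `max_length=10` any `stride >= 8` is invalid). Here is a minimal pure-Python sketch of that rule; the function name and error message are illustrative, not the library's actual internals:

```python
# Hypothetical sketch of the stride assertion this PR restores.
# `num_special_tokens` is the count of tokens the model adds per sequence
# (e.g. [CLS] and [SEP] for BERT, hence the default of 2).
def validate_truncation(max_length: int, stride: int, num_special_tokens: int = 2) -> None:
    # Each overflowing chunk keeps `stride` tokens of overlap with the
    # previous chunk. If the overlap fills all the non-special slots,
    # no new tokens fit and chunking cannot make progress.
    effective = max_length - num_special_tokens
    if stride >= effective:
        raise ValueError(
            f"stride ({stride}) must be strictly less than max_length "
            f"minus the number of special tokens ({effective})"
        )

validate_truncation(max_length=10, stride=7)      # ok: 7 < 10 - 2
try:
    validate_truncation(max_length=10, stride=8)  # rejected: 8 >= 10 - 2
except ValueError as e:
    print(e)
```

With this check applied at `enable_truncation` time, both `stride=9` and `stride=8` are rejected up front instead of one of them surfacing later as a different error inside `encode`.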
HuggingFaceDocBuilderDev commented 8 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

ArthurZucker commented 6 months ago

Sorry @boyleconnor, I'll see whether this should be merged or not; I am not sure either 😉 The original behavior was not really breaking, but the fix is breaking in a way

boyleconnor commented 6 months ago

@ArthurZucker I'm not sure what you mean, would you mind elaborating?