huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Enable `dropout = 0.0` as an equivalent to `none` in BPE #1550

Closed mcognetta closed 1 week ago

mcognetta commented 3 weeks ago

This is related to the discussion in #1541.

This PR allows `0.0` to be used as the dropout value in BPE models, with the same behavior as `none`. Previously, the docs and implementation were inconsistent:

This change simply makes `0.0` an acceptable value at initialization and enables caching during tokenization when `dropout == 0.0`.

E.g., the following now works:

>>> from tokenizers import Tokenizer, models
>>> tokenizer = Tokenizer(models.BPE(dropout = 0.0))
>>> s = tokenizer.to_str()
>>> s
'{"version":"1.0","truncation":null,"padding":null,"added_tokens":[],"normalizer":null,"pre_tokenizer":null,"post_processor":null,"decoder":null,"model":{"type":"BPE","dropout":0.0,"unk_token":null,"continuing_subword_prefix":null,"end_of_word_suffix":null,"fuse_unk":false,"byte_fallback":false,"ignore_merges":false,"vocab":{},"merges":[]}}'
>>> deserialized = Tokenizer.from_str(s)
>>> deserialized.model.dropout
0.0

whereas before it errored.
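
To illustrate why `dropout == 0.0` can safely share the cached path with `none`, here is a minimal pure-Python sketch. It is a toy model, not the library's Rust implementation: `bpe_encode` and `ToyBPE` are hypothetical names, and the merge loop is a simplified stand-in for real BPE. The point is only that when no merge can ever be dropped, encoding is deterministic, so results can be memoized.

```python
import random


def bpe_encode(word, merges, dropout=None, rng=random):
    """Toy BPE: repeatedly apply the best-ranked merge.

    Each candidate merge is skipped with probability `dropout`.
    With dropout None or 0.0 nothing is ever skipped.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect mergeable adjacent pairs, minus randomly dropped ones.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks and not (dropout and rng.random() < dropout)
        ]
        if not candidates:
            break
        _, i = min(candidates)  # lowest rank wins, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


class ToyBPE:
    """Hypothetical wrapper showing the caching decision only."""

    def __init__(self, merges, dropout=None):
        # Assumption: 0.0 is accepted alongside None, as this PR proposes.
        if dropout is not None and not 0.0 <= dropout <= 1.0:
            raise ValueError("dropout must be in [0.0, 1.0]")
        self.merges = merges
        self.dropout = dropout
        self.cache = {}

    def encode(self, word):
        # None and 0.0 are both deterministic, so both may use the cache.
        if self.dropout is None or self.dropout == 0.0:
            if word not in self.cache:
                self.cache[word] = bpe_encode(word, self.merges)
            return list(self.cache[word])
        # Any positive dropout is stochastic and must bypass the cache.
        return bpe_encode(word, self.merges, self.dropout)
```

In this sketch, `ToyBPE(merges, dropout=0.0)` and `ToyBPE(merges)` produce identical output and both populate the cache, while any positive dropout skips the cache entirely.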


As future work, I think dropout should be made non-optional, with a default of 0.0. This would remove the checks for dropout.is_none(), etc., while keeping the functionality the same. However, I guess this would be a breaking change (since all tokenizers serialized before this change would then be invalid?).
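
The compatibility concern can be made concrete with a small sketch. Assuming older serialized tokenizers store `"dropout": null` (or omit the field), one possible way to keep them loadable would be to normalize a missing value to `0.0` at load time. `load_dropout` below is a hypothetical helper written for illustration, not part of the tokenizers API, and the JSON strings are trimmed-down examples.

```python
import json


def load_dropout(serialized: str) -> float:
    """Read the BPE dropout from serialized tokenizer JSON.

    Hypothetical helper: treats a null or missing "dropout"
    as 0.0 so that older files would still load.
    """
    model = json.loads(serialized)["model"]
    dropout = model.get("dropout")
    return 0.0 if dropout is None else float(dropout)


# Trimmed-down examples of old-style and new-style serializations.
old_style = '{"model": {"type": "BPE", "dropout": null}}'
new_style = '{"model": {"type": "BPE", "dropout": 0.0}}'
print(load_dropout(old_style), load_dropout(new_style))
```

Whether the real deserializer could apply a default like this is an open question for the maintainers; the sketch only shows that the two representations can be mapped to the same value.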

HuggingFaceDocBuilderDev commented 3 weeks ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker commented 3 weeks ago

rebasing on main should fix clippy issues

mcognetta commented 2 weeks ago

Thanks. I fixed one of the lint errors which was a range readability thing.

mcognetta commented 2 weeks ago

Fixed one more formatting issue. Now I think it should be all good!

mcognetta commented 2 weeks ago

I goofed, and now it's getting worse 😬

mcognetta commented 2 weeks ago

Ok, it's all good now, unless you want me to squash the commits.