huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Wrapping Tokenizer leads to version error #1364

Closed shivanraptor closed 9 months ago

shivanraptor commented 9 months ago

Here is my code to wrap my custom tokenizer with a PreTrainedTokenizerFast:

from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=my_custom_tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)

wrapped_tokenizer.push_to_hub('my-tokenizer')  # assumes I have already logged in to the Hugging Face Hub

It results in this error:

ImportError: tokenizers>=0.11.1,!=0.11.3,<0.14 is required for a normal functioning of this module, but found tokenizers==0.14.1.
Try: pip install transformers -U or pip install -e '.[dev]' if you're working with git main

I believe it's due to my recent update to tokenizers 0.14.1; transformers 4.34.0 does not support this version.

I tried to downgrade tokenizers to 0.13.3, but transformers 4.34.0 does not support it either. I guess I have to wait for the next transformers update.
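
A quick way to check which package versions the active environment actually has installed, without importing transformers itself, is a sketch like the following (not part of the original report; useful when a stale or cached install is suspected):

from importlib.metadata import version

# Query the installed distributions directly; this avoids triggering the
# ImportError that transformers raises at import time when the version pin is violated.
print("tokenizers  :", version("tokenizers"))
print("transformers:", version("transformers"))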

shivanraptor commented 9 months ago

I had to downgrade to tokenizers 0.13.3 and transformers 4.28.0 in order to make the above code work.

ArthurZucker commented 9 months ago

That's not expected; transformers supports tokenizers<=0.15, see here. Make sure to check the packages in your running environment. If you can send a snippet of the custom tokenizer class, I can try to reproduce this if needed.
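
For reference, a minimal reproduction might look like the sketch below. It assumes a small BPE tokenizer built with the tokenizers library as a stand-in for the custom tokenizer, which is not shown in the issue:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Stand-in for the custom tokenizer from the issue (assumption: a small BPE model).
my_custom_tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
my_custom_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<cls>", "<sep>", "<mask>"]
)
my_custom_tokenizer.train_from_iterator(["some sample text", "another line"], trainer=trainer)

# Wrap it as in the issue; if this import and call succeed,
# the installed tokenizers/transformers versions are compatible.
wrapped = PreTrainedTokenizerFast(
    tokenizer_object=my_custom_tokenizer,
    unk_token="<unk>",
    pad_token="<pad>",
)
print(wrapped("some sample text"))

If this minimal version raises the same ImportError, the problem is the environment rather than the custom tokenizer itself.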

shivanraptor commented 9 months ago

Maybe it was a caching issue. The latest versions work fine now. Sorry for the confusion.