openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

Tokenizer requires protobuf 3 #69

Closed: Maykeye closed this 12 months ago

Maykeye commented 1 year ago

(Transformers v4.30.2)

The OpenLLaMA tokenizer can't be used out of the box unless protobuf 3 is installed or the relevant environment variable is changed. And since many packages now require protobuf 4, protobuf is prone to being upgraded, and then this happens:

In [3]: AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")

File ~/src/sd/sd/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:691, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    687     if tokenizer_class is None:
    688         raise ValueError(
    689             f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    690         )
--> 691     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    693 # Otherwise we have to be creative.
    694 # if model is an encoder decoder, the encoder tokenizer class is used by default
    695 if isinstance(config, EncoderDecoderConfig):

File ~/src/sd/sd/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1825, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1822     else:
   1823         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1825 return cls._from_pretrained(
   1826     resolved_vocab_files,
   1827     pretrained_model_name_or_path,
   1828     init_configuration,
   1829     *init_inputs,
   1830     use_auth_token=use_auth_token,
   1831     cache_dir=cache_dir,
   1832     local_files_only=local_files_only,
   1833     _commit_hash=commit_hash,
   1834     _is_local=is_local,
   1835     **kwargs,
   1836 )

File ~/src/sd/sd/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1988, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
   1986 # Instantiate tokenizer.
   1987 try:
-> 1988     tokenizer = cls(*init_inputs, **init_kwargs)
   1989 except OSError:
   1990     raise OSError(
   1991         "Unable to load vocabulary from file. "
   1992         "Please check that the provided vocabulary is accessible and not corrupted."
   1993     )

File ~/src/sd/sd/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama_fast.py:93, in LlamaTokenizerFast.__init__(self, vocab_file, tokenizer_file, clean_up_tokenization_spaces, unk_token, bos_token, eos_token, add_bos_token, add_eos_token, **kwargs)
     81 def __init__(
     82     self,
     83     vocab_file=None,
   (...)
     91     **kwargs,
     92 ):
---> 93     super().__init__(
     94         vocab_file=vocab_file,
     95         tokenizer_file=tokenizer_file,
     96         clean_up_tokenization_spaces=clean_up_tokenization_spaces,
     97         unk_token=unk_token,
     98         bos_token=bos_token,
     99         eos_token=eos_token,
    100         **kwargs,
    101     )
    102     self._add_bos_token = add_bos_token
    103     self._add_eos_token = add_eos_token

File ~/src/sd/sd/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py:114, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
    111     fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    112 elif slow_tokenizer is not None:
    113     # We need to convert a slow tokenizer to build the backend
--> 114     fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
    115 elif self.slow_tokenizer_class is not None:
    116     # We need to create and convert a slow tokenizer to build the backend
    117     slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)

File ~/src/sd/sd/lib/python3.11/site-packages/transformers/convert_slow_tokenizer.py:1307, in convert_slow_tokenizer(transformer_tokenizer)
   1299     raise ValueError(
   1300         f"An instance of tokenizer class {tokenizer_class_name} cannot be converted in a Fast tokenizer instance."
   1301         " No converter was found. Currently available slow->fast convertors:"
   1302         f" {list(SLOW_TO_FAST_CONVERTERS.keys())}"
   1303     )
   1305 converter_class = SLOW_TO_FAST_CONVERTERS[tokenizer_class_name]
-> 1307 return converter_class(transformer_tokenizer).converted()

File ~/src/sd/sd/lib/python3.11/site-packages/transformers/convert_slow_tokenizer.py:445, in SpmConverter.__init__(self, *args)
    441 requires_backends(self, "protobuf")
    443 super().__init__(*args)
--> 445 from .utils import sentencepiece_model_pb2 as model_pb2
    447 m = model_pb2.ModelProto()
    448 with open(self.original_tokenizer.vocab_file, "rb") as f:

File ~/src/sd/sd/lib/python3.11/site-packages/transformers/utils/sentencepiece_model_pb2.py:91
     25 _sym_db = _symbol_database.Default()
     28 DESCRIPTOR = _descriptor.FileDescriptor(
     29     name="sentencepiece_model.proto",
     30     package="sentencepiece",
   (...)
     80     ),
     81 )
     84 _TRAINERSPEC_MODELTYPE = _descriptor.EnumDescriptor(
     85     name="ModelType",
     86     full_name="sentencepiece.TrainerSpec.ModelType",
     87     filename=None,
     88     file=DESCRIPTOR,
     89     create_key=_descriptor._internal_create_key,
     90     values=[
---> 91         _descriptor.EnumValueDescriptor(
     92             name="UNIGRAM",
     93             index=0,
     94             number=1,
     95             serialized_options=None,
     96             type=None,
     97             create_key=_descriptor._internal_create_key,
     98         ),
     99         _descriptor.EnumValueDescriptor(
    100             name="BPE",
    101             index=1,
    102             number=2,
    103             serialized_options=None,
    104             type=None,
    105             create_key=_descriptor._internal_create_key,
    106         ),
    107         _descriptor.EnumValueDescriptor(
    108             name="WORD",
    109             index=2,
    110             number=3,
    111             serialized_options=None,
    112             type=None,
    113             create_key=_descriptor._internal_create_key,
    114         ),
    115         _descriptor.EnumValueDescriptor(
    116             name="CHAR",
    117             index=3,
    118             number=4,
    119             serialized_options=None,
    120             type=None,
    121             create_key=_descriptor._internal_create_key,
    122         ),
    123     ],
    124     containing_type=None,
    125     serialized_options=None,
    126     serialized_start=1294,
    127     serialized_end=1347,
    128 )
    129 _sym_db.RegisterEnumDescriptor(_TRAINERSPEC_MODELTYPE)
    131 _MODELPROTO_SENTENCEPIECE_TYPE = _descriptor.EnumDescriptor(
    132     name="Type",
    133     full_name="sentencepiece.ModelProto.SentencePiece.Type",
   (...)
    190     serialized_end=2184,
    191 )

File ~/src/sd/sd/lib/python3.11/site-packages/google/protobuf/descriptor.py:796, in EnumValueDescriptor.__new__(cls, name, index, number, type, options, serialized_options, create_key)
    793 def __new__(cls, name, index, number,
    794             type=None,  # pylint: disable=redefined-builtin
    795             options=None, serialized_options=None, create_key=None):
--> 796   _message.Message._CheckCalledFromGeneratedFile()
    797   # There is no way we can build a complete EnumValueDescriptor with the
    798   # given parameters (the name of the Enum is not known, for example).
    799   # Fortunately generated files just pass it to the EnumDescriptor()
    800   # constructor, which will ignore it, so returning None is good enough.
    801   return None

TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

(It can be fixed by following the tip above:

$ ipython
>>> %env PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
>>> tokenizer.save_pretrained(".")

after which AutoTokenizer.from_pretrained('.') doesn't take the two minutes to load that AutoTokenizer.from_pretrained("openlm-research/open_llama_3b") does; otherwise it takes longer to load the tokenizer than the model.)
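The equivalent in a plain Python script would look roughly like this (a sketch; the save directory name is just an example, and the environment variable has to be set before transformers, and hence protobuf, is imported):

import os

# Must be set before transformers (and protobuf) is imported
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

from transformers import AutoTokenizer

# The slow->fast conversion happens here and is the slow part
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")

# Writes tokenizer.json; later loads pick it up directly and skip the conversion
tokenizer.save_pretrained("open_llama_3b_tokenizer")
tokenizer = AutoTokenizer.from_pretrained("open_llama_3b_tokenizer")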

And it's not only the 3B model: the most recent model on HF as of now, openlm-research/open_llama_7b_v2, has the same issue.
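(For completeness, the other workaround listed in the error message above is to pin protobuf to a 3.x release, e.g.

$ pip install "protobuf<=3.20.3"

though, as noted, that can clash with packages that require protobuf 4.)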

young-geng commented 1 year ago

The protobuf version is determined by the sentencepiece library, which is used by the original LLaMA. Unfortunately, we have no control over that.

Maykeye commented 1 year ago

Can't you just update the tokenizer to the "fast" one, which is what transformers wants for some reason? The problem occurs when transformers tries to convert it to the fast version, which takes a long time. But after that, if you save it and load it again, the tokenizer seems to work without protobuf or sentencepiece.

Here is a notebook that demonstrates it:

So whatever save_pretrained is doing, it seems to work.
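A rough sketch of the point being made (not the notebook itself): once save_pretrained has written tokenizer.json, the fast tokenizer is loaded straight from that file, so neither protobuf nor sentencepiece is touched on reload.

import os
from transformers import AutoTokenizer

# "." is where the tokenizer was saved in the snippet above
print("tokenizer.json" in os.listdir("."))       # True after save_pretrained(".")
tokenizer = AutoTokenizer.from_pretrained(".")   # loads tokenizer.json directly, no conversion
print(tokenizer("hello openllama").input_ids)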

young-geng commented 1 year ago

Unfortunately, we can't. There are other inference libraries, such as llama.cpp, that do not use the transformers fast tokenizer. We want OpenLLaMA to be a drop-in replacement for LLaMA in all libraries, not just transformers.