huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Tokenizer] Inconsistent behavior when decoding a single ID and a list of the single ID #29489

Closed Ki-Seki closed 1 month ago

Ki-Seki commented 8 months ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

Code

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
int_single_id = tokenizer.vocab_size-1
list_single_id = [tokenizer.vocab_size-1]
print(f'<<<<{tokenizer.decode(int_single_id)}>>>>')
print(f'<<<<{tokenizer.decode(list_single_id)}>>>>')

tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base", use_fast=False)
int_single_id = tokenizer.vocab_size-1
list_single_id = [tokenizer.vocab_size-1]
print(f'<<<<{tokenizer.decode(int_single_id)}>>>>')
print(f'<<<<{tokenizer.decode(list_single_id)}>>>>')

# Roughly estimated, around 15 models would have this issue.

Output

<<<<# # ~>>>>
<<<<##~>>>>
<<<<# # ~>>>>
<<<<##~>>>>

Expected behavior

Consistent behavior. For example, when decoding the single ID, the output could also be `##~`.

Suspected cause: in `src/transformers/tokenization_utils.py`, the `_decode` function incorrectly applies `spaces_between_special_tokens` and then adds spaces between the sub-tokens.
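To make the suspected mechanism concrete, here is a minimal sketch in plain Python (illustrative only, not the actual library code; `filtered_tokens` mirrors the variable name in `_decode`): the slow path gets a `str` back for an int ID but a `list` for a list of IDs, and iterating a `str` walks its characters.

```python
# Minimal sketch of the suspected mechanism -- not the actual library code.
filtered_tokens = "##~"                   # what conversion yields for a single int ID
sub_texts = [t for t in filtered_tokens]  # iterating a str yields characters: ['#', '#', '~']
print(" ".join(sub_texts))                # -> '# # ~'  (the buggy output)

filtered_tokens = ["##~"]                 # what conversion yields for a one-element list
print(" ".join(t for t in filtered_tokens))  # -> '##~'  (the expected output)
```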

ArthurZucker commented 8 months ago

That's very interesting, and I can confirm we have this issue. gemma just errors out if you pass an int instead of a list, with no proper warning, while the fast tokenizer works. I think adding a test in test_tokenization_common will help us know which models fail and which we have to update.

Ki-Seki commented 8 months ago

Yes, you're right. I added this test case to test_tokenization_common:

    def test_single_id(self):
        tokenizer = self.get_tokenizer()
        rust_tokenizer = self.get_rust_tokenizer()
        vocab_size = len(tokenizer)
        int_single_id = vocab_size - 1
        list_single_id = [vocab_size - 1]
        self.assertEqual(tokenizer.decode(int_single_id), tokenizer.decode(list_single_id))
        self.assertEqual(rust_tokenizer.decode(int_single_id), rust_tokenizer.decode(list_single_id))

The test results are below (scroll to the bottom for the list of failing models):

Details

```text
>       self.assertEqual(tokenizer.decode(int_single_id), tokenizer.decode(list_single_id))
E       AssertionError: 'l o w e s t' != 'lowest'
E       - l o w e s t
E       + lowest

tests/test_tokenization_common.py:4208: AssertionError

[... repeated tracebacks and pytest source echoes trimmed. Most failures are the
same AssertionError; the rest are environment issues unrelated to decoding:
  - AttributeError: 'NoneType' object has no attribute 'from_pretrained'
    (suites without a rust tokenizer, e.g. Tapas, VITS, Wav2Vec2, XLM, XLMProphetNet)
  - RuntimeError: espeak not installed on your system (Wav2Vec2Phoneme)
  - OSError: Can't load tokenizer for 'openai/whisper-tiny' and
    'robot-test/dummy-tokenizer-fast' (hub files not reachable)
  - pytest warnings summary trimmed as well ...]

=========================== short test summary info ============================
FAILED tests/models/bartpho/test_tokenization_bartpho.py::BartphoTokenizerTest::test_single_id
FAILED tests/models/bert/test_tokenization_bert.py::BertTokenizationTest::test_single_id
FAILED tests/models/bert_generation/test_tokenization_bert_generation.py::BertGenerationTokenizationTest::test_single_id
FAILED tests/models/bertweet/test_tokenization_bertweet.py::BertweetTokenizationTest::test_single_id
FAILED tests/models/biogpt/test_tokenization_biogpt.py::BioGptTokenizationTest::test_single_id
FAILED tests/models/blenderbot_small/test_tokenization_blenderbot_small.py::BlenderbotSmallTokenizerTest::test_single_id
FAILED tests/models/bloom/test_tokenization_bloom.py::BloomTokenizationTest::test_single_id
FAILED tests/models/byt5/test_tokenization_byt5.py::ByT5TokenizationTest::test_single_id
FAILED tests/models/canine/test_tokenization_canine.py::CanineTokenizationTest::test_single_id
FAILED tests/models/clvp/test_tokenization_clvp.py::ClvpTokenizationTest::test_single_id
FAILED tests/models/code_llama/test_tokenization_code_llama.py::CodeLlamaTokenizationTest::test_single_id
FAILED tests/models/ctrl/test_tokenization_ctrl.py::CTRLTokenizationTest::test_single_id
FAILED tests/models/distilbert/test_tokenization_distilbert.py::BertTokenizationTest::test_single_id
FAILED tests/models/distilbert/test_tokenization_distilbert.py::DistilBertTokenizationTest::test_single_id
FAILED tests/models/dpr/test_tokenization_dpr.py::BertTokenizationTest::test_single_id
FAILED tests/models/dpr/test_tokenization_dpr.py::DPRContextEncoderTokenizationTest::test_single_id
FAILED tests/models/dpr/test_tokenization_dpr.py::DPRQuestionEncoderTokenizationTest::test_single_id
FAILED tests/models/dpr/test_tokenization_dpr.py::DPRReaderTokenizationTest::test_single_id
FAILED tests/models/electra/test_tokenization_electra.py::ElectraTokenizationTest::test_single_id
FAILED tests/models/ernie_m/test_tokenization_ernie_m.py::ErnieMTokenizationTest::test_single_id
FAILED tests/models/fsmt/test_tokenization_fsmt.py::FSMTTokenizationTest::test_single_id
FAILED tests/models/funnel/test_tokenization_funnel.py::FunnelTokenizationTest::test_single_id
FAILED tests/models/gemma/test_tokenization_gemma.py::GemmaTokenizationTest::test_single_id
FAILED tests/models/gpt_neox_japanese/test_tokenization_gpt_neox_japanese.py::GPTNeoXJapaneseTokenizationTest::test_single_id
FAILED tests/models/gpt_sw3/test_tokenization_gpt_sw3.py::GPTSw3TokenizationTest::test_single_id
FAILED tests/models/gptsan_japanese/test_tokenization_gptsan_japanese.py::GPTSanJapaneseTokenizationTest::test_single_id
FAILED tests/models/layoutlm/test_tokenization_layoutlm.py::LayoutLMTokenizationTest::test_single_id
FAILED tests/models/layoutlmv2/test_tokenization_layoutlmv2.py::LayoutLMv2TokenizationTest::test_single_id
FAILED tests/models/luke/test_tokenization_luke.py::LukeTokenizerTest::test_single_id
FAILED tests/models/lxmert/test_tokenization_lxmert.py::LxmertTokenizationTest::test_single_id
FAILED tests/models/m2m_100/test_tokenization_m2m_100.py::M2M100TokenizationTest::test_single_id
FAILED tests/models/marian/test_tokenization_marian.py::MarianTokenizationTest::test_single_id
FAILED tests/models/mgp_str/test_tokenization_mgp_str.py::MgpstrTokenizationTest::test_single_id
FAILED tests/models/mluke/test_tokenization_mluke.py::MLukeTokenizerTest::test_single_id
FAILED tests/models/mobilebert/test_tokenization_mobilebert.py::MobileBERTTokenizationTest::test_single_id
FAILED tests/models/mpnet/test_tokenization_mpnet.py::MPNetTokenizerTest::test_single_id
FAILED tests/models/nougat/test_tokenization_nougat.py::NougatTokenizationTest::test_single_id
FAILED tests/models/perceiver/test_tokenization_perceiver.py::PerceiverTokenizationTest::test_single_id
FAILED tests/models/phobert/test_tokenization_phobert.py::PhobertTokenizationTest::test_single_id
FAILED tests/models/plbart/test_tokenization_plbart.py::PLBartTokenizationTest::test_single_id
FAILED tests/models/prophetnet/test_tokenization_prophetnet.py::ProphetNetTokenizationTest::test_single_id
FAILED tests/models/realm/test_tokenization_realm.py::RealmTokenizationTest::test_single_id
FAILED tests/models/roc_bert/test_tokenization_roc_bert.py::BertTokenizationTest::test_single_id
FAILED tests/models/roformer/test_tokenization_roformer.py::RoFormerTokenizationTest::test_single_id
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_single_id
FAILED tests/models/speech_to_text/test_tokenization_speech_to_text.py::SpeechToTextTokenizerTest::test_single_id
FAILED tests/models/speech_to_text_2/test_tokenization_speech_to_text_2.py::SpeechToTextTokenizerTest::test_single_id
FAILED tests/models/speecht5/test_tokenization_speecht5.py::SpeechT5TokenizerTest::test_single_id
FAILED tests/models/squeezebert/test_tokenization_squeezebert.py::BertTokenizationTest::test_single_id
FAILED tests/models/squeezebert/test_tokenization_squeezebert.py::SqueezeBertTokenizationTest::test_single_id
FAILED tests/models/tapas/test_tokenization_tapas.py::TapasTokenizationTest::test_single_id
FAILED tests/models/vits/test_tokenization_vits.py::VitsTokenizerTest::test_single_id
FAILED tests/models/wav2vec2/test_tokenization_wav2vec2.py::Wav2Vec2CTCTokenizerTest::test_single_id
FAILED tests/models/wav2vec2_phoneme/test_tokenization_wav2vec2_phoneme.py::Wav2Vec2PhonemeCTCTokenizerTest::test_single_id
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_single_id
FAILED tests/models/xlm/test_tokenization_xlm.py::XLMTokenizationTest::test_single_id
FAILED tests/models/xlm_prophetnet/test_tokenization_xlm_prophetnet.py::XLMProphetNetTokenizationTest::test_single_id
FAILED tests/tokenization/test_tokenization_fast.py::PreTrainedTokenizationFastTest::test_single_id
============ 58 failed, 33 passed, 6 skipped, 3 warnings in 13.87s =============
```

ArthurZucker commented 8 months ago

Feel free to open a PR for a fix. IMO we should not have spaces added in this case.

Ki-Seki commented 8 months ago

No problem, I will try to do this, but I have some other research work that needs to be pushed forward right now, so I may get to it later.

MariaHei commented 4 months ago

Hi :) I'm pretty sure the issue is not how `spaces_between_special_tokens` is used, but that single tokens are split into letters here. To fix it, I'd suggest adding the following before iterating over the tokens:

if isinstance(filtered_tokens, str):
    filtered_tokens = [filtered_tokens]
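For illustration, here is a simplified stand-in showing where the guard sits; `decode_sketch` and `toy_convert` are made-up names for this sketch, not the library's code:

```python
# Simplified stand-in for the slow tokenizer's decode path; names are made up
# for this sketch (the real code lives in tokenization_utils.py's _decode).
def decode_sketch(token_ids, convert_ids_to_tokens):
    filtered_tokens = convert_ids_to_tokens(token_ids)  # str for an int, list for a list
    if isinstance(filtered_tokens, str):                # the proposed guard
        filtered_tokens = [filtered_tokens]
    # The loop over filtered_tokens now always sees whole tokens, never characters.
    return " ".join(filtered_tokens)

# Toy converter pretending ID 1 maps to the token '##~':
toy_convert = lambda ids: "##~" if isinstance(ids, int) else ["##~" for _ in ids]
assert decode_sketch(1, toy_convert) == decode_sketch([1], toy_convert) == "##~"
```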

I ran a couple of the test cases reported as failing above with a slightly modified version of the test function proposed by @Ki-Seki, and they pass now:

def test_single_id(self):
    tokenizer = self.get_tokenizer()
    vocab_size = len(tokenizer)
    int_single_id = vocab_size - 1
    list_single_id = [vocab_size - 1]
    self.assertEqual(tokenizer.decode(int_single_id), tokenizer.decode(list_single_id))
    if self.test_rust_tokenizer:
        rust_tokenizer = self.get_rust_tokenizer()
        self.assertEqual(rust_tokenizer.decode(int_single_id), rust_tokenizer.decode(list_single_id))

Unfortunately, I can't run all of the test cases (I keep running into weird Python segmentation faults that occur even without having changed the library at all). Does anyone know a trick for running the test cases anyway, or is it OK if I create a pull request and wait for the CI tests?
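(For reference, individual suites can be run in isolation with standard pytest selection, e.g. `python -m pytest tests/models/bert/test_tokenization_bert.py -k test_single_id`; it is the full run that crashes for me.)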

ArthurZucker commented 3 months ago

You can create a PR and rely on the CIs for sure! 🤗

DuyguA commented 2 months ago

Hello @ArthurZucker and all, I don't think this is an issue related to specific IDs, but rather a general problem. I tested a bit on my local machine, but to make sure my local setup wasn't a factor, I also tested on Colab:

[Screenshot: colab_ids]

It looks to me like the problem is a signature mismatch between the `_decode` methods of the `PreTrainedTokenizerBase` and `PreTrainedTokenizer` classes: https://github.com/huggingface/transformers/blob/74b92c62560b7ade42d35a49f9063adc8b805c4a/src/transformers/tokenization_utils_base.py#L3913-L3915 https://github.com/huggingface/transformers/blob/74b92c62560b7ade42d35a49f9063adc8b805c4a/src/transformers/tokenization_utils.py#L1062-L1064

The fast tokenizer has the correct signature: https://github.com/huggingface/transformers/blob/74b92c62560b7ade42d35a49f9063adc8b805c4a/src/transformers/tokenization_utils_fast.py#L640-L642 Consequently, the slow tokenizer's `_decode` handles only a list of IDs, not a single ID. If `filtered_tokens` is a single string rather than a list of strings, the loop iterates over its characters and processes them one by one, so @MariaHei is totally right: https://github.com/huggingface/transformers/blob/74b92c62560b7ade42d35a49f9063adc8b805c4a/src/transformers/tokenization_utils.py#L1082
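Condensed, the mismatch looks like this (signatures paraphrased from the linked lines; bodies and most kwargs omitted):

```python
from typing import List, Union

# tokenization_utils_base.py and tokenization_utils_fast.py accept both forms:
def _decode(self, token_ids: Union[int, List[int]], skip_special_tokens: bool = False, **kwargs) -> str: ...

# tokenization_utils.py (the slow tokenizer) is annotated for a list only, and
# its body assumes one -- a bare int ends up iterated character by character:
def _decode(self, token_ids: List[int], skip_special_tokens: bool = False, **kwargs) -> str: ...
```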

Also, there are not many decoding tests, though there are lots of encoding tests :blush: I added a quick signature fix and return statements, and also added some decode tests in my PR.