huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

"from_pretrained" reads the wrong config file: "config.json" instead of "tokenizer_config.json" #31282

Closed daehuikim closed 1 month ago

daehuikim commented 3 months ago

Hi, I found an interesting bug (maybe I'm wrong) in `from_pretrained`. Below is the code that reproduces the bug.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained(
    model,
    local_files_only=True,
)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(
    model,
    trust_remote_code=True,
    TOKENIZERS_PARALLELISM=True,
    local_files_only=True,
    skip_special_tokens=True,
)
```

The model directory contains the fine-tuned T5 tensors and the other files produced by training. The specific tree is shown below:

```
model/
├── config.json             // T5 configuration, "architectures": ["T5ForConditionalGeneration"]
├── generation_config.json
├── model.safetensors
├── special_tokens_map.json
├── spiece.model
├── tokenizer.json
├── tokenizer_config.json
└── ...(other files)
```

Whenever I run the code above, I get an error like the following:

```
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "inference_script.py", line 33, in <module>
    tokenizer = T5Tokenizer.from_pretrained(
  File "/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2010, in from_pretrained
    resolved_config_file = cached_file(
  File "/python3.9/site-packages/transformers/utils/hub.py", line 462, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'T5ForConditionalGeneration(
```

However, after moving the tokenizer-related files into a separate directory and adjusting the code, I get no errors. The fixed code and changed directory tree are below:

```python
model = T5ForConditionalGeneration.from_pretrained(
    model,
    local_files_only=True,
)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(
    tokenizer_path,
    trust_remote_code=True,
    TOKENIZERS_PARALLELISM=True,
    local_files_only=True,
    skip_special_tokens=True,
)
```

with the tokenizer files in `tokenizer_path`:

```
tokenizer_path/
├── special_tokens_map.json
├── spiece.model
├── tokenizer.json
└── tokenizer_config.json
```

Therefore, I guess the tokenizer's `from_pretrained()` method is reading `config.json` instead of `tokenizer_config.json`. If I am right, could you fix this in a future release? (It seems that if both `config.json` and `tokenizer_config.json` exist, `config.json` always wins.) Thanks for reading my issue!

ArthurZucker commented 3 months ago

The `T5Tokenizer` is unrelated to `tokenizers`. cc @itazap if you are able to reproduce with "t5-base"

itazap commented 2 months ago

Hi @daehuikim! Regarding your code snippet: not sure if this was only for the purpose of the snippet, but is your first `model` variable referring to the path, and then being overwritten by the model object itself?

`from_pretrained` expects:

```
Args:
    pretrained_model_name_or_path (`str` or `os.PathLike`):
        Can be either:

        - A string, the *model id* of a predefined tokenizer hosted inside a model repo on huggingface.co.
        - A path to a *directory* containing vocabulary files required by the tokenizer, for instance saved
          using the [`~tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained`] method, e.g.,
          `./my_model_directory/`.
        - (**Deprecated**, not applicable to all derived classes) A path or url to a single saved vocabulary
          file (if and only if the tokenizer only requires a single vocabulary file like Bert or XLNet), e.g.,
          `./my_model_directory/vocab.txt`.
```

Maybe I misunderstood your question! Please let me know! 😊
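For what it's worth, the error message itself hints at this: the path argument gets stringified during resolution, so a model object passed in place of a path leaks its `repr` into the lookup, which is why the `OSError` begins with `'T5ForConditionalGeneration('`. A minimal simulation of that failure mode (this is not transformers' actual code; `resolve_path` and `FakeT5Model` are made up for illustration):

```python
import os

def resolve_path(pretrained_model_name_or_path):
    # Sketch of the resolution step: the argument is treated as a string
    # (repo id or local path); anything else is stringified first.
    path = str(pretrained_model_name_or_path)
    if not os.path.isdir(path):
        raise OSError(f"Incorrect path_or_model_id: '{path}'")
    return path

class FakeT5Model:
    """Stands in for a loaded T5ForConditionalGeneration instance."""
    def __repr__(self):
        return "T5ForConditionalGeneration("

# The path variable has been overwritten by the model object itself,
# so the model's repr ends up in the error message.
model = FakeT5Model()
try:
    resolve_path(model)
except OSError as err:
    print(err)  # Incorrect path_or_model_id: 'T5ForConditionalGeneration('
```

So the tokenizer never read `config.json` at all; the lookup failed before any file was opened, because the argument was not a path.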

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.