huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

transformers.pipeline does not load tokenizer passed as string for custom models #31669

Closed chedatomasz closed 1 month ago

chedatomasz commented 3 months ago

System Info

Who can help?

@Narsil

Information

Tasks

Reproduction

  1. Locate a model which
    • isn't present in TOKENIZER_MAPPING
    • doesn't specify model_config.tokenizer_class
    • nevertheless has a tokenizer on the Hub, loadable with AutoTokenizer
    These requirements mean this happens for custom models only (models not integrated into the library), AFAIK. Running such models requires trust_remote_code=True, so it might be wise to create your own example meeting these requirements. I will be using "tcheda/mot_test".
  2. Verify that the code works properly when the tokenizer and model are passed as pre-instantiated objects
    from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

    tokenizer = AutoTokenizer.from_pretrained("tcheda/mot_test", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained("tcheda/mot_test", trust_remote_code=True)
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
  3. Attempt to create a pipeline, specifying both the model and the tokenizer as strings.
    from transformers import pipeline
    pipe = pipeline("text-generation", model="tcheda/mot_test", tokenizer="tcheda/mot_test", trust_remote_code=True)
  4. The code crashes immediately with:
    
    /usr/local/lib/python3.10/dist-packages/transformers/pipelines/__init__.py in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
       1106         kwargs["device"] = device
       1107 
    -> 1108     return pipeline_class(model=model, framework=framework, task=task, **kwargs)

    /usr/local/lib/python3.10/dist-packages/transformers/pipelines/text_generation.py in __init__(self, *args, **kwargs)
         94 
         95     def __init__(self, *args, **kwargs):
    ---> 96         super().__init__(*args, **kwargs)
         97         self.check_model_type(
         98             TF_MODEL_FOR_CAUSAL_LM_MAPPING_NAMES if self.framework == "tf" else MODEL_FOR_CAUSAL_LM_MAPPING_NAMES

    /usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py in __init__(self, model, tokenizer, feature_extractor, image_processor, modelcard, framework, task, args_parser, device, torch_dtype, binary_output, **kwargs)
        895             self.tokenizer is not None
        896             and self.model.can_generate()
    --> 897             and self.tokenizer.pad_token_id is not None
        898             and self.model.generation_config.pad_token_id is None
        899         ):

    AttributeError: 'str' object has no attribute 'pad_token_id'


  5. Explanation
    The bug is probably at https://github.com/huggingface/transformers/blob/1c68f2cafb4ca54562f74b66d1085b68dd6682f5/src/transformers/pipelines/__init__.py#L907:
    `load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None`
    For a custom model, neither condition holds, and the check never considers whether `tokenizer` was passed as a string. As a result, the tokenizer initialization block (which contains proper handling of the string case) at https://github.com/huggingface/transformers/blob/1c68f2cafb4ca54562f74b66d1085b68dd6682f5/src/transformers/pipelines/__init__.py#L907 is never entered, and the raw string is passed through unchanged to the pipeline class.
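To make the control flow concrete, here is a minimal, self-contained sketch of the gating pattern (simplified, with hypothetical names; not the actual transformers source), together with one possible fix that also honors an explicitly passed tokenizer string. Whether PR #32300 resolves it this way is not confirmed here.

    # Hypothetical, simplified sketch of the tokenizer-loading gate.
    class FakeCustomConfig:
        tokenizer_class = None  # custom model: no tokenizer_class in its config

    TOKENIZER_MAPPING = {}  # custom config types are not registered here

    def load_tokenizer_current(model_config, tokenizer_arg):
        # Current gate: only inspects the model config, never tokenizer_arg,
        # so a string tokenizer argument for a custom model is silently ignored
        # and later reaches Pipeline.__init__ as a bare str.
        return (
            type(model_config) in TOKENIZER_MAPPING
            or model_config.tokenizer_class is not None
        )

    def load_tokenizer_fixed(model_config, tokenizer_arg):
        # One possible fix (an assumption, not necessarily what PR #32300 does):
        # also load when the caller explicitly passed a tokenizer name or path.
        return (
            type(model_config) in TOKENIZER_MAPPING
            or model_config.tokenizer_class is not None
            or isinstance(tokenizer_arg, str)
        )

    config = FakeCustomConfig()
    assert load_tokenizer_current(config, "tcheda/mot_test") is False  # bug: loading skipped
    assert load_tokenizer_fixed(config, "tcheda/mot_test") is True     # string honored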

Expected behavior

The pipeline should be created correctly, loading the tokenizer as if with AutoTokenizer.from_pretrained(tokenizer). This is the behaviour described in the docs.
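Until the fix lands, the pattern from step 2 above doubles as a workaround: instantiate the tokenizer and model explicitly and pass the objects. The sketch below restates it next to the string form that should eventually behave equivalently (per the docs).

    from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

    # Workaround (works today, per step 2): instantiate explicitly, pass objects.
    tokenizer = AutoTokenizer.from_pretrained("tcheda/mot_test", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained("tcheda/mot_test", trust_remote_code=True)
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

    # Expected (per the docs): the string form should do the above internally.
    # pipe = pipeline("text-generation", model="tcheda/mot_test",
    #                 tokenizer="tcheda/mot_test", trust_remote_code=True)
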
amyeroberts commented 3 months ago

cc @Rocketknight1

LysandreJik commented 2 months ago

Gentle ping @Rocketknight1

Rocketknight1 commented 2 months ago

PR with the fix is open at #32300!