huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

transformers.pipeline does not load tokenizer passed as string for custom models #31669

Closed chedatomasz closed 1 month ago

chedatomasz commented 3 months ago

System Info

Who can help?

@Narsil

Information

Tasks

Reproduction

  1. Locate a model which
    • isn't present in TOKENIZER_MAPPING
    • doesn't specify model_config.tokenizer_class
    • nevertheless has a tokenizer on the Hub, loadable with AutoTokenizer
    These requirements mean this happens for custom models only (models not integrated into the library), AFAIK. Running such models requires trust_remote_code=True, so it might be wise to create your own example meeting these requirements. I will be using "tcheda/mot_test".
  2. Verify that the code works properly when the tokenizer and model are passed as pre-instantiated objects
    from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

    tokenizer = AutoTokenizer.from_pretrained("tcheda/mot_test", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained("tcheda/mot_test", trust_remote_code=True)
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
  3. Attempt to create a pipeline, specifying both the model and the tokenizer as strings.
    from transformers import pipeline
    pipe = pipeline("text-generation", model="tcheda/mot_test", tokenizer="tcheda/mot_test", trust_remote_code=True)
  4. The code crashes immediately with:
    
    /usr/local/lib/python3.10/dist-packages/transformers/pipelines/__init__.py in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
       1106         kwargs["device"] = device
       1107 
    -> 1108     return pipeline_class(model=model, framework=framework, task=task, **kwargs)

    /usr/local/lib/python3.10/dist-packages/transformers/pipelines/text_generation.py in __init__(self, *args, **kwargs)
         94 
         95     def __init__(self, *args, **kwargs):
    ---> 96         super().__init__(*args, **kwargs)
         97         self.check_model_type(
         98             TF_MODEL_FOR_CAUSAL_LM_MAPPING_NAMES if self.framework == "tf" else MODEL_FOR_CAUSAL_LM_MAPPING_NAMES

    /usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py in __init__(self, model, tokenizer, feature_extractor, image_processor, modelcard, framework, task, args_parser, device, torch_dtype, binary_output, **kwargs)
        895             self.tokenizer is not None
        896             and self.model.can_generate()
    --> 897             and self.tokenizer.pad_token_id is not None
        898             and self.model.generation_config.pad_token_id is None
        899         ):

    AttributeError: 'str' object has no attribute 'pad_token_id'


  5. Explanation
    The bug is probably at https://github.com/huggingface/transformers/blob/1c68f2cafb4ca54562f74b66d1085b68dd6682f5/src/transformers/pipelines/__init__.py#L907:
    `load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None`
    For a custom model, neither condition holds, and the check never considers whether `tokenizer` was passed as a string. As a result, the tokenizer initialization block (which contains proper handling of the string case) at https://github.com/huggingface/transformers/blob/1c68f2cafb4ca54562f74b66d1085b68dd6682f5/src/transformers/pipelines/__init__.py#L907 is never entered, and the raw string is passed through unchanged to the pipeline class.
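To make the control flow concrete, here is a minimal, self-contained sketch of the gating pattern (simplified, with hypothetical names; not the actual transformers source), together with one possible fix that also honors an explicitly passed tokenizer string. Whether PR #32300 resolves it this way is not confirmed here.

    # Hypothetical, simplified sketch of the tokenizer-loading gate.
    class FakeCustomConfig:
        tokenizer_class = None  # custom model: no tokenizer_class in its config

    TOKENIZER_MAPPING = {}  # custom config types are not registered here

    def load_tokenizer_current(model_config, tokenizer_arg):
        # Current gate: only inspects the model config, never tokenizer_arg,
        # so a string tokenizer argument for a custom model is silently ignored
        # and later reaches Pipeline.__init__ as a bare str.
        return (
            type(model_config) in TOKENIZER_MAPPING
            or model_config.tokenizer_class is not None
        )

    def load_tokenizer_fixed(model_config, tokenizer_arg):
        # One possible fix (an assumption, not necessarily what PR #32300 does):
        # also load when the caller explicitly passed a tokenizer name or path.
        return (
            type(model_config) in TOKENIZER_MAPPING
            or model_config.tokenizer_class is not None
            or isinstance(tokenizer_arg, str)
        )

    config = FakeCustomConfig()
    assert load_tokenizer_current(config, "tcheda/mot_test") is False  # bug: loading skipped
    assert load_tokenizer_fixed(config, "tcheda/mot_test") is True     # string honored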

Expected behavior

The pipeline should be created correctly, loading the tokenizer as if with AutoTokenizer.from_pretrained(tokenizer). This is the behaviour described in the docs.
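Until the fix lands, the pattern from step 2 above doubles as a workaround: instantiate the tokenizer and model explicitly and pass the objects. The sketch below restates it next to the string form that should eventually behave equivalently (per the docs).

    from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

    # Workaround (works today, per step 2): instantiate explicitly, pass objects.
    tokenizer = AutoTokenizer.from_pretrained("tcheda/mot_test", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained("tcheda/mot_test", trust_remote_code=True)
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

    # Expected (per the docs): the string form should do the above internally.
    # pipe = pipeline("text-generation", model="tcheda/mot_test",
    #                 tokenizer="tcheda/mot_test", trust_remote_code=True)
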
amyeroberts commented 3 months ago

cc @Rocketknight1

LysandreJik commented 2 months ago

Gentle ping @Rocketknight1

Rocketknight1 commented 2 months ago

PR with the fix is open at #32300!