huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

CLIPProcessor is not loading the saved Processor of the same version #30714

Closed humanely closed 4 months ago

humanely commented 4 months ago

System Info

Who can help?

@ArthurZucker A CLIP model trained and saved using run_clip.py can't be loaded by CLIPProcessor. I believe this is a bug.

Information

Tasks

Reproduction

processor = CLIPProcessor.from_pretrained(path, tokenizer)

where path is the training_args.output_dir used by the script.


python examples/pytorch/contrastive-image-text/run_clip.py \
    --output_dir ./clip-roberta-finetuned \
    --model_name_or_path ./clip-roberta \
    --data_dir $PWD/data \
    --dataset_name ydshieh/coco_dataset_script \
    --dataset_config_name=2017 \
    --image_column image_path \
    --caption_column caption \
    --remove_unused_columns=False \
    --do_train  --do_eval \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \
    --overwrite_output_dir \
    --push_to_hub

Expected behavior

The tokenizer and image processor saved by these lines of the script should be loadable by CLIPProcessor.

tokenizer.save_pretrained(training_args.output_dir)
image_processor.save_pretrained(training_args.output_dir)

Also, the directions in the error message fail: the tokenizer cannot be loaded from tokenizer.json, from the output directory, from the CLIP checkpoint, or from the language model.

amyeroberts commented 4 months ago

Hi @humanely, thanks for raising this issue!

Indeed, the issue here is that the script was written to use the tokenizer and image processor separately, and the two objects are never tied together into a processor.

I've opened a PR to address this - #30720

humanely commented 4 months ago

Thanks @amyeroberts.

Is there an alternative for now? I tried loading legacy tokenizers (with merges and vocab files), which works fine, but loading tokenizer.json (the fast version) doesn't. The error is:

ValueError: The backend_tokenizer provided does not match the expected format. The CLIP tokenizer has been heavily modified from transformers version 4.17.0. You need to convert the tokenizer you are using to be compatible with this version. The easiest way to do so is CLIPTokenizerFast.from_pretrained("path_to_local_folder_or_hub_repo", from_slow=True). If you want to use your existing tokenizer, you will have to revert to a version prior to 4.17.0 of transformers.

amyeroberts commented 4 months ago

@humanely If you have a saved tokenizer and image processor, you can load them then create a processor, which you can then save out. This processor can then be loaded using the normal API:

from transformers import AutoProcessor, AutoImageProcessor, AutoTokenizer, CLIPProcessor

tokenizer = AutoTokenizer.from_pretrained(training_args.output_dir)
image_processor = AutoImageProcessor.from_pretrained(training_args.output_dir)
processor = CLIPProcessor(tokenizer=tokenizer, image_processor=image_processor)

# Save out the processor
processor.save_pretrained(training_args.output_dir)

# Now you have a processor you can load
new_processor = AutoProcessor.from_pretrained(training_args.output_dir)

humanely commented 4 months ago

I get this error with this approach.


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/prabhatkr/miniforge-pypy3/envs/clip/lib/python3.12/site-packages/transformers/models/clip/processing_clip.py", line 59, in __init__
    super().__init__(image_processor, tokenizer)
  File "/home/prabhatkr/miniforge-pypy3/envs/clip/lib/python3.12/site-packages/transformers/processing_utils.py", line 96, in __init__
    raise ValueError(
ValueError: Received a PreTrainedTokenizerFast for argument tokenizer, but a ('CLIPTokenizer', 'CLIPTokenizerFast') was expected.

I also tried with the GPT2 tokenizer, and a similar error occurred.


>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> loc="/Users/home/clip-model"
>>> from transformers import AutoProcessor, AutoImageProcessor, AutoTokenizer, CLIPProcessor
>>> image_processor = AutoImageProcessor.from_pretrained(loc)
>>> processor = CLIPProcessor(tokenizer=tokenizer, image_processor=image_processor)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/transformers/models/clip/processing_clip.py", line 59, in __init__
    super().__init__(image_processor, tokenizer)
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/transformers/processing_utils.py", line 96, in __init__
    raise ValueError(
ValueError: Received a GPT2Tokenizer for argument tokenizer, but a ('CLIPTokenizer', 'CLIPTokenizerFast') was expected.

humanely commented 4 months ago

This worked. But I am not sure if it is valid. https://github.com/huggingface/tokenizers/issues/521

Basically, load the tokenizer json and save as legacy.


from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
tokenizer.model.save(output_dir)

But this only gives vocab.json and merges.txt. To get the two additional files, special_tokens_map.json and tokenizer_config.json:

Load the tokenizer JSON in AutoTokenizer and save in a separate directory. Only use the necessary 2 files.
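
Roughly, the full procedure looks like the sketch below (directory names and the tokenizer JSON filename are placeholders; PreTrainedTokenizerFast is used for the second step because AutoTokenizer needs a tokenizer_config.json to resolve the class):

import os

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

legacy_dir = "clip-legacy-tokenizer"  # placeholder output folder
tmp_dir = "tmp-tokenizer"             # placeholder scratch folder
os.makedirs(legacy_dir, exist_ok=True)

# Step 1: export vocab.json and merges.txt from the fast tokenizer's JSON file
tok = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
tok.model.save(legacy_dir)

# Step 2: wrap the same JSON in a tokenizer object and save it to a scratch
# directory; this produces special_tokens_map.json and tokenizer_config.json,
# and only those two files are then copied next to vocab.json and merges.txt
fast = PreTrainedTokenizerFast(tokenizer_file="byte-level-bpe.tokenizer.json")
fast.save_pretrained(tmp_dir)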

Is this the correct approach in your opinion? @amyeroberts

Also, I would suggest upgrading the CLIPTokenizer class to work with the fast, new-style tokenizers out of the box.

humanely commented 4 months ago

Some more legacy-related issues exist in the Processor.

inputs = processor(text=["cat","dog"], images=image, return_tensors="pt", padding=True)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/transformers/models/clip/processing_clip.py", line 106, in __call__
    encoding = self.tokenizer(text, return_tensors=return_tensors, **tokenizer_kwargs)
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2858, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2944, in _call_one
    return self.batch_encode_plus(
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3135, in batch_encode_plus
    return self._batch_encode_plus(
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 504, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
Exception: Unk token <|endoftext|> not found in the vocabulary


Whereas the tokenizer does define an unknown token:

"added_tokens": [ { "id": 0, "content": "[UNK]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true },...



The legacy tokenizers have <|endoftext|>, but the new-style ones do not.
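
A quick sanity check (a small sketch with a placeholder path) is to inspect which special tokens the saved vocabulary actually contains before handing the tokenizer to the processor; the CLIP tokenizer classes default the unknown token to <|endoftext|>, so a vocabulary that only defines [UNK] triggers exactly this error:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("clip-legacy-tokenizer")  # placeholder path
vocab = tokenizer.get_vocab()
for token in ["<|endoftext|>", "<|startoftext|>", "[UNK]"]:
    print(token, token in vocab)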

amyeroberts commented 4 months ago

I get this error with this approach.

@humanely Ah, yes, I was assuming you were using a CLIPTokenizer. If you want to bundle together a tokenizer and an image processor which aren't for CLIP, you can't put them into a CLIPProcessor, as it assumes CLIP objects are being used.

I closed my previous PR as I realised I was making the assumption that the script was just for CLIP. However, looking at the script's README.md I see that the idea is that any vision and language models can be combined, i.e. any tokenizer and image processor could be used.

amyeroberts commented 4 months ago

@humanely There are a few different questions here. Without knowing exactly what you're trying to do, and without a full reproducer, it's hard to help debug or give advice. Just from what I gather in the comments:

Basically, load the tokenizer json and save as legacy.

Why save it as legacy?

Load the tokenizer json in Autotokenizer and save in a separate directory. Only use the necessary 2 files.

Yes, the recommended way to save and load the tokenizers is using AutoTokenizer with the save_pretrained and from_pretrained methods.

I'm not sure what you mean by 'save in a separate directory' (separate from what?) and 'only use the necessary 2 files'.

Also, I would suggest to upgrade the CLIPTokenizer class to use the Fast and new type of tokenizers out of the box.

The CLIPTokenizer class is for the slow tokenizer; CLIPTokenizerFast is for the fast tokenizer. AutoTokenizer will load the fast tokenizer by default if it exists and can be loaded in the environment, otherwise it will fall back to the slow tokenizer.
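
For example (using a public Hub checkpoint purely for illustration), the default behaviour can be verified like this:

from transformers import AutoTokenizer

# The fast tokenizer is picked by default when the tokenizers backend is available
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(type(tokenizer).__name__, tokenizer.is_fast)  # CLIPTokenizerFast True

# The slow tokenizer can still be requested explicitly
slow = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32", use_fast=False)
print(type(slow).__name__, slow.is_fast)  # CLIPTokenizer False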

Legacy tokenizers have the <|endoftext|>, but not in new types.

Which "legacy" tokenizers are we talking about here?

Whether a tokenizer has <|endoftext|> will depend on its vocabulary. Some tokenizers will have it, some will not; this is a modelling decision.

humanely commented 4 months ago

My bad. I read the docs and understood CLIPTokenizer to be generic. Is there a reason for keeping CLIP based on the slow tokenizer and not the fast one?

amyeroberts commented 4 months ago

Is there a reason for keeping CLIP based off Slow and not Fast?

I'm not sure where this assumption is coming from; a fast CLIP tokenizer exists.

humanely commented 4 months ago

Thanks

humanely commented 4 months ago

Is there documentation on how to train a fast CLIP tokenizer? I built one which works fine as a pretrained fast tokenizer, but it fails to work as a CLIP tokenizer. I have attached the tokenizer file. It is for the Sanskrit language.


from transformers import PreTrainedTokenizerFast

t = PreTrainedTokenizerFast(tokenizer_file="sa-bpe-tokenizer-v1.4.json")
t.decode(t.encode("एकः बालकः धावति"))
# 'एकः बालकः धावति'

Firstly, I am unable to load this tokenizer as CLIP. Even if I generate vocab and merges and load as CLIP, the encoder only generates UNK.

After saving the vocab and merges in the sa-bpe-tokenizer-v1.4 folder:


from transformers import CLIPTokenizerFast

c = CLIPTokenizerFast.from_pretrained("sa-bpe-tokenizer-v1.4", from_slow=True)
c.decode(c.encode("एकः बालकः धावति"))
# '<|startoftext|>ए�[UNK]�[UNK]�[UNK]�[UNK]ल�[UNK]�[UNK]�[UNK]�[UNK]व�[UNK]�[UNK]<|endoftext|>'

This is not a CLIPTokenizer issue. The process to get the vocab.json and merges files from the attached tokenizer JSON seems to be wrong. Can someone help in converting this fast tokenizer to a compatible CLIP tokenizer? Or is there a way to build the CLIP tokenizer from scratch?
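
For what it's worth, one possible direction is sketched below: train the BPE model with CLIP's end-of-word suffix convention directly in the tokenizers library, so that the exported vocab.json and merges.txt follow the layout CLIPTokenizer expects. This is a rough sketch only, not an official recipe; the corpus file and output folder are hypothetical, and the reference CLIP tokenizer additionally applies byte-to-unicode mapping and text cleanup that this sketch does not reproduce.

import os

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BPE with the "</w>" end-of-word suffix used by CLIP-style vocab/merges files
tokenizer = Tokenizer(models.BPE(unk_token="<|endoftext|>", end_of_word_suffix="</w>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=16000,
    end_of_word_suffix="</w>",
    special_tokens=["<|startoftext|>", "<|endoftext|>"],
)
tokenizer.train(files=["sanskrit_corpus.txt"], trainer=trainer)  # hypothetical corpus

# Export vocab.json and merges.txt in the layout CLIPTokenizer reads
os.makedirs("sa-clip-tokenizer", exist_ok=True)
tokenizer.model.save("sa-clip-tokenizer")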

TIA

sa-bpe-tokenizer-v1.4.json