huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

clip-vit-large-patch14 image classification support #167

Closed · skaulintel closed this issue 1 year ago

skaulintel commented 1 year ago

Feature request

I am trying to run the following model with the optimum-habana repository: https://huggingface.co/openai/clip-vit-large-patch14. Do you have any suggestions for fine-tuning this model with the existing optimum-habana code? If not, could we create a new script or amend a current one so that this model is supported?

Motivation

We would like to train this particular model on Habana hardware.

Your contribution

I tried using the existing image classification script, since it supports regular ViT; here is the command from the image classification README that I used. However, I found that the AutoModelForImageClassification class invoked here does not support the CLIP config (it is not in the list of supported configs here). So I tried swapping this class for the generic AutoModel class, where CLIPConfig is supported. I get the following error with that change:

```
hmp:opt_level O1
Traceback (most recent call last):
  File "run_image_classification.py", line 410, in <module>
    main()
  File "run_image_classification.py", line 384, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/root/optimum/habana/transformers/trainer.py", line 397, in train
    return inner_training_loop(
  File "/root/optimum/habana/transformers/trainer.py", line 500, in _inner_training_loop
    self._load_optimizer_and_scheduler(resume_from_checkpoint)
  File "/root/optimum/habana/transformers/trainer.py", line 960, in _load_optimizer_and_scheduler
    self.optimizer.load_state_dict(
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 201, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
```

Upon further digging, I also see that CLIP may still be a WIP on the Transformers side: I do not see _MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES used anywhere. Let me know if this feature enablement is possible. I would be happy to work on it with some direction.
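
For reference, a minimal sketch of the class swap described above (this assumes only the standard transformers Auto classes and is not taken from run_image_classification.py):

```python
from transformers import AutoModel, AutoModelForImageClassification

model_name = "openai/clip-vit-large-patch14"

try:
    # CLIPConfig is not in the AutoModelForImageClassification mapping, so this raises.
    AutoModelForImageClassification.from_pretrained(model_name)
except ValueError as err:
    print(err)  # "Unrecognized configuration class ... CLIPConfig ..."

# The generic AutoModel class does map CLIPConfig, but it returns a CLIPModel
# (a dual-tower vision + text model), not an image classifier with a label head.
model = AutoModel.from_pretrained(model_name)
print(type(model).__name__)  # CLIPModel
```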

regisss commented 1 year ago

Hi @skaulintel! I think it comes from the fact that CLIP is a multi-modal model and deals with both images and text. Here is the example script to train it in Transformers. I'm going to quickly adapt it and will let you know when it is ready.
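
To illustrate why the image classification script is not a good fit, here is a minimal sketch of the contrastive image-text objective that CLIP is trained with (it assumes transformers and a toy one-example batch; it is not the optimum-habana training script):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

images = [Image.new("RGB", (224, 224))]   # placeholder image
captions = ["a photo of a placeholder"]   # paired caption

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)

# return_loss=True makes CLIPModel compute the symmetric contrastive loss over
# the image-text similarity matrix, i.e. the objective used when fine-tuning CLIP.
outputs = model(**inputs, return_loss=True)
outputs.loss.backward()
```

In other words, the example script linked above fine-tunes on paired images and captions rather than on per-image labels, which is why adapting it is more appropriate than reusing the image classification script.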

regisss commented 1 year ago

Solved in #168.