huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

CLIP config inconsistency issue #29687

Closed: zhiqiangdon closed this issue 5 months ago

zhiqiangdon commented 7 months ago

System Info

Python Version: 3.9.18
Operating System: Linux
Platform Machine: x86_64
Platform Version: #91~18.04.1-Ubuntu SMP Sun Aug 14 01:24:43 UTC 2022
PyTorch Version: 2.0.1+cu117
transformers version: 4.38.2

Who can help?

No response

Reproduction

from transformers import AutoConfig
config1 = AutoConfig.from_pretrained("openai/clip-vit-large-patch14-336")
config2 = AutoConfig.from_pretrained("openai/clip-vit-base-patch32")
print(config1)
print(config2)

Expected behavior

The printed contents of config1 and config2 differ substantially. For example, config2 has no image size information at all.

Config1:

CLIPConfig {
  "_name_or_path": "openai/clip-vit-large-patch14-336",
  "architectures": [
    "CLIPModel"
  ],
  "initializer_factor": 1.0,
  "logit_scale_init_value": 2.6592,
  "model_type": "clip",
  "projection_dim": 768,
  "text_config": {
    "dropout": 0.0,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "model_type": "clip_text_model",
    "num_attention_heads": 12,
    "projection_dim": 768
  },
  "torch_dtype": "float32",
  "transformers_version": "4.38.2",
  "vision_config": {
    "dropout": 0.0,
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "model_type": "clip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768
  }
}

Config2:

CLIPConfig {
  "_name_or_path": "openai/clip-vit-base-patch32",
  "architectures": [
    "CLIPModel"
  ],
  "initializer_factor": 1.0,
  "logit_scale_init_value": 2.6592,
  "model_type": "clip",
  "projection_dim": 512,
  "text_config": {
    "bos_token_id": 0,
    "dropout": 0.0,
    "eos_token_id": 2,
    "model_type": "clip_text_model"
  },
  "transformers_version": "4.38.2",
  "vision_config": {
    "dropout": 0.0,
    "model_type": "clip_vision_model"
  }
}

In transformers 4.31.0, the CLIP base and CLIP large configs contained more extensive information. In 4.38.2, the printed information is incomplete and inconsistent between the two checkpoints.

amyeroberts commented 7 months ago

Hi @zhiqiangdon,

When configs are saved out, only the parameters that differ from the default config values are serialized. So in the case of config 1, image_size is saved because it's 336, which differs from the default of 224 that config 2 uses. The loaded config objects are still complete: any key missing from the JSON is filled in with its default value at load time.
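To illustrate the behavior described above, here is a minimal, self-contained sketch of diff-based serialization (a simplified stand-in for the library's `to_diff_dict` logic, not the actual transformers implementation; the helper name and default values are hypothetical):

```python
# Hypothetical sketch: serialize only the keys whose values differ
# from the defaults, mirroring how transformers writes config.json.
DEFAULTS = {"image_size": 224, "patch_size": 32, "hidden_size": 768}

def to_diff_dict(config: dict, defaults: dict = DEFAULTS) -> dict:
    """Return only the entries that differ from the default values."""
    return {k: v for k, v in config.items() if defaults.get(k) != v}

# clip-vit-large-patch14-336: every value differs from the defaults,
# so all of them appear in the saved config.
large = {"image_size": 336, "patch_size": 14, "hidden_size": 1024}

# clip-vit-base-patch32: every value matches the defaults,
# so the saved config is empty (the keys are restored on load).
base = {"image_size": 224, "patch_size": 32, "hidden_size": 768}

print(to_diff_dict(large))  # {'image_size': 336, 'patch_size': 14, 'hidden_size': 1024}
print(to_diff_dict(base))   # {}
```

On the real library, calling `config.to_dict()` (rather than printing the config, which uses the diff representation) should show the fully populated values for both checkpoints.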

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.