mbzuai-oryx / GeoChat

[CVPR 2024 🔥] GeoChat, the first grounded Large Vision Language Model for Remote Sensing
https://mbzuai-oryx.github.io/GeoChat

Fail to load CLIP with deepspeed #12

Open liuemi001 opened 8 months ago

liuemi001 commented 8 months ago

I am using finetune_lora.sh with zero3_offload.json to train (context below) and get the following error.

Traceback (most recent call last):
  File "/deep/u/emily712/GeoChat/geochat/train/train_mem.py", line 13, in <module>
    train()
  File "/deep/u/emily712/GeoChat/geochat/train/train.py", line 886, in train
    model.get_model().initialize_vision_modules(
  File "/deep/u/emily712/GeoChat/geochat/model/geochat_arch.py", line 62, in initialize_vision_modules
    vision_tower.load_model()
  File "/deep/u/emily712/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 103, in load_model
    self.clip_interpolate_embeddings(image_size=504, patch_size=14)
  File "/deep/u/emily712/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 30, in clip_interpolate_embeddings
    n, seq_length, hidden_dim = pos_embedding.shape
ValueError: not enough values to unpack (expected 3, got 2)

Further examination shows that the issue is that the CLIP weights are not loaded at the time of positional interpolation. When I load CLIP via CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336") under deepspeed, none of the model weights are loaded (i.e., each weight is a tensor of size zero). Running

from transformers import CLIPVisionModel
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
state_dict = vision_tower.vision_model.embeddings.position_embedding.state_dict()
pos_embedding = state_dict['weight']
print("pos embedding shape: ", pos_embedding.shape)

with deepspeed inside the CLIPVisionTower.load_model() method prints torch.Size([0]), whereas running the same lines of code in a program without deepspeed, or in a Python shell, yields torch.Size([577, 1024]), which is the correct size.
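
One possible explanation is that the weights are in fact loaded but partitioned by ZeRO-3: with a zero3 config, transformers partitions parameters during from_pretrained, so each rank only holds a shard and the local tensor reports torch.Size([0]). A way to check this from inside CLIPVisionTower.load_model() while training under deepspeed (a sketch using deepspeed.zero.GatheredParameters, not part of the repo's code):

# Sketch: if ZeRO-3 is active, from_pretrained partitions the weights, so the
# local tensor is empty; gathering the parameter should recover the full
# (577, 1024) position embedding on every rank.
import deepspeed
from transformers import CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
pos_embedding = vision_tower.vision_model.embeddings.position_embedding.weight

print("local shape:", pos_embedding.shape)  # torch.Size([0]) when partitioned
with deepspeed.zero.GatheredParameters(pos_embedding, modifier_rank=None):
    print("gathered shape:", pos_embedding.shape)  # torch.Size([577, 1024]) if the weights were loaded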

Expected behavior

pos_embedding should have shape torch.Size([577, 1024]).

ds_report output

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/deep/group/aicc-bootcamp/packages/miniconda3/envs/vllava/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/deep/group/aicc-bootcamp/packages/miniconda3/envs/vllava/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.13.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
shared memory (/dev/shm) size .... 251.77 GB

System info (please complete the following information):

KjAeRsTuIsK commented 8 months ago

Hi @liuemi001, can you try training with the zero2 config once? I faced a similar issue and it works with the zero2 config.
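
For reference, a minimal ZeRO stage-2 configuration of the kind suggested here, written as a Python dict and dumped to zero2.json (the values are illustrative assumptions, not the repo's shipped config):

# Sketch: a minimal ZeRO stage-2 config using the HF Trainer "auto"
# placeholders. Illustrative only; prefer the repo's own zero2 config if one
# is provided.
import json

zero2_config = {
    "bf16": {"enabled": "auto"},
    "fp16": {"enabled": "auto"},
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,  # shard optimizer states and gradients; parameters stay replicated
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("zero2.json", "w") as f:
    json.dump(zero2_config, f, indent=2)

The finetune script can then be pointed at this file in place of zero3_offload.json.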

liuemi001 commented 8 months ago

Hi @KjAeRsTuIsK, it works for me with zero2 as well, but because of our compute constraints, we are forced to use zero3 and zero3_offload. Is it possible to get it working with these configs as well? Thanks!
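
One untested direction for zero3 would be to gather the partitioned parameter before its shape is read inside clip_interpolate_embeddings, roughly along these lines (a sketch only; the names mirror the traceback above, and the interpolation logic itself is elided):

# Untested sketch: wrap the positional-embedding access in
# deepspeed.zero.GatheredParameters so the full tensor is visible under ZeRO-3.
# Names mirror the traceback above; the actual clip_encoder.py may differ.
import deepspeed

def clip_interpolate_embeddings_zero3_safe(vision_tower, image_size=504, patch_size=14):
    weight = vision_tower.vision_model.embeddings.position_embedding.weight
    # modifier_rank=0 makes in-place modifications done on rank 0 persistent
    # when the context exits (relevant if the embedding is resized in place).
    with deepspeed.zero.GatheredParameters(weight, modifier_rank=0):
        pos_embedding = weight.unsqueeze(0)  # (1, seq_length, hidden_dim)
        n, seq_length, hidden_dim = pos_embedding.shape
        # ... existing interpolation logic from clip_interpolate_embeddings ...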

kartikey9254 commented 4 months ago


Hi there, I am trying out this model and the demo worked, but when I used the lora.sh script for training it displays: OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory /home/LLaVA/llava-v1.5-13b-lora. Can you guide me on how I can train this model?

732259408 commented 4 months ago

Hi, I am using zero3 for training and I also encountered the error: ValueError: not enough values to unpack (expected 3, got 2). Did you resolve this issue? How can I resolve it? @liuemi001