[Question] Pickle error when perform SQA stage2 finetune

JulioZhao97 commented 1 year ago

Question

Hello, thanks for your great work!

Where can I find the llava-13b-pretrain-no_im_start_end_token.bin model? I went to the repo you pointed to, but only found LLaVA-13b-pretrain-projector-v0-CC3M-595K-original_caption-no_im_token.bin.

I then fine-tuned with the following command:

srun -p llm_exp --gres=gpu:8 --quotatype=auto torchrun --nnodes=1 --nproc_per_node=8 --master_port=$RANDOM \
    llava/train/train_mem.py \
    --model_name_or_path ~/MiniGPT-4/vicuna_weight \
    --data_path ./sqa/llava_train_QCM-LEPA.json \
    --image_folder ./sqa/train \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/mm_projector/LLaVA-13b-pretrain-projector-v0-CC3M-595K-original_caption-no_im_token.bin \
    --mm_vision_select_layer -2 \
    --bf16 True \
    --output_dir ./checkpoints/llava-13b-pretrain-no_im_start_end_token-finetune_scienceqa \
    --num_train_epochs 12 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

It gives this error:

Traceback (most recent call last):
  File "/mnt/petrelfs/zhaozhiyuan/mllm/LLaVA/llava/train/train_mem.py", line 13, in <module>
    train()
  File "/mnt/petrelfs/zhaozhiyuan/mllm/LLaVA/llava/train/train.py", line 439, in train
    mm_projector_weights = torch.load(model_args.pretrain_mm_mlp_adapter, map_location='cpu')
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/llava/lib/python3.10/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/llava/lib/python3.10/site-packages/torch/serialization.py", line 920, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.

It seems that I loaded the wrong model. Could you please provide the llava-13b-pretrain-no_im_start_end_token.bin model?

Thanks so much!

haotian-liu commented 1 year ago

Hi, this is the correct checkpoint. Can you try downloading this again from here? I just tried downloading it myself, and I was able to load it successfully. Thanks.

JulioZhao97 commented 1 year ago

Thanks, the problem is solved. The error occurred because I used wget to download the checkpoint, which is easy but silly. Thanks for your patience.
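For anyone hitting the same `invalid load key, '<'` message: the unpickler is reporting that the first byte of the file is `<`, which usually means a web page was saved in place of the binary checkpoint. A minimal sketch of a pre-flight check, assuming that is the failure mode; the function name `looks_like_markup` and the example path are hypothetical and not part of LLaVA:

```python
def looks_like_markup(path, probe_bytes=64):
    """Return True if the file starts with '<' (after leading whitespace).

    A checkpoint written by torch.save is binary data, so a leading '<'
    suggests an HTML or XML page was downloaded instead.
    """
    with open(path, "rb") as f:
        head = f.read(probe_bytes)  # read only a small prefix, not the whole file
    return head.lstrip().startswith(b"<")


# Hypothetical usage before calling torch.load:
# if looks_like_markup("./checkpoints/mm_projector/projector.bin"):
#     raise RuntimeError("Downloaded file looks like a web page; download it again.")
```

This only inspects the first few bytes, so it is cheap even for large files, but it cannot confirm that a file is a valid checkpoint; it only flags the specific case seen in the traceback above.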

Can I ask another question? I keep getting another error:

I looked into this, and some say it is a conflict between torch==1.12 and transformers==4.28.0.

But after I upgraded torch to 1.13, another strange error occurred in flash-attention:

So may I ask what your environment is (CUDA, torch, and transformers versions)? Thanks so much.

haotian-liu commented 1 year ago

Hi, I use CUDA 11.7 and PyTorch 2.0. Here is an environment provided by a user that works with PyTorch 1.13.1: https://github.com/haotian-liu/LLaVA/issues/102#issuecomment-1537465794.

Please note that the transformers version should be this.

I would recommend creating a new environment and reinstalling everything, which may be easier.
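After rebuilding the environment, it can help to confirm which versions were actually installed before launching training. A small sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names `torch`, `transformers`, and `flash-attn` are assumptions about how the packages were installed:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names below are assumptions; adjust to match your install.
    report = installed_versions(["torch", "transformers", "flash-attn"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

This reports package versions only. The CUDA version that PyTorch was built against is available separately as `torch.version.cuda` once torch is importable.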

JulioZhao97 commented 1 year ago

Thanks! I am trying it now. Is this transformers version correct?

haotian-liu commented 1 year ago

Yes, it is correct.

JulioZhao97 commented 1 year ago

Finally, I am able to fine-tune the model using cuda==11.7 and torch==2.1.0. Thanks for your patience, Liu!

haotian-liu / LLaVA

[Question] Pickle error when perform SQA stage2 finetune #130
