magic-research / PLLaVA

Official repository for the paper PLLaVA

No output text when evaluating on my own pre-trained model. #45

Closed gaowei724 closed 5 months ago

gaowei724 commented 6 months ago

Hello, I have encountered a problem similar to issue #43. I used the author-provided pllava-7b as the pre-trained model and continued fine-tuning on my own video Q&A dataset (during training I trained the projector and the LM but froze the vision model, using the default LoRA configuration). Throughout training, the video loss consistently decreased. Training with the original code produced three kinds of output folders, e.g. ckpt_epoch100, pretrained_epoch100, and pretrained_step100. Reading the code, I suspect that ./pretrained_epoch100 contains the saved projector and the language model's LoRA parameters. I executed the following command for evaluation on MVBench:

python tasks/eval/mvbench/pllava_eval_mvbench.py --pretrained_model_name_or_path MODELS/pllava-7b \
--save_path test_results/test_pllava_7b/mvbench --use_lora --lora_alpha 32 --num_frames 16 \
--weight_dir pretrained_epoch100 --conv_mode eval_mvbench

However, the accuracy is always 0 because the output text is consistently empty. After some debugging I verified my LoRA configuration (the same as during training, lora_alpha=32). I suspected that my LoRA training had collapsed, so I tried setting lora_alpha to 0, but that made no difference. I'm not sure whether my training actually collapsed, whether I've missed some critical hyperparameter, or whether I've misunderstood the evaluation process. Can you provide me with some clues?
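(As a sanity check, here is a small sketch of how the contents of pretrained_epoch100 can be listed; it assumes the weights are stored as safetensors shards and/or PyTorch .bin/.pt state dicts, and the key prefixes below are guesses based on the HF Llava naming, not something confirmed from the repo:)

# Sketch only: list which parameter groups were saved in pretrained_epoch100.
# Assumes safetensors shards and/or plain PyTorch state dicts; adjust paths
# and prefixes (guessed from HF Llava naming) to match the actual files.
import glob
import torch
from safetensors.torch import load_file

keys = set()
for f in glob.glob("pretrained_epoch100/*.safetensors"):
    keys.update(load_file(f).keys())
for f in glob.glob("pretrained_epoch100/*.bin") + glob.glob("pretrained_epoch100/*.pt"):
    keys.update(torch.load(f, map_location="cpu").keys())

print(f"{len(keys)} tensors saved")
for prefix in ("multi_modal_projector", "lora", "language_model", "vision_tower"):
    print(prefix, sum(prefix in k for k in keys))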

My training configuration is as follows:

"train_file": [
    [
      "xxx/Code/PLLaVA/DATAS/TRAIN_TEST/magic_jsons/video/reasoning/system3/train.json",
      "/lpai/volumes/autopilot-perception-ai-lf/gaowei/Code/PLLaVA/DATAS/TRAIN_TEST/videos/system3",
      "video"
    ]
  ],
  "test_file": {},
  "test_types": [],
  "num_workers": 8,
  "save_steps": 625,
  "ckpt_steps": 62,
  "stop_key": null,
  "deepspeed": false,
  "num_frames": 16,
  "num_frames_test": 1,
  "batch_size": 16,
  "gradient_accumulation_steps": 1,
  "max_txt_l": 512,
  "max_train_steps": null,
  "pre_text": false,
  "gradient_checkpointing": true,
  "inputs": {
    "image_res": 336,
    "video_input": {
      "num_frames": 16,
      "sample_type": "rand",
      "num_frames_test": 1,
      "sample_type_test": "middle",
      "random_aug": false
    },
    "max_txt_l": {
      "image": 512,
      "video": 512
    },
    "batch_size": {
      "image": 16,
      "video": 16
    },
    "batch_size_test": {
      "image": 16,
      "video": 16
    }
  },
  "model": {
    "repo_id": "xxx/Code/PLLaVA/MODELS/llava-v1.6-vicuna-7b-hf",
    "pretrained_path": "xxx/Code/PLLaVA/MODELS/pllava-7b",
    "load_from_origin": false,
    "origin_vision": "",
    "origin_llm": "",
    "vision_encoder": {
      "name": "vit_l14"
    },
    "torch_dtype": "bfloat16",
    "freeze_projector": false,
    "freeze_lm": false,
    "freeze_vision_tower": true,
    "lora_target_modules": [
      "q_proj",
      "v_proj"
    ],
    "use_lora": true,
    "lora_r": 128,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "num_frames": 16,
    "pooling_method": "avg",
    "use_pooling": true,
    "frame_shape": [
      24,
      24
    ],
    "pooling_shape": [
      16,
      12,
      12
    ]
  },
  "preprocess": {
    "system": "",
    "mm_alone": false,
    "random_shuffle": false,
    "add_second_msg": false,
    "roles": [
      "USER:",
      "ASSISTANT:"
    ],
    "end_signal": [
      " ",
      "</s>"
    ],
    "begin_signal": "",
    "dataset_image_placeholder": "<Image></Image>",
    "dataset_video_placeholder": "<Video></Video>",
    "image_token_index": 32000,
    "max_txt_l": 512,
    "ignore_index": -100,
    "center_pad": false,
    "longest_edge": 762,
    "shortest_edge": 336,
    "clip_transform": false,
    "num_frames": 16
  },
  "optimizer": {
    "opt": "adamW",
    "lr": 2e-05,
    "opt_betas": [
      0.9,
      0.999
    ],
    "weight_decay": 0.02,
    "max_grad_norm": -1,
    "different_lr": {
      "enable": false,
      "module_names": [],
      "lr": 0.001
    }
  },
  "scheduler": {
    "is_videochat2_custom": true,
    "sched": "cosine",
    "epochs": 300,
    "warmup_ratio": 0.2,
    "min_lr_multi": 0.25
  },
  "evaluate": false,
  "deep_fusion": false,
  "evaluation": {
    "eval_frame_ensemble": "concat",
    "eval_x_only": false,
    "k_test": 128,
    "eval_offload": true
  },
  "fp16": true,
  "wandb": {
    "enable": false,
    "entity": "user",
    "project": "videochat2"
  },
  "dist_url": "env://",
  "device": "cuda",
  "mode": "it",
  "output_dir": "/lpai/output/models/",
  "tensorboard_dir": "/lpai/output/",
  "resume": false,
  "debug": false,
  "log_freq": 5,
  "log_epoch_freq": 100,
  "metric_window_size": 10,
  "seed": 42,
  "report_to": "tensorboard",
  "save_latest": true,
  "auto_resume": true,
  "pretrained_path": "",
  "rank": 0,
  "world_size": 8,
  "gpu": 0,
  "distributed": true,
  "dist_backend": "nccl"
}
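(A note on the LoRA settings above: in PEFT the adapter update is scaled by lora_alpha / lora_r, so r=128 with lora_alpha=32 gives a scaling of 0.25, and setting lora_alpha to 0 at evaluation simply zeroes out the adapter's contribution rather than disabling LoRA loading, so it is expected that this did not help. Below is a minimal sketch of the equivalent PEFT config, assuming the repo builds its LoRA adapter through PEFT:)

# Illustration only: the LoRA hyperparameters from the config above expressed
# as a PEFT LoraConfig. The adapter output is scaled by lora_alpha / r.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
print(lora_config.lora_alpha / lora_config.r)  # 0.25; alpha=0 would zero the update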

My inference evaluation command is as follows:

python tasks/eval/mvbench/pllava_eval_mvbench.py --pretrained_model_name_or_path MODELS/pllava-7b \
--save_path test_results/test_pllava_7b/mvbench --use_lora --lora_alpha 32 --num_frames 16 \
--weight_dir pretrained_epoch100 --pooling_shape 16-12-12 --conv_mode eval_mvbench

The results of the evaluation are as follows:

root@pllava-0:xxx/Code/PLLaVA#  /usr/bin/env /usr/bin/python3 /root/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 59113 -- xxx/Code/PLLaVA/tasks/eval/mvbench/pllava_eval_mvbench.py --pretrained_model_name_or_path MODELS/pllava-7b --save_path test_results/test_pllava_7b/mvbench --num_frames 16 --use_lora --lora_alpha 16 --weight_dir xxx/models/pllava/gw-pllava-24-05-21-6352/pretrained_epoch100 --pooling_shape 16-12-12 --conv_mode eval_mvbench 
INFO:__main__:loading model and constructing dataset to gpu 0...
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.38it/s]
Some weights of the model checkpoint at MODELS/pllava-7b were not used when initializing PllavaForConditionalGeneration: ...)
INFO:__main__:done loading llava
INFO:__main__:done model and dataset...
INFO:__main__:constructing dataset...
INFO:__main__:single test...
WARNING:py.warnings:/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:397: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(

### PROMPTING LM WITH:  Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, select the best option that accurately addresses the question.
 USER: <image>
 USER: Describe the video in details. ASSISTANT:
### LM OUTPUT TEXT:   Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, select the best option that accurately addresses the question.
 USER:  
 USER: Describe the video in details. ASSISTANT:
INFO:__main__:single test done...
  0%|                                                                                                                       | 0/4000 [00:00<?, ?it/s]model doesn't follow instructions: (
One Chunk--Task Type: Action Sequence, Chunk Part  Acc: 0.00%; Chunk Total Acc: 0.00%:   0%|                     | 1/4000 [00:37<41:08:36, 37.04s/it]model doesn't follow instructions: (
One Chunk--Task Type: Action Sequence, Chunk Part  Acc: 0.00%; Chunk Total Acc: 0.00%:   0%|                     | 2/4000 [01:12<40:25:34, 36.40s/it]model doesn't follow instructions: (
One Chunk--Task Type: Action Sequence, Chunk Part  Acc: 0.00%; Chunk Total Acc: 0.00%:   0%|                     | 3/4000 [01:50<40:45:12, 36.71s/it]

The training TensorBoard log:

[screenshot-20240521-221418: TensorBoard training curves]

ermu2001 commented 6 months ago

Hi, if you haven't used DeepSpeed for training, then the checkpoint saved at pretrained_epoch100 should only contain the LoRA weights and projector weights; check whether that's the case. In this situation, the demo would load from MODELS/pllava-7b (which isn't compatible with the original non-LoRA PllavaModel's from_pretrained method), so the language model would be freshly initialized. Next it loads the weights in pretrained_epoch100, which also doesn't contain the language model's weights.

In this case, I think you should set pretrained_model_name_or_path to llava-hf/llava-v1.6-vicuna-7b-hf if you are only doing LoRA training and projector training.
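To illustrate the point, here is a conceptual sketch of the usual base-plus-adapter loading pattern in plain transformers/PEFT. This is not PLLaVA's actual loading code, and it assumes pretrained_epoch100 were a standard PEFT adapter directory: a LoRA checkpoint carries only the adapter (and here the projector) deltas, so the base weights have to come from a complete checkpoint.

# Conceptual sketch only, not PLLaVA's loading code. Assumes a standard
# PEFT-format adapter directory, which the repo's save format may not match.
import torch
from transformers import LlavaNextForConditionalGeneration
from peft import PeftModel

# Full base checkpoint: supplies the language model and vision tower weights
# that the adapter directory does not contain.
base = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-vicuna-7b-hf", torch_dtype=torch.bfloat16
)
# Adapter directory from training (LoRA + projector deltas only).
model = PeftModel.from_pretrained(base, "pretrained_epoch100")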

gaowei724 commented 6 months ago

> Hi, if you haven't used DeepSpeed for training, then the checkpoint saved at pretrained_epoch100 should only contain the LoRA weights and projector weights; check whether that's the case. In this situation, the demo would load from MODELS/pllava-7b (which isn't compatible with the original non-LoRA PllavaModel's from_pretrained method), so the language model would be freshly initialized. Next it loads the weights in pretrained_epoch100, which also doesn't contain the language model's weights.
>
> In this case, I think you should set pretrained_model_name_or_path to llava-hf/llava-v1.6-vicuna-7b-hf if you are only doing LoRA training and projector training.

Thank you. Regarding your first point, I indeed did not use DeepSpeed, so pretrained_epoch100 only contains the lora_language and projector weights. This is because my model training settings were as follows (without DeepSpeed):

"model": {
    "repo_id": "xxx/gaowei/Code/PLLaVA/MODELS/llava-v1.6-vicuna-7b-hf",
    "pretrained_path": "xxx/Code/PLLaVA/MODELS/pllava-7b",
    "load_from_origin": false,
    "origin_vision": "",
    "origin_llm": "",
    "vision_encoder": {
      "name": "vit_l14"
    },
    "torch_dtype": "bfloat16",
    "freeze_projector": false,
    "freeze_lm": false,
    "freeze_vision_tower": true,
    "lora_target_modules": [
      "q_proj",
      "v_proj"
    ],
    "use_lora": true,
    "lora_r": 128,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "num_frames": 16,
    "pooling_method": "avg",
    "use_pooling": true,
    "frame_shape": [
      24,
      24
    ],
    "pooling_shape": [
      16,
      12,
      12
    ]
  },

When training the model, I actually first loaded MODELS/llava-v1.6-vicuna-7b-hf and then loaded MODELS/pllava-7b, so my first attempt at the inference command was (trying to load the LM's and VM's params from pllava-7b):

python tasks/eval/mvbench/pllava_eval_mvbench.py --pretrained_model_name_or_path MODELS/pllava-7b \
--save_path test_results/test_pllava_7b/mvbench --use_lora --lora_alpha 32 --num_frames 16 \
--weight_dir pretrained_epoch100 --conv_mode eval_mvbench

That is, I first loaded MODELS/pllava-7b and then loaded my trained pretrained_epoch100. Thanks to your hint, I re-examined the output log and found a message in the log stating:

Some weights of PllavaForConditionalGeneration were not initialized from the model checkpoint at MODELS/pllava-7b and are newly initialized: ['language_model.lm_head.weight' .....

This should be as you mentioned, "pllava-7b wasn't compatible with the original non-lora PllavaModel's from_pretrained method", so some of the lm parameters were re-initialized.

Following your guidance, I changed the evaluation command to:

python tasks/eval/mvbench/pllava_eval_mvbench.py --pretrained_model_name_or_path MODELS/llava-v1.6-vicuna-7b-hf \
--save_path test_results/test_pllava_7b/mvbench --use_lora --lora_alpha 32 --num_frames 16 \
--weight_dir pretrained_epoch100 --conv_mode eval_mvbench

This time, the output text was no longer empty, so I think your guess was correct.

However, this command loads the entire base model from MODELS/llava-v1.6-vicuna-7b-hf, while the model I wanted to test was fine-tuned from pllava-7b. That is not consistent with my training process, since no pllava-7b parameters are loaded at all. Could you tell me whether MODELS/llava-v1.6-vicuna-7b-hf and pllava-7b have exactly the same LM and VM weights, aside from the LoRA and projector parts? If so, skipping the loading of pllava-7b would be reasonable.
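(If it helps, here is a rough sketch of how the two checkpoints could be diffed directly without instantiating the models. It assumes both are local directories of sharded safetensors files with matching parameter names, and loading both state dicts on CPU needs on the order of 30 GB of RAM:)

# Rough sketch: compare tensors that appear in both checkpoints.
# Assumes sharded *.safetensors files and matching parameter names.
import glob
import torch
from safetensors.torch import load_file

def load_state_dict(model_dir):
    state = {}
    for shard in sorted(glob.glob(f"{model_dir}/*.safetensors")):
        state.update(load_file(shard))
    return state

llava = load_state_dict("MODELS/llava-v1.6-vicuna-7b-hf")
pllava = load_state_dict("MODELS/pllava-7b")

shared = sorted(set(llava) & set(pllava))
differing = [k for k in shared if not torch.equal(llava[k], pllava[k])]
print(f"{len(shared)} shared tensors, {len(differing)} differ")
print("only in llava-1.6:", len(set(llava) - set(pllava)))
print("only in pllava-7b:", len(set(pllava) - set(llava)))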

ermu2001 commented 6 months ago

Yep, the base weights in llava-1.6 are the same as in pllava. We did not train the base parts of the language model and the vision model.

liuao743 commented 5 months ago

> Yep, the base weights in llava-1.6 are the same as in pllava. We did not train the base parts of the language model and the vision model.

I do full fine-tuning on all layers of the network, including the vision encoder, projection layer, and language model. I don't use DeepSpeed. In the checkpoint_dir, I get .pt files as shown below. How can I use my weights after training?

[screenshot: 1716519539416]

gaowei724 commented 5 months ago

> Yep, the base weights in llava-1.6 are the same as in pllava. We did not train the base parts of the language model and the vision model.

Thx.