turingmotors / heron


Enable Multi Layer Perceptron (MLP) selection for projector #25

Open Onely7 opened 1 year ago

Onely7 commented 1 year ago

Enable Multi Layer Perceptron (MLP) selection for projector

First of all, thank you for creating such an amazing project! This repository has become very useful for me.

Changes

In this PR, I have modified the code so that the projector can be a Multi-Layer Perceptron (MLP) when model_type: git_llm is selected. Previously, with model_type: git_llm, a single Linear layer was used as the projector connecting the Vision model and the LLM. Inspired by LLaVA v1.5 【Liu+'23 Improved Baselines with Visual Instruction Tuning】, I have added code that lets you vary the number of these Linear layers simply by adding an option (mlp_adapter) under model_config in projects/OOO/OO.yml. The main details of how the projector is changed to an MLP are in heron/models/mlp_adapter.py. This code follows the LLaVA v1.5 implementation ( https://github.com/haotian-liu/LLaVA/blob/785f766fcddc86ffeaa62cd51cf7834a11c04e6d/llava/model/multimodal_projector/builder.py#L33 ).
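To give a feel for what the mlp_adapter option does, here is a minimal sketch in the spirit of the LLaVA v1.5 builder linked above. The function name, argument names, and error handling are illustrative only and may differ from the actual code in heron/models/mlp_adapter.py.

import re
import torch.nn as nn

def build_projector(mlp_adapter, vision_hidden_size, llm_hidden_size):
    """Illustrative sketch: build the Vision->LLM projector from a name
    such as 'mlp2x_gelu'. See heron/models/mlp_adapter.py for the actual
    implementation added in this PR."""
    if mlp_adapter is None:
        # Default behaviour (no mlp_adapter in the yml): a single Linear layer, as before.
        return nn.Linear(vision_hidden_size, llm_hidden_size)

    match = re.match(r"^mlp(\d+)x_gelu$", mlp_adapter)
    if match is None:
        raise ValueError(f"Unknown mlp_adapter: {mlp_adapter}")

    # 'mlpNx_gelu' -> N Linear layers with GELU activations in between.
    depth = int(match.group(1))
    layers = [nn.Linear(vision_hidden_size, llm_hidden_size)]
    for _ in range(1, depth):
        layers.append(nn.GELU())
        layers.append(nn.Linear(llm_hidden_size, llm_hidden_size))
    return nn.Sequential(*layers)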

Also, to maintain backward compatibility, existing projects/OOO/OO.yml configs behave exactly as before.

For example, if you use projects/llama/exp001.yml as is:

training_config:
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 4
  num_train_epochs: 1
  dataloader_num_workers: 16
  fp16: true
  optim: "adamw_torch"
  learning_rate: 5.0e-5
  logging_steps: 100
  evaluation_strategy: "steps"
  save_strategy: "steps"
  eval_steps: 4000
  save_steps: 4000
  save_total_limit: 1
  deepspeed: ./configs/deepspeed/ds_config_zero1.json
  output_dir: ./output/
  report_to: "wandb"

model_config:
  fp16: true
  pretrained_path: # None or path to model weight
  model_type: git_llm
  language_model_name: meta-llama/Llama-2-7b-chat-hf
  vision_model_name: openai/clip-vit-base-patch16
  num_image_with_embedding: 1 # if 1, no img_temporal_embedding
  max_length: 512
  keys_to_finetune:
    - visual_projection
    - num_image_with_embedding
  keys_to_freeze: []

  use_lora: true
  lora:
    r: 8
    lora_alpha: 32
    target_modules:
      - q_proj
      - k_proj
      - v_proj
    lora_dropout: 0.01
    bias: none
    task_type: CAUSAL_LM

dataset_config_path:
  - ./configs/datasets/m3it.yaml

a single Linear layer will be applied as the projector, exactly as before.

If you want to change the projector to an MLP, add an mlp_adapter entry to model_config in projects/llama/exp001.yml and set it to mlp2x_gelu:

training_config:
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 4
  num_train_epochs: 1
  dataloader_num_workers: 16
  fp16: true
  optim: "adamw_torch"
  learning_rate: 5.0e-5
  logging_steps: 100
  evaluation_strategy: "steps"
  save_strategy: "steps"
  eval_steps: 4000
  save_steps: 4000
  save_total_limit: 1
  deepspeed: ./configs/deepspeed/ds_config_zero1.json
  output_dir: ./output/
  report_to: "wandb"

model_config:
  fp16: true
  pretrained_path: # None or path to model weight
  model_type: git_llm
  mlp_adapter: mlp2x_gelu # projector will be a 2-layer MLP.
  language_model_name: meta-llama/Llama-2-7b-chat-hf
  vision_model_name: openai/clip-vit-base-patch16
  num_image_with_embedding: 1 # if 1, no img_temporal_embedding
  max_length: 512
  keys_to_finetune:
    - visual_projection
    - num_image_with_embedding
  keys_to_freeze: []

  use_lora: true
  lora:
    r: 8
    lora_alpha: 32
    target_modules:
      - q_proj
      - k_proj
      - v_proj
    lora_dropout: 0.01
    bias: none
    task_type: CAUSAL_LM

dataset_config_path:
  - ./configs/datasets/m3it.yaml

In the example above, adding mlp_adapter: mlp2x_gelu under model_config turns the projector into a 2-layer MLP. If you want 3 layers instead, simply change it to mlp_adapter: mlp3x_gelu and you get a 3-layer MLP.
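For concreteness, with the illustrative builder sketched earlier, mlp3x_gelu would expand to Linear → GELU → Linear → GELU → Linear. The hidden sizes below are just example values for a CLIP ViT-base vision tower (768) and a Llama-2-7B LLM (4096); the real module is constructed inside the model from the config.

# Illustrative usage of the sketch above, not the actual heron API.
projector = build_projector("mlp3x_gelu", vision_hidden_size=768, llm_hidden_size=4096)
print(projector)
# Sequential(
#   (0): Linear(in_features=768, out_features=4096, bias=True)
#   (1): GELU(approximate='none')
#   (2): Linear(in_features=4096, out_features=4096, bias=True)
#   (3): GELU(approximate='none')
#   (4): Linear(in_features=4096, out_features=4096, bias=True)
# )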