modelscope / ms-swift

Use PEFT or full-parameter training to fine-tune 300+ LLMs or 80+ MLLMs. (Qwen2, GLM4v, Internlm2.5, Yi, Llama3.1, Llava-Video, Internvl2, MiniCPM-V-2.6, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Phi3.5-vision-instruct fine-tuning best practices. (Latex OCR Fine-tuning) #1809

Open Jintao-Huang opened 2 weeks ago

Jintao-Huang commented 2 weeks ago

Huggingface Model: https://huggingface.co/microsoft/Phi-3.5-vision-instruct

Fine-tuned Dataset: https://huggingface.co/datasets/linxy/LaTeX_OCR

Fine-tuning a multimodal large model usually involves a custom dataset; here we demonstrate a runnable end-to-end demo.

Before starting the fine-tuning, please ensure that your environment is properly prepared.

git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .[llm]
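
As a quick, optional sanity check before launching anything (a minimal snippet; it assumes PyTorch was installed by the step above):

# Optional environment check
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())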

Inference

# ModelScope
CUDA_VISIBLE_DEVICES=0 swift infer \
  --model_type phi3_5-vision-instruct \
  --use_flash_attn false

# HuggingFace
USE_HF=1 CUDA_VISIBLE_DEVICES=0 swift infer \
  --model_type phi3_5-vision-instruct \
  --model_id_or_path microsoft/Phi-3.5-vision-instruct \
  --use_flash_attn false

Results

<<< who are you
I am Phi, an AI developed by Microsoft to assist with providing information, answering questions, and helping users find solutions to their queries. How can I assist you today?
--------------------------------------------------
<<< <image>please describe the image.
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
The image features a close-up of a kitten with striking blue eyes and a white and grey striped coat. The kitten's fur is soft and fluffy, and it appears to be looking directly at the camera with a curious and innocent expression. The background is blurred, which puts the focus entirely on the kitten's face.
--------------------------------------------------
<<<  <image>What is the result of the calculation?
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The result of the calculation 1452 + 45304 is 46756.

GPU Memory:

[screenshot: GPU memory usage]

Jintao-Huang commented 2 weeks ago

Fine-tuning

The format of the custom dataset is as follows (single image, multiple images, and no image); a minimal script for writing such files is sketched after the examples:

{"query": "<image>55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee<image>eeeee<image>eeeee", "response": "fffff", "history": [], "images": ["image_path1", "image_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response2"], ["query2", "response2"]], "images": []}

Fine-tuning script:

# To modify num_crops, you can use the environment variable: `NUM_CROPS=16` (default is 4).
# ModelScope
CUDA_VISIBLE_DEVICES=0,1,2,3 NPROC_PER_NODE=4 swift sft \
  --model_type phi3_5-vision-instruct \
  --sft_type lora \
  --dataset latex-ocr-print#20000 \
  --deepspeed default-zero2 \
  --output_dir output \
  --num_train_epochs 5 \
  --use_flash_attn false

# HuggingFace
USE_HF=1 CUDA_VISIBLE_DEVICES=0,1,2,3 NPROC_PER_NODE=4 swift sft \
  --model_type phi3_5-vision-instruct \
  --model_id_or_path microsoft/Phi-3.5-vision-instruct \
  --sft_type lora \
  --dataset latex-ocr-print#20000 \
  --deepspeed default-zero2 \
  --output_dir output \
  --num_train_epochs 5 \
  --use_flash_attn false

If you want to use a custom dataset, simply specify it as follows:

  --dataset train.jsonl \
  --val_dataset val.jsonl \

One of the data samples:

[screenshot: data sample]

Number of trainable parameters:

[screenshot: number of trainable parameters]

GPU Memory

[screenshot: GPU memory usage]

Training process

[screenshot: training process]

Train Loss (Due to time constraints, we only fine-tuned for 1000 steps):

[screenshot: train loss curve]

Here is the inference script after fine-tuning; we perform inference on the automatically split validation set:

# To run a full test, please set: `--show_dataset_sample -1`
# If using HuggingFace, please add: `USE_HF=1`
# inference only
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/phi3_5-vision-instruct/vx-xxx/checkpoint-xxx \
    --load_dataset_config true --use_flash_attn false

# merge-lora & inference
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/phi3_5-vision-instruct/vx-xxx/checkpoint-xxx \
    --load_dataset_config true --merge_lora true \
    --safe_serialization false --use_flash_attn false
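
After `--merge_lora true`, the merged checkpoint can also be loaded directly with `transformers`, following the usage on the Phi-3.5-vision-instruct model card. The sketch below is only an illustration: the checkpoint path and image path are placeholders, and it assumes the merged directory contains the full model and processor files.

# Minimal sketch: run the merged checkpoint with transformers.
# The checkpoint and image paths are placeholders.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

ckpt = "output/phi3_5-vision-instruct/vx-xxx/checkpoint-xxx-merged"
model = AutoModelForCausalLM.from_pretrained(
    ckpt, trust_remote_code=True, torch_dtype="auto", device_map="cuda",
    _attn_implementation="eager",  # matches --use_flash_attn false
)
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True, num_crops=4)

image = Image.open("formula.png")
messages = [{"role": "user", "content": "<|image_1|>\nUsing LaTeX to perform OCR on the image."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

generate_ids = model.generate(**inputs, max_new_tokens=128,
                              eos_token_id=processor.tokenizer.eos_token_id)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])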

Results of the fine-tuned model on the validation set (due to time constraints, we only fine-tuned for 1000 steps):

[screenshot: validation set inference results]

praymich commented 2 weeks ago

Hello, I'd like to ask: if I fine-tune on roughly 30,000+ Chinese image-text samples in this way, can the model gain the ability to respond in Chinese and learn the information contained in the data?