Closed by Jintao-Huang 1 month ago
The custom dataset format is as follows (single-image, multi-image, and image-free samples):
{"query": "<image>55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee<image>eeeee<image>eeeee", "response": "fffff", "history": [], "images": ["image_path1", "image_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]], "images": []}
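As a sketch, the JSONL samples above can be generated with a few lines of Python (the queries, responses, and image paths below are placeholders for illustration):

```python
import json

# Samples in the custom-dataset format shown above:
# single image, multiple images, and no image.
samples = [
    {"query": "<image>Describe the image", "response": "A cat.",
     "images": ["cat.jpg"]},
    {"query": "a<image>b<image>c", "response": "d",
     "history": [], "images": ["1.jpg", "2.jpg"]},
    {"query": "Hello", "response": "Hi!",
     "history": [["query1", "response1"]], "images": []},
]

# Write one JSON object per line (JSONL).
with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

`ensure_ascii=False` keeps non-ASCII text (e.g. Chinese queries) readable in the file instead of escaping it.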
Fine-tuning script:
# To modify num_crops, you can use the environment variable: `NUM_CROPS=16` (default is 4).
# ModelScope
CUDA_VISIBLE_DEVICES=0,1,2,3 NPROC_PER_NODE=4 swift sft \
--model_type phi3_5-vision-instruct \
--sft_type lora \
--dataset latex-ocr-print#20000 \
--deepspeed default-zero2 \
--output_dir output \
--num_train_epochs 5 \
--use_flash_attn false
# HuggingFace
USE_HF=1 CUDA_VISIBLE_DEVICES=0,1,2,3 NPROC_PER_NODE=4 swift sft \
--model_type phi3_5-vision-instruct \
--model_id_or_path microsoft/Phi-3.5-vision-instruct \
--sft_type lora \
--dataset latex-ocr-print#20000 \
--deepspeed default-zero2 \
--output_dir output \
--num_train_epochs 5 \
--use_flash_attn false
If you want to use a custom dataset, simply specify it as follows:
--dataset train.jsonl \
--val_dataset val.jsonl \
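If you only have a single JSONL file, one simple way to produce `train.jsonl` and `val.jsonl` is a random split. This is a minimal sketch; the `split_jsonl` helper and the 95/5 ratio are illustrative choices, not something the framework requires:

```python
import json
import random

def split_jsonl(path, train_path="train.jsonl", val_path="val.jsonl",
                val_ratio=0.05, seed=42):
    """Randomly split one JSONL file into train/val files.

    Returns (n_train, n_val). A fixed seed makes the split reproducible.
    """
    with open(path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    random.Random(seed).shuffle(lines)
    n_val = max(1, int(len(lines) * val_ratio))
    with open(val_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:n_val])
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[n_val:])
    return len(lines) - n_val, n_val
```

Note that `--val_dataset` is optional; if omitted, swift automatically splits a validation set from the training data.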
One of the data samples:
Number of trainable parameters:
GPU Memory
Training process
Train Loss (Due to time constraints, we only fine-tuned for 1000 steps):
Here is the inference script to run after fine-tuning; we perform inference on the automatically split validation set:
# To run a full test, please set: `--show_dataset_sample -1`
# If using HuggingFace, please add: `USE_HF=1`
# inference only
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/phi3_5-vision-instruct/vx-xxx/checkpoint-xxx \
--load_dataset_config true --use_flash_attn false
# merge-lora & inference
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/phi3_5-vision-instruct/vx-xxx/checkpoint-xxx \
--load_dataset_config true --merge_lora true \
--safe_serialization false --use_flash_attn false
Results of the fine-tuned model on the validation set (due to time constraints, we only fine-tuned for 1000 steps):
Hello, I'd like to ask: if I fine-tune on roughly 30,000+ Chinese image-text samples this way, can the model gain the ability to respond in Chinese and actually learn the information in the data?
Hugging Face model: https://huggingface.co/microsoft/Phi-3.5-vision-instruct
Fine-tuning dataset: https://huggingface.co/datasets/linxy/LaTeX_OCR
Fine-tuning a multimodal large model usually involves a custom dataset; here we demonstrate a runnable demo.
Before starting fine-tuning, please ensure that your environment is properly prepared.