modelscope / ms-swift

Use PEFT or Full-parameter to finetune 350+ LLMs or 90+ MLLMs. (Qwen2.5, GLM4v, Internlm2.5, Yi, Llama3.1, Llava-Video, Internvl2, MiniCPM-V-2.6, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

关于多图微调和推理问题 #1074

Closed 1028686314 closed 3 weeks ago

1028686314 commented 3 months ago

Hi, I'd like to ask: if I want to feed in as many images as possible when fine-tuning and running inference with internvl-chat-v1.5, which settings do I need to change? Is adjusting the max_length parameter enough? If I change that parameter for SFT, will it hurt model quality? In my tests, I can currently input about 8 images at most. I now want to do SFT on video-frame data, and 8 images is not quite enough for that.
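For context, a common way to turn a video into a fixed-size multi-image SFT sample is to uniformly sample frame indices. A minimal sketch of that sampling step (this is illustrative, not part of the ms-swift API; the frame counts are made up):

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list:
    """Uniformly pick num_samples frame indices from a video with total_frames frames."""
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the midpoint of each of the num_samples equal-width segments
    return [int(step * i + step / 2) for i in range(num_samples)]

# e.g. pick 8 frames from a 300-frame clip to build one multi-image sample
print(sample_frame_indices(300, 8))
```

Each sampled frame then becomes one image in the training sample, so the number of frames you keep is bounded by how many images the model (and max_length) can accommodate.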

hjh0119 commented 3 months ago

See https://github.com/OpenGVLab/InternVL/issues/223

cyj95 commented 2 months ago

According to the Hugging Face model card (https://huggingface.co/OpenGVLab/InternVL2-26B), InternVL2 has added multi-image training data:

InternVL 2.0 is trained with an 8k context window and utilizes training data consisting of long texts, multiple images, and videos, significantly improving its ability to handle these types of inputs compared to InternVL 1.5.

It also supports multi-image inference:

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL2-26B'
chat_template_config = ChatTemplateConfig(model_name='internvl-internlm2')
pipe = pipeline(model, chat_template_config=chat_template_config,
                backend_config=TurbomindEngineConfig(session_len=8192))

image_urls = [
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

In swift, multiple images are concatenated into a single unified token sequence (https://github.com/modelscope/swift/blob/main/swift/llm/utils/template.py#L1231-L1236). Is the multi-image handling above different from how swift handles multiple images? Would that difference have any impact?
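To make the contrast concrete, here is a toy sketch of the two prompt layouts being compared: all image placeholders concatenated into one block versus each image getting a numbered label. The placeholder string and function names are illustrative assumptions, not the actual swift template code:

```python
IMAGE_PLACEHOLDER = '<image>'  # stand-in for the model's image context tokens

def concat_style(query: str, num_images: int) -> str:
    # All image placeholders concatenated up front as one unified block
    return IMAGE_PLACEHOLDER * num_images + '\n' + query

def numbered_style(query: str, num_images: int) -> str:
    # Each image labeled explicitly, Image-1, Image-2, ...
    header = '\n'.join(f'Image-{i + 1}: {IMAGE_PLACEHOLDER}' for i in range(num_images))
    return header + '\n' + query

print(concat_style('describe these two images', 2))
print(numbered_style('describe these two images', 2))
```

With the numbered layout the model can be asked about "Image-2" unambiguously, whereas the concatenated layout gives it no handle to refer to an individual image by name.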

hjh0119 commented 2 months ago

@cyj95 The template for the internvl2 model is at https://github.com/modelscope/swift/blob/main/swift/llm/utils/template.py#L1386 — its multi-image handling is aligned with the official implementation.