See https://github.com/OpenGVLab/InternVL/issues/223
According to the Hugging Face model card (https://huggingface.co/OpenGVLab/InternVL2-26B), InternVL2 has added multi-image training data:
InternVL 2.0 is trained with an 8k context window and utilizes training data consisting of long texts, multiple images, and videos, significantly improving its ability to handle these types of inputs compared to InternVL 1.5.
Multi-image inference is also supported:
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

# `model` and `chat_template_config` are defined earlier in the model card
# example and are not shown in this excerpt.
pipe = pipeline(model, chat_template_config=chat_template_config,
                backend_config=TurbomindEngineConfig(session_len=8192))
image_urls = [
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]
images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)
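For more than two images, the numbered prompt in the snippet above can be generated rather than written by hand. The following is a minimal sketch, not part of lmdeploy: `build_numbered_prompt` is a hypothetical helper, and the default placeholder string `"<IMAGE_TOKEN>"` is an assumption. In practice, pass the `IMAGE_TOKEN` constant imported from `lmdeploy.vl.constants` instead.

```python
def build_numbered_prompt(num_images: int, question: str,
                          image_token: str = "<IMAGE_TOKEN>") -> str:
    """Return a prompt with one numbered placeholder line per image.

    `image_token` defaults to an assumed placeholder string; pass the real
    IMAGE_TOKEN constant from lmdeploy when using this with a pipeline.
    """
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    # One "Image-i: <placeholder>" line per image, numbered from 1.
    lines = [f"Image-{i}: {image_token}" for i in range(1, num_images + 1)]
    # The question goes last, after all image placeholders.
    lines.append(question)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_numbered_prompt(2, "describe these two images"))
```

The resulting string would be passed to the pipeline together with the list of loaded images, in the same `(prompt, images)` tuple form used above.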
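In swift, multiple images are concatenated into one unified run of tokens (https://github.com/modelscope/swift/blob/main/swift/llm/utils/template.py#L1231-L1236). Is the multi-image handling shown above different from how swift handles multiple images? Would the difference have any effect?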
@cyj95 The template for the internvl2 models is at https://github.com/modelscope/swift/blob/main/swift/llm/utils/template.py#L1386. Its multi-image handling is aligned with the official implementation.
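Hello, I have a question: if I want to feed as many images as possible into internvl-chat-v1.5 for fine-tuning and inference, which settings do I need to change? Is changing the max_length parameter enough? If I change that parameter for SFT, will it hurt model quality? In my tests so far, I can input about 8 images at most. I now want to run SFT on frames sampled from videos, and 8 images is not quite enough for that.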
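As a rough way to reason about how many images fit within a given max_length, the token budget can be estimated with simple arithmetic. This is only a sketch under stated assumptions: `max_images_that_fit` is a hypothetical helper, the figure of 256 visual tokens per 448x448 tile is an assumption about InternVL's image encoding that should be checked against the actual template, and any wrapper tokens around each image are ignored. It says nothing about whether a longer max_length affects model quality.

```python
def max_images_that_fit(max_length: int, text_tokens: int,
                        tiles_per_image: int,
                        tokens_per_tile: int = 256) -> int:
    """Estimate how many images fit in a context of `max_length` tokens.

    tokens_per_tile=256 is an assumed value, not read from any library.
    tiles_per_image should count every tile the preprocessor emits per
    image, including a thumbnail tile if one is added.
    """
    if tiles_per_image < 1 or tokens_per_tile < 1:
        raise ValueError("tiles_per_image and tokens_per_tile must be >= 1")
    tokens_per_image = tiles_per_image * tokens_per_tile
    # Tokens left for images once the text prompt and response are reserved.
    available = max_length - text_tokens
    if available <= 0:
        return 0
    return available // tokens_per_image


if __name__ == "__main__":
    # Illustrative numbers only: 200 tokens reserved for text.
    for max_length in (4096, 8192):
        for tiles in (1, 2):
            n = max_images_that_fit(max_length, 200, tiles)
            print(f"max_length={max_length}, tiles_per_image={tiles}: {n} images")
```

Under these assumptions, the number of images scales linearly with max_length and inversely with the number of tiles per image, so reducing the tiles per frame is another lever besides raising max_length when working with video frames.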