I have fine-tuned the Qwen2-VL 7B model and am trying to run inference, but I can't figure out how to do it. The inference command used during fine-tuning is as follows:
However, this command is run via the CLI. I would like to convert it into a script that takes a video and a prompt as input and returns the output, so it can be used behind an API. I have looked at an example for the non-fine-tuned version of the model, which uses single-sample inference, as shown below:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = ModelType.qwen2_vl_7b_instruct
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                        model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

query = """<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it from each city?"""
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')

# Streaming
query = 'What is the farthest city?'
gen = inference_stream(model, template, query, history)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
I want to do something similar with my fine-tuned model: a script into which I can pass a video and a prompt and get back the result.
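Roughly, this is the kind of script I am aiming for. The sketch below is only my guess at how the pieces fit together: the checkpoint path is a placeholder, I am assuming the fine-tuned weights are a LoRA adapter that Swift.from_pretrained can attach, and I am assuming a <video>...</video> tag in the query is handled the same way as the <img> tag in the example above. I am not sure any of that is correct.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
import torch
from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type
)
from swift.tuners import Swift
from swift.utils import seed_everything

# Placeholder -- replace with the actual checkpoint directory produced by fine-tuning.
ckpt_dir = 'output/qwen2-vl-7b-instruct/vx-xxx/checkpoint-xxx'

model_type = ModelType.qwen2_vl_7b_instruct
template_type = get_default_template_type(model_type)

model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                        model_kwargs={'device_map': 'auto'})
# Assumption: the fine-tuned weights are a LoRA adapter that can be attached here.
model = Swift.from_pretrained(model, ckpt_dir, inference_mode=True)
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)


def run_inference(video_path: str, prompt: str) -> str:
    # Assumption: a <video>...</video> tag is handled like the <img> tag in the example above.
    query = f'<video>{video_path}</video>{prompt}'
    response, _ = inference(model, template, query)
    return response


if __name__ == '__main__':
    print(run_inference('/path/to/video.mp4', 'Describe what happens in this video.'))

If the fine-tuned weights were merged into the base model rather than kept as a separate adapter, I assume the merged directory would be loaded directly instead of calling Swift.from_pretrained, but I am not sure about that either.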
I would appreciate any guidance on how to adapt this for the fine-tuned version of the model.