open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

[Feature] When evaluating with vLLM, how can I get HF-style multi-GPU data parallelism? #1002

Open noforit opened 5 months ago

noforit commented 5 months ago

Describe the feature

When evaluating, my model type is vllm, with the parameters shown in the screenshot. However, GPU usage shows only one card being used for the evaluation task (screenshot). I would like the task to be split into several shards and evaluated on 8 GPUs in parallel. Could this feature be added, or is it already possible? If so, please explain how. Much appreciated! For comparison, if I set the model type to HF, this happens automatically (screenshots).

Would you like to implement this feature yourself?

liushz commented 5 months ago

As in the config above (screenshot), you can set model_kwargs=dict(tensor_parallel_size=8) for your case.
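
For reference, a minimal sketch of such a model config; the abbr, path, and batch settings here are illustrative placeholders, not taken from the screenshot:

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='my-model-vllm',      # placeholder name
        path='/path/to/model',     # placeholder checkpoint path
        # Shard a single model instance across all 8 GPUs (tensor parallelism).
        model_kwargs=dict(tensor_parallel_size=8),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=32,
    )
]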

noforit commented 5 months ago

@liushz Thank you for your response; I appreciate your clarification. However, the parameter in your reply pertains to setting tensor parallelism in vLLM. My intention is to load the entire model onto each of the eight GPUs, thereby distributing tasks in parallel across these GPUs. This approach should theoretically yield an eightfold acceleration in evaluation speed.

andakai commented 5 months ago

Hi @liushz, I would also like to know how to achieve data parallelism in vLLM when evaluating.

tonysy commented 5 months ago

Please try NumWorkerPartitioner https://github.com/open-compass/opencompass/blob/main/opencompass/partitioners/num_worker.py#L17

noforit commented 5 months ago

@tonysy Could you possibly offer a quick example? I'm quite unsure how to use it. Many thanks for your assistance.

IcyFeather233 commented 5 months ago

I think this is covered in the vLLM docs: https://docs.vllm.ai/en/latest/serving/distributed_serving.html. Setting tensor_parallel_size equal to the number of GPUs works for me.

noforit commented 5 months ago

@IcyFeather233 Thanks 😂. I understand that tensor_parallel_size can be set to the GPU count (2, 4, 8) to get sharded model parallelism. What I mean is setting tensor_parallel_size to 1, loading a full copy of the model on every GPU, and then doing data parallelism, so that different shards of the same task are evaluated at the same time. I recently got this working with NumWorkerPartitioner. The key parameter config is in the screenshot below for anyone who needs it. @darrenglow. Thanks also to @tonysy. It would be great if this could be added to the docs soon.
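
Since the screenshot is not preserved, here is a hedged reconstruction of the kind of config being described, based on the fields visible elsewhere in this thread; num_worker=8 and the runner settings are assumptions:

from opencompass.partitioners import NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

infer = dict(
    # Split every dataset into 8 shards, one per worker/GPU.
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(
        type=LocalRunner,
        max_num_workers=8,  # run up to 8 shards concurrently
        task=dict(type=OpenICLInferTask),
    ),
)

# And in each model dict: one full model copy per GPU, no tensor parallelism.
#   model_kwargs=dict(tensor_parallel_size=1),
#   run_cfg=dict(num_gpus=1, num_procs=1),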

Zbaoli commented 4 months ago

@noforit This is how I configured it, but still only one GPU is running. Could you help me see why?

from opencompass.models import VLLM
from opencompass.partitioners import NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

infer = dict(
    # Split each dataset into 2 shards, one per worker.
    partitioner=dict(type=NumWorkerPartitioner, num_worker=2),
    runner=dict(
        type=LocalRunner,
        max_num_workers=16,
        task=dict(type=OpenICLInferTask))
)
models = [
    dict(
        type=VLLM,
        abbr='qwen-7b-chat-vllm',
        path="/home/zbl/data/llm/qwen/Qwen-7B-Chat",
        # One full model copy per GPU; no tensor parallelism.
        model_kwargs=dict(tensor_parallel_size=1),
        meta_template=_meta_template,  # defined elsewhere in the config
        max_out_len=100,
        max_seq_len=2048,
        batch_size=100,
        generation_kwargs=dict(temperature=0),
        end_str='<|im_end|>',
    )
]

Zbaoli commented 4 months ago

@IcyFeather233 I know what you mean: the tensor_parallel_size parameter enables multi-GPU inference, but in my tests multi-GPU inference was no faster than a single GPU. What I want instead is to run multiple tasks in parallel: for example, with n tasks and m model instances, each instance runs inference for one task at the same time.

noforit commented 4 months ago

@Zbaoli Comparing your parameters with mine, yours is missing one (screenshot). Try adding it?

Zbaoli commented 4 months ago

@noforit Thanks for the reply, but after adding run_cfg=dict(num_gpus=1, num_procs=1) to the models config, still only one GPU is running.
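
For reference, a sketch of where that entry sits in the model dict, assuming the parameter from the earlier screenshot is this run_cfg entry (note the field name is num_procs):

models = [
    dict(
        type=VLLM,
        # ... same fields as in the config above ...
        model_kwargs=dict(tensor_parallel_size=1),
        # One GPU and one process per task shard.
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]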

noforit commented 4 months ago

@Zbaoli Strange 😂. What about setting CUDA_VISIBLE_DEVICES before running the program (screenshot)? Or try debugging inside /opencompass/opencompass/runners/local.py, where the number of available GPUs is detected automatically. Want to add me on WeChat? I'll send you an email.
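
For example, something like the following, assuming an 8-GPU machine and a config file named eval_config.py (the file name is a placeholder):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run.py eval_config.py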

guoaoo commented 4 months ago

@noforit After using NumWorkerPartitioner here, the dataset gets split into 8 shards, but the final summary fails to aggregate the metrics of the split shards back together. Does this happen for you too?

caotianjia commented 4 months ago

May I ask: doesn't the SizePartitioner that OpenCompass provides already split the dataset? Or is NumWorkerPartitioner's partitioning more efficient?

bittersweet1999 commented 4 months ago

SizePartitioner and NumWorkerPartitioner are two different splitting strategies: one splits tasks by a given size, the other splits them by the number of workers (GPUs).
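
A side-by-side sketch of the two styles of infer config; the size and worker counts are illustrative, and a real config would keep only one infer block:

from opencompass.partitioners import NumWorkerPartitioner, SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

# Split by size: each task holds at most max_task_size samples.
infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=2000),
    runner=dict(type=LocalRunner, max_num_workers=8,
                task=dict(type=OpenICLInferTask)),
)

# Split by worker count: each dataset is divided into num_worker shards,
# which maps naturally onto one shard per GPU for data parallelism.
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, max_num_workers=8,
                task=dict(type=OpenICLInferTask)),
)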

disperaller commented 2 months ago

(screenshot) When using vLLM, I keep getting a timeout error; I don't know why. The upper part is the model config and the lower part is the error. What could be going wrong?