open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

[Bug] When evaluating with an API model, the batch size does not take effect; only one request is ever in flight at a time #971

Closed xyfZzz closed 3 months ago

xyfZzz commented 5 months ago

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

When evaluating with an API model, the batch size does not take effect; only one request is ever being inferred at a time.

Reproduces the problem - code/configuration sample

When evaluating with an API model, the batch size does not take effect; only one request is ever being inferred at a time.

Reproduces the problem - command or script

When evaluating with an API model, the batch size does not take effect; only one request is ever being inferred at a time.

Reproduces the problem - error message

When evaluating with an API model, the batch size does not take effect; only one request is ever being inferred at a time.

Other information

When evaluating with an API model, the batch size does not take effect; only one request is ever being inferred at a time.

bittersweet1999 commented 5 months ago

Can you provide a detailed config so I can help you?

xyfZzz commented 5 months ago

> Can you provide a detailed config so I can help you?

eval_openai.py:

from copy import deepcopy
from mmengine.config import read_base

with read_base():
    from .datasets.teval.teval_en_gen_1ac254 import teval_datasets as teval_en_datasets
    from .datasets.teval.teval_zh_gen_1ac254 import teval_datasets as teval_zh_datasets

    # from .models.qwen.hf_qwen_7b_chat import models as hf_qwen_7b_chat_model
    # from .models.hf_internlm.hf_internlm2_chat_7b import models as hf_internlm2_chat_7b_model
    # from .models.hf_llama.hf_llama2_7b_chat import models as hf_llama2_7b_chat_model
    from .models.openai.gpt_3_5_turbo import models as gpt_3_5_model

    from .summarizers.teval import summarizer

meta_template_system_patches = {
    'internlm2-chat-7b-hf': dict(role='SYSTEM', begin='<|im_start|>system\n', end='<|im_end|>\n'),
    'internlm2-chat-20b-hf': dict(role='SYSTEM', begin='<|im_start|>system\n', end='<|im_end|>\n'),
}

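# Collect every imported model list (the config variables ending in "_model")
# and patch a SYSTEM round into any meta_template that lacks one.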
_origin_models = sum([v for k, v in locals().items() if k.endswith("_model")], [])
models = []
for m in _origin_models:
    m = deepcopy(m)
    if 'meta_template' in m and 'round' in m['meta_template']:
        round = m['meta_template']['round']
        if all(r['role'].upper() != 'SYSTEM' for r in round):  # no system round
            if m['abbr'] in meta_template_system_patches:
                system_round = meta_template_system_patches[m['abbr']]
            else:
                system_round = [r for r in round if r['role'].upper() == 'HUMAN'][0]
                system_round = deepcopy(system_round)
                system_round['role'] = 'SYSTEM'
            m['meta_template']['round'].append(system_round)
    else:
        raise ValueError(f'no meta_template.round in {m.get("abbr", None)}')

    print(f'model {m["abbr"]} is using the following meta_template: {m["meta_template"]}')
    models.append(m)

datasets = teval_en_datasets + teval_zh_datasets
work_dir = './outputs/teval'

Then I printed the length of `inputs` in the `generate` function of the OpenAI script:

    def generate(
        self,
        inputs: List[str or PromptList],
        max_out_len: int = 512,
        temperature: float = 0.7,
    ) -> List[str]:
        """Generate results given a list of inputs.

        Args:
            inputs (List[str or PromptList]): A list of strings or PromptDicts.
                The PromptDict should be organized in OpenCompass'
                API format.
            max_out_len (int): The maximum length of the output.
            temperature (float): What sampling temperature to use,
                between 0 and 2. Higher values like 0.8 will make the output
                more random, while lower values like 0.2 will make it more
                focused and deterministic. Defaults to 0.7.

        Returns:
            List[str]: A list of generated strings.
        """
        if self.temperature is not None:
            temperature = self.temperature

        print("openai len(inputs): ", len(inputs))

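        # NOTE: executor.map only parallelizes across this one batch, so
        # len(inputs) == 1 means a single in-flight request.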
        with ThreadPoolExecutor() as executor:
            results = list(
                executor.map(self._generate, inputs,
                             [max_out_len] * len(inputs),
                             [temperature] * len(inputs)))
        return results

The printed length is always 1, so this multithreading never actually takes effect.
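To see why this serializes everything, note that `executor.map` only overlaps the requests within a single `generate` call. A standalone sketch (hypothetical code, not part of OpenCompass) that demonstrates the effect:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_api_call(prompt: str) -> str:
    """Stand-in for one API request with ~1s of network latency."""
    time.sleep(1)
    return f"response to {prompt!r}"

for batch in (["p1"], ["p1", "p2", "p3", "p4"]):
    start = time.time()
    with ThreadPoolExecutor() as executor:
        list(executor.map(fake_api_call, batch))
    print(f"batch of {len(batch)} finished in {time.time() - start:.1f}s")

# A 1-element batch takes ~1s, and a 4-element batch also takes ~1s,
# because the four calls overlap. A stream of 1-element batches is
# therefore fully serial, matching the behavior reported above.
```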

bittersweet1999 commented 5 months ago

You can try setting a larger max_num_workers in your runner.
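For reference, a minimal sketch of where that setting lives, assuming the standard OpenCompass infer-stage config layout (values illustrative, adjust to your setup):

```python
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

# Sketch of an infer-stage runner config. max_num_workers controls how
# many inference tasks run in parallel, and hence how many concurrent
# streams of API requests are in flight.
infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        max_num_workers=16,  # illustrative value
        task=dict(type=OpenICLInferTask),
    ),
)
```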

xyfZzz commented 5 months ago

> You can try setting a larger max_num_workers in your runner.

ok

xyfZzz commented 5 months ago

> You can try setting a larger max_num_workers in your runner.

@bittersweet1999 After the API model calls completed, an error was raised complaining about the number of GPUs. But I'm evaluating an API model, so why is a GPU still needed?

```
100%|██████████| 22/22 [6:20:04<00:00, 1036.57s/it]
  0%|          | 0/16 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/xie/code/zone/github/new/opencompass_backup/run.py", line 357, in <module>
    main()
  File "/home/xie/code/zone/github/new/opencompass_backup/run.py", line 344, in main
    runner(tasks)
  File "/home/xie/code/zone/github/new/opencompass_backup/opencompass/runners/base.py", line 39, in __call__
    self.summarize(status)
  File "/home/xie/code/zone/github/new/opencompass_backup/opencompass/runners/base.py", line 61, in summarize
    for _task, code in status:
  File "/home/xie/apps/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/home/xie/apps/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/home/xie/apps/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/home/xie/apps/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/xie/apps/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/xie/code/zone/github/new/opencompass_backup/opencompass/runners/local.py", line 122, in submit
    assert len(gpus) >= num_gpus
AssertionError
```

bittersweet1999 commented 5 months ago

In the evaluation stage of T-Eval, a GPU is recommended so that a transformers model can be loaded efficiently to compare the model's predictions against the gold-standard answers. If you don't want to use a GPU, just change here:

https://github.com/open-compass/opencompass/blob/3098d788455dc785e6830f8c69eb9d1010c0cce1/configs/datasets/teval/teval_en_gen_1ac254.py#L39

However, be aware that performance will be significantly slower if you opt to run the evaluation on a CPU instead.
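For context, the evaluation stage is scheduled by its own runner block, separate from inference; that is why the AssertionError above only appears after all API calls finish, and why the eval task (not the API model) is what requests GPUs. A minimal sketch of the usual OpenCompass eval-stage layout (values illustrative, not taken from this thread):

```python
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask

# Sketch of an eval-stage runner config. The eval task is the component
# that requests GPUs when a dataset's evaluator loads a local
# transformers model, as T-Eval's does.
eval = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        max_num_workers=2,  # illustrative value
        task=dict(type=OpenICLEvalTask),
    ),
)
```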

xyfZzz commented 5 months ago

> In the evaluation stage of T-Eval, a GPU is recommended so that a transformers model can be loaded efficiently to compare the model's predictions against the gold-standard answers. If you don't want to use a GPU, just change here:
>
> https://github.com/open-compass/opencompass/blob/3098d788455dc785e6830f8c69eb9d1010c0cce1/configs/datasets/teval/teval_en_gen_1ac254.py#L39
>
> However, be aware that performance will be significantly slower if you opt to run the evaluation on a CPU instead.

Do you mean that a model also has to be loaded on a GPU to compare the results? What model is used for the comparison? Can it be switched to an API model instead?

bittersweet1999 commented 5 months ago

> In the evaluation stage of T-Eval, a GPU is recommended so that a transformers model can be loaded efficiently to compare the model's predictions against the gold-standard answers. If you don't want to use a GPU, just change here:
> https://github.com/open-compass/opencompass/blob/3098d788455dc785e6830f8c69eb9d1010c0cce1/configs/datasets/teval/teval_en_gen_1ac254.py#L39
>
> However, be aware that performance will be significantly slower if you opt to run the evaluation on a CPU instead.

> Do you mean that a model also has to be loaded on a GPU to compare the results? What model is used for the comparison? Can it be switched to an API model instead?

Yes, during the evaluation stage a GPU is required to load the comparison model; this step cannot be performed with an API model.

bittersweet1999 commented 3 months ago

Feel free to reopen it if needed.