Prerequisite
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) on 100+ datasets.
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
These are my CPU and GPU; I used the following machine for the test, with the worker limit set to 32 (max_num_workers=32).
CPU Info: 255 AMD EPYC 7713 64-Core Processor
GPU Info: NVIDIA H800-SXM4-80GB x 8
Reproduces the problem - code/configuration sample
Official code (no modifications).
Reproduces the problem - command or script
I used a ~2B model (for example Qwen1.5-1.8B) to test 13 datasets; the model was loaded via HuggingFace, along the lines of the config sketched below.
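For reference, the model config follows the stock HuggingFace examples shipped with OpenCompass. This is a sketch rather than my exact config; the hub path, abbr, and kwargs are illustrative:

from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='qwen1.5-1.8b-hf',             # illustrative abbreviation
        path='Qwen/Qwen1.5-1.8B',           # assumed HF hub path
        tokenizer_path='Qwen/Qwen1.5-1.8B',
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]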
I recorded the time taken for each stage, and found that the infer task took about 20 minutes and the evaluation task about 12 minutes.
The files used to compute the ppl and gen scores (the predictions dir) total only about 500 MB, so why does the evaluation take 12 minutes even with multiprocessing? Shouldn't scoring 500 MB of predictions finish in about a minute?
I read the code that runs the evaluation task (opencompass/runners/local.py, lines 61~210) and found that the most time-consuming part of the evaluation is the serialization and deserialization of the task config file (writing it to disk and loading it back). The code looks like this:
# opencompass/runners/local.py, lines 180~188
# Dump task config to file
mmengine.mkdir_or_exist('tmp/')
param_file = f'tmp/{os.getpid()}_{index}_params.py'
try:
    task.cfg.dump(param_file)  # ************** the most time-consuming
    tmpl = get_command_template(gpu_ids)
    get_cmd = partial(task.get_command,
                      cfg_path=param_file,
                      template=tmpl)
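To confirm where the time goes, a minimal timing sketch (my own helper, not OpenCompass code) can wrap the same dump/load round trip; Config.fromfile stands in for what the spawned subprocess does when it re-reads the param file:

import os
import time

import mmengine
from mmengine.config import Config

def time_cfg_roundtrip(cfg, index=0):
    # Same pattern as local.py: dump the task config, then re-load it.
    mmengine.mkdir_or_exist('tmp/')
    param_file = f'tmp/{os.getpid()}_{index}_params.py'

    t0 = time.perf_counter()
    cfg.dump(param_file)         # serialization (disk landing)
    t1 = time.perf_counter()
    Config.fromfile(param_file)  # deserialization (loading)
    t2 = time.perf_counter()

    print(f'dump: {t1 - t0:.2f}s, load: {t2 - t1:.2f}s')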
When launching tasks, I split evaluation tasks from inference tasks. The inference path is unchanged, while the evaluation path no longer goes through the per-task dump-and-subprocess step:
Step 1: Modify the submit function (opencompass/runners/local.py, line 133)
def submit(task, index):
    # ...
    if num_gpus > 0:
        tqdm.write(f'launch {task.name} on GPU ' +
                   ','.join(map(str, gpu_ids)))
    else:
        tqdm.write(f'launch {task.name} on CPU ')

    # >>> modified part: dispatch eval tasks and infer tasks separately
    if "OpenICLEvalTask" in self.task_cfg['type']:
        res = self._launch_eval(task, gpu_ids, index)
    else:
        res = self._launch_infer(task, gpu_ids, index)  # old self._launch

    pbar.update()
    with lock:
        gpus[gpu_ids] += 1
    return res
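Step 2 is the _launch_eval helper, which runs the eval task in the current process instead of dumping its config and spawning a subprocess. A rough sketch of the idea (the task construction and return value follow my reading of local.py and may differ from the real internals):

def _launch_eval(self, task, gpu_ids, index):
    # Skip the cfg.dump()/param-file/subprocess round trip entirely:
    # the config object already lives in this process, so run the
    # eval task directly.
    from opencompass.tasks import OpenICLEvalTask
    inner_task = OpenICLEvalTask(task.cfg)
    inner_task.run()
    return task.name, 0  # mimic the (task name, returncode) of _launch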
With the revised code, evaluating the same 13 datasets took only about 40 seconds (down from ~12 minutes).
I hope the maintainers can fix this performance issue. With my current modification, logs are no longer written to the per-dataset evaluation log files, so I have not created a PR.
Reproduces the problem - error message
No error message; this issue reports an evaluation-time improvement.
Other information
No response