open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

[Improvement] Multiprocess Evaluation Time Bug #1115

Open wdndev opened 6 months ago

wdndev commented 6 months ago

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

These are my CPU and GPU; I used the following machine for the test, with max-num-workers=32.

CPU Info :     255  AMD EPYC 7713 64-Core Processor
GPU Info : NVIDIA H800-SXM4-80GB x 8    

Reproduces the problem - code/configuration sample

Official code, unmodified.

Reproduces the problem - command or script

  1. I used a 2B model (for example Qwen1.5-1.8B) to test 13 datasets; the model was loaded via Hugging Face.
  2. I recorded the time taken for each stage and found that the inference task took about 20 minutes and the evaluation task took about 12 minutes.
  3. The files used to compute the ppl and gen scores (in the predictions dir) total only about 500 MB, so why does the evaluation take 12 minutes even with multiprocessing? Shouldn't processing 500 MB finish in about a minute?
  4. I read the code that runs the evaluation task (opencompass/runners/local.py, lines 61 ~ 210) and found that the most time-consuming part of the evaluation is the serialization and deserialization of the task config (dumping it to disk, then reloading it from disk). The code looks like this (a timing sketch follows the excerpt):
    # opencompass/runners/local.py  line 180 ~ 188
        # Dump task config to file
        mmengine.mkdir_or_exist('tmp/')
        param_file = f'tmp/{os.getpid()}_{index}_params.py'
        try:
            task.cfg.dump(param_file)     # **************  the most time-consuming
            tmpl = get_command_template(gpu_ids)
            get_cmd = partial(task.get_command,
                              cfg_path=param_file,
                              template=tmpl)
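
    To confirm the bottleneck, the round-trip can be timed in isolation. The snippet below is a hypothetical sketch, not OpenCompass code: `time_cfg_roundtrip` is a made-up helper, and it assumes `task.cfg` is the `mmengine.Config` that local.py dumps.

        # Hypothetical helper: time the config dump/reload round-trip that
        # local.py pays once per launched task.
        import os
        import time

        import mmengine

        def time_cfg_roundtrip(task):
            mmengine.mkdir_or_exist('tmp/')
            param_file = f'tmp/{os.getpid()}_timing_params.py'

            start = time.time()
            task.cfg.dump(param_file)             # parent: serialize to disk
            dump_s = time.time() - start

            start = time.time()
            mmengine.Config.fromfile(param_file)  # subprocess: reload on startup
            load_s = time.time() - start

            print(f'dump: {dump_s:.2f}s, reload: {load_s:.2f}s')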
  5. When launching tasks, I split the evaluation tasks from the inference tasks. The inference path is unchanged; the evaluation path drops the multiprocess (subprocess) launch and runs in-process instead:

    • step 1: Modify the `submit` function (opencompass/runners/local.py, line 133)

      def submit(task, index):
          # ...
          if num_gpus > 0:
              tqdm.write(f'launch {task.name} on GPU ' +
                         ','.join(map(str, gpu_ids)))
          else:
              tqdm.write(f'launch {task.name} on CPU ')

          # modified: route eval tasks to the new in-process launcher,
          # everything else to the old subprocess launcher
          if 'OpenICLEvalTask' in self.task_cfg['type']:
              res = self._launch_eval(task, gpu_ids, index)
          else:
              res = self._launch_infer(task, gpu_ids, index)  # old self._launch

          pbar.update()

          with lock:
              gpus[gpu_ids] += 1
          return res

    
    • step 2: Add a new `self._launch_eval` function:
      def _launch_eval(self, task, gpu_ids, index):
          # Run the eval task in-process, skipping the config dump and
          # subprocess spawn that self._launch performs.
          logger = get_logger()
          task_name = task.name
          out_path = task.get_log_path(file_extension='out')
          mmengine.mkdir_or_exist(osp.split(out_path)[0])

          from opencompass.tasks.openicl_eval import OpenICLEvalTask

          start_time = time.time()
          exitcode = 0
          try:
              # build and run the eval task directly from the in-memory
              # config -- no tmp/*.py round-trip
              eval_task = OpenICLEvalTask(task.cfg)
              eval_task.run()
          except Exception as e:
              logger.error(f'exception in {task_name}: {e}')
              exitcode = 1

          end_time = time.time()
          logger.info(f'time elapsed: {end_time - start_time:.2f}s')

          if exitcode != 0:
              logger.error(f'exitcode {exitcode}, task {task_name} failed, '
                           f'see\n{out_path}')

          return task_name, exitcode
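
    For reference, one could keep the eval tasks parallel while still avoiding the subprocess spawn, e.g. with a process pool that receives the in-memory configs directly. This is a hypothetical sketch, not part of the patch above (`_run_eval` and `launch_evals_in_pool` are made-up names), and it assumes `task.cfg` pickles cleanly across processes:

      # Hypothetical extension: evaluate tasks in parallel without spawning
      # a fresh interpreter (and re-parsing a dumped config) per task.
      from concurrent.futures import ProcessPoolExecutor

      def _run_eval(cfg):
          # import inside the worker so child processes stay lightweight
          from opencompass.tasks.openicl_eval import OpenICLEvalTask
          OpenICLEvalTask(cfg).run()
          return 0

      def launch_evals_in_pool(tasks, max_workers=8):
          # each worker gets the config object directly; no tmp/*.py files
          with ProcessPoolExecutor(max_workers=max_workers) as pool:
              futures = [pool.submit(_run_eval, t.cfg) for t in tasks]
              return [f.result() for f in futures]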
  6. With the revised code, evaluating the same 13 datasets took about 40 seconds, down from roughly 12 minutes.
  7. So I hope the maintainers can fix this bug. My current modification stops the per-dataset evaluation logs from being written to their own log files, so I have not created a PR (one possible workaround is sketched below).
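
One hypothetical way to restore the per-dataset log files with an in-process launcher is to attach a temporary `logging.FileHandler` pointing at the task's out file for the duration of the eval. A minimal sketch, assuming OpenCompass's logger ultimately goes through the standard `logging` module (`run_eval_with_log_file` is a made-up helper):

    # Hypothetical workaround for the lost per-dataset logs: wrap the
    # in-process eval with a temporary file handler on the root logger.
    import logging

    def run_eval_with_log_file(task, out_path):
        handler = logging.FileHandler(out_path)
        root = logging.getLogger()
        root.addHandler(handler)
        try:
            from opencompass.tasks.openicl_eval import OpenICLEvalTask
            OpenICLEvalTask(task.cfg).run()
        finally:
            # detach so later tasks do not append to this file
            root.removeHandler(handler)
            handler.close()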

Reproduces the problem - error message

Evaluation time improvement

Other information

No response

tonysy commented 6 months ago

Thanks for the report. We will look into this issue and update soon.