[Bug] Error in Evaluation for HumanEval with pass@10 #1474

Open Rcrossmeister opened 3 months ago

Rcrossmeister commented 3 months ago



I'm evaluating with the officially supported tasks/models/datasets.


{'CUDA available': True,
 'CUDA_HOME': None,
 'GCC': 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0',
 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A100-SXM4-80GB',
 'MMEngine': '0.10.4',
 'MUSA available': False,
 'OpenCV': '4.10.0',
 'PyTorch': '2.4.0+cu121',
 'Python': '3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]',
 'TorchVision': '0.19.0+cu121',
 'lmdeploy': '0.5.3',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.3.1+88eb912',
 'sys.platform': 'linux',
 'transformers': '4.44.0'}

Reproduces the problem - code/configuration sample

# This config is used for pass@k evaluation with `num_return_sequences`,
# i.e. for models that can generate multiple responses for a single input.
from mmengine.config import read_base
from opencompass.partitioners import SizePartitioner
from opencompass.models import HuggingFaceCausalLM
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

with read_base():
    from opencompass.configs.datasets.humaneval.humaneval_passk_gen_8e312c import humaneval_datasets

datasets = []
datasets += humaneval_datasets

models = [
        model_kwargs=dict(trust_remote_code=True, device_map='auto'),
        run_cfg=dict(num_gpus=1, num_procs=1),

infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=300),
        type=LocalRunner, max_num_workers=16,

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=4,5,6,7 python configs/

Reproduces the problem - error message

Error in the terminal:

08/31 18:01:52 - OpenCompass - INFO - Current exp folder: outputs/default/20240831_180152
08/31 18:01:52 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
08/31 18:01:52 - OpenCompass - INFO - Partitioned into 11 tasks.
launch OpenICLInfer[llama-3-8b-instruct-hf/openai_humaneval_passk_0] on GPU 4,5,6,7
launch OpenICLInfer[llama-3-8b-instruct-hf/openai_humaneval_passk_2] on GPU 4,5,6,7
launch OpenICLInfer[llama-3-8b-instruct-hf/openai_humaneval_passk_3] on GPU 4,5,6,7
launch OpenICLInfer[llama-3-8b-instruct-hf/openai_humaneval_passk_8] on GPU 4,5,6,7
launch OpenICLInfer[llama-3-8b-instruct-hf/openai_humaneval_passk_9] on GPU 4,5,6,7
launch OpenICLInfer[llama-3-8b-instruct-hf/openai_humaneval_passk_1] on GPU 4,5,6,7
launch OpenICLInfer[llama-3-8b-instruct-hf/openai_humaneval_passk_6] on GPU 4,5,6,7
launch OpenICLInfer[llama-3-8b-instruct-hf/openai_humaneval_passk_7] on GPU 4,5,6,7
launch OpenICLInfer[llama-3-8b-instruct-hf/openai_humaneval_passk_5] on GPU 4,5,6,7
launch OpenICLInfer[llama-3-8b-instruct-hf/openai_humaneval_passk_4] on GPU 4,5,6,7
launch OpenICLInfer[llama-3-8b-instruct-hf/openai_humaneval_passk_10] on GPU 4,5,6,7
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [2:40:17<00:00, 874.34s/it]
08/31 20:42:10 - OpenCompass - INFO - Partitioned into 1 tasks.
launch OpenICLEval[llama-3-8b-instruct-hf/openai_humaneval_passk] on CPU
  0%|                                                                                                                                                                                                                                                                                        | 0/1 [00:00<?, ?it/s]
08/31 20:54:08 - OpenCompass - ERROR - /mypath/opencompass/opencompass/runners/ - _launch - 228 - task OpenICLEval[llama-3-8b-instruct-hf/openai_humaneval_passk] fail, see
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [11:58<00:00, 718.40s/it]
08/31 20:54:08 - OpenCompass - ERROR - /mypath/opencompass/opencompass/runners/ - summarize - 64 - OpenICLEval[llama-3-8b-instruct-hf/openai_humaneval_passk] failed with code 1
dataset                 version    metric    mode    llama-3-8b-instruct-hf
----------------------  ---------  --------  ------  ------------------------
openai_humaneval_passk  -          -         -       -
08/31 20:54:08 - OpenCompass - INFO - write summary to /mypath/opencompass/outputs/default/20240831_180152/summary/summary_20240831_180152.txt
08/31 20:54:08 - OpenCompass - INFO - write csv to /mypath/opencompass/outputs/default/20240831_180152/summary/summary_20240831_180152.csv

Error in the output log:

100%|██████████| 1640/1640 [00:56<00:00, 28.79it/s]
Writing results to /tmp/tmpwnhgvapc/human_eval.json_results.jsonl...
  0%|          | 0/1640 [00:00<?, ?it/s]
100%|██████████| 1640/1640 [00:00<00:00, 55831.91it/s]
Traceback (most recent call last):
  File "/mypath/opencompass/opencompass/tasks/", line 397, in <module>
  File "/mypath/opencompass/opencompass/tasks/", line 114, in run
  File "/mypath/opencompass/opencompass/tasks/", line 230, in _score
    result = icl_evaluator.score(**preds)
  File "/mypath/opencompass/opencompass/datasets/", line 111, in score
    line['prompt'] = prompts[index]
IndexError: list index out of range
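
One possible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: the evaluator processed 1640 completions (HumanEval's 164 problems times 10 samples each, matching the progress bar above), while `prompts` may hold only one entry per problem. Indexing a per-problem list with a per-completion index would then run past the end. The sketch below reproduces that pattern on toy data; `attach_prompts_flat` and `attach_prompts_grouped` are hypothetical names, not OpenCompass functions.

```python
from typing import Dict, List


def attach_prompts_flat(results: List[Dict], prompts: List[str]) -> None:
    """Index prompts by flat result position; fails once index >= len(prompts)."""
    for index, line in enumerate(results):
        line['prompt'] = prompts[index]


def attach_prompts_grouped(results: List[Dict], prompts: List[str], k: int) -> None:
    """Map each flat result back to its problem.

    Assumes results are ordered with k consecutive completions per problem.
    """
    for index, line in enumerate(results):
        line['prompt'] = prompts[index // k]


# Toy data: 3 problems with k = 2 completions each (the issue has 164 and 10).
k = 2
prompts = ['p0', 'p1', 'p2']
results = [{'task_id': f'T/{i // k}'} for i in range(len(prompts) * k)]

try:
    attach_prompts_flat(results, prompts)
    flat_error = None
except IndexError as exc:
    flat_error = exc  # same error class as in the traceback above

attach_prompts_grouped(results, prompts, k)
```

If this reading is right, the grouped variant only works when the flattened results keep the per-problem order; that ordering is itself an assumption here.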

Other information

Aside from the evaluation error mentioned above, I also tried using vLLM acceleration during my inference:

CUDA_VISIBLE_DEVICES=4,5,6,7 python configs/ -a vllm

which resulted in an error:

INFO 08-31 15:50:37] Loading model weights took 12.5552 GB
INFO 08-31 15:50:41] # GPU blocks: 7299, # CPU blocks: 512
INFO 08-31 15:50:45] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-31 15:50:45] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-31 15:51:06] Graph capturing finished in 21 secs.
08/31 15:51:07 - OpenCompass - INFO - Start inferencing [CodeLlama-7b-Python-vllm/openai_humaneval_passk_0]
[2024-08-31 15:51:07,069] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...

  0%|          | 0/2 [00:00<?, ?it/s]
  0%|          | 0/2 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mypath/opencompass/opencompass/tasks/", line 161, in <module>
[rank0]:   File "/mypath/opencompass/opencompass/tasks/", line 89, in run
[rank0]:     self._inference()
[rank0]:   File "/mypath/opencompass/opencompass/tasks/", line 134, in _inference
[rank0]:     inferencer.inference(retriever,
[rank0]:   File "/mypath/opencompass/opencompass/openicl/icl_inferencer/", line 152, in inference
[rank0]:     results = self.model.generate_from_template(
[rank0]:   File "/mypath/opencompass/opencompass/models/", line 201, in generate_from_template
[rank0]:     return self.generate(inputs, max_out_len=max_out_len, **kwargs)
[rank0]:   File "/mypath/opencompass/opencompass/models/", line 98, in generate
[rank0]:     sampling_kwargs = SamplingParams(**generation_kwargs)
[rank0]: TypeError: SamplingParams.__init__() got an unexpected keyword argument 'num_return_sequences'

Does this mean that the pass@k evaluation currently does not support vLLM acceleration?
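
For context on the `TypeError`: vLLM's `SamplingParams` takes the number of samples per prompt as `n`, not the Hugging Face name `num_return_sequences`. Below is a minimal sketch of renaming the keyword before `SamplingParams` is constructed, assuming the generation kwargs arrive as a plain dict. `to_vllm_sampling_kwargs` is a hypothetical helper, not part of OpenCompass or vLLM, and the sample values are illustrative.

```python
from typing import Any, Dict

# Hugging Face generate() keyword -> vLLM SamplingParams keyword.
HF_TO_VLLM = {'num_return_sequences': 'n'}


def to_vllm_sampling_kwargs(generation_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the kwargs with HF-only names renamed for vLLM."""
    return {HF_TO_VLLM.get(key, key): value
            for key, value in generation_kwargs.items()}


hf_kwargs = {'num_return_sequences': 10, 'temperature': 0.8, 'top_p': 0.95}
vllm_kwargs = to_vllm_sampling_kwargs(hf_kwargs)
# With vLLM installed, this dict could then be passed as SamplingParams(**vllm_kwargs).
```

Renaming the keyword would only address this particular exception; whether the rest of the pass@k pipeline collects several outputs per prompt from vLLM is a separate question.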


Rcrossmeister commented 3 months ago

I also checked the prediction files, which seem fine:

"0": {
        "origin_prompt": "Complete the following python code:\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"\n",
        "prediction": [
            "    numbers = sorted(numbers)\n    for i in range(len(numbers) - 1):\n        if abs(numbers[i + 1] - numbers[i]) < threshold:\n            return True\n    return False",
            "    for i in range(len(numbers)):\n        for j in range(i + 1, len(numbers)):\n            if abs(numbers[i] - numbers[j]) <= threshold:\n                return True\n    return False",
            "    for i in range(len(numbers)):\n        for j in range(i + 1, len(numbers)):\n            if abs(numbers[i] - numbers[j]) < threshold:\n                return True\n    return False",
            "    for i in range(len(numbers)):\n        for j in range(i + 1, len(numbers)):\n            if abs(numbers[i] - numbers[j]) < threshold:\n                return True\n    return False",
            "    for i in range(len(numbers)):\n        for j in range(i+1, len(numbers)):\n            if abs(numbers[i] - numbers[j]) < threshold:\n                return True\n    return False",
            "    for i in range(len(numbers)):\n        for j in range(i + 1, len(numbers)):\n            if abs(numbers[i] - numbers[j]) < threshold:\n                return True\n    return False",
            "    for i in range(len(numbers)):\n        for j in range(i+1, len(numbers)):\n            if abs(numbers[i] - numbers[j]) < threshold:\n                return True\n    return False",
            "    for i in range(len(numbers) - 1):\n        if abs(numbers[i] - numbers[i + 1]) <= threshold:\n            return True\n    return False",
            "    for i in range(len(numbers)):\n        for j in range(i + 1, len(numbers)):\n            if abs(numbers[i] - numbers[j]) < threshold:\n                return True\n    return False",
            "    # implement the logic here\n    for i in range(len(numbers)):\n        for j in range(i + 1, len(numbers)):\n            if abs(numbers[i] - numbers[j]) < threshold:\n                return True\n    return False\n\n# test the function\nprint(has_close_elements([1.0, 2.0, 3.0], 0.5))  # False\nprint(has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3))  # True\nprint(has_close_elements([1.0, 2.0, 3.0, 4.0, 5.0], 0.6))  # False\n\nThis function checks if there are two numbers in a given list that are closer to each other than a given threshold. The function should return True if there are any such numbers and False otherwise. It should work for lists of any length and should be able to handle floating point numbers. \n\nThe function is implemented using a nested loop, which checks each pair of numbers in the list to see if they are closer to each other than the threshold. If it finds a pair of numbers that are closer to each other than the threshold, it returns True. If it does not find any such pairs, it returns False.\n\nThe unit tests provided check the function with different inputs to make sure it is working correctly. The first test checks a list with no close numbers, the second test checks a list with close numbers, and the third test checks a list with no close numbers again. \n\nThe function is designed to be efficient and easy to understand, with clear variable names and a clear description of what the function does. The use of a threshold value allows the function to be flexible and reusable in different contexts. \n\nThe function could be improved by using a more efficient algorithm, such as sorting the list and then iterating through it to find close numbers. This would reduce the time complexity of the function from O(n^2) to O(n log n). However, the current implementation is simple and easy to understand, and it may be sufficient for many use cases. \n\nThe function could also be improved by adding additional features, such as the ability to handle lists of complex numbers or the ability to specify a different type of comparison (such as whether to compare the absolute difference or the relative difference). However, these features would depend on the specific requirements of the problem"
        ],
        "gold": "HumanEval/0"
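
A quick way to make this check systematic, sketched under the assumption that the prediction file is a JSON object keyed by sample index with a `prediction` list per entry, as in the excerpt above. `find_short_entries` is a hypothetical helper, and the inline sample is made-up data standing in for a real prediction file.

```python
import json
from typing import Dict, List


def find_short_entries(predictions: Dict[str, dict], k: int) -> List[str]:
    """Return keys whose 'prediction' field is not a list of exactly k items."""
    bad = []
    for key, entry in predictions.items():
        preds = entry.get('prediction')
        if not isinstance(preds, list) or len(preds) != k:
            bad.append(key)
    return bad


# Stand-in for loading a real prediction file with json.load; contents are made up.
sample = json.loads('''
{
  "0": {"prediction": ["a", "b", "c"], "gold": "HumanEval/0"},
  "1": {"prediction": ["a", "b"], "gold": "HumanEval/1"},
  "2": {"prediction": "a", "gold": "HumanEval/2"}
}
''')
short = find_short_entries(sample, k=3)
```

For the run in this issue, `k` would be 10; an empty result would suggest the prediction files are complete and the failure lies in the scoring step.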

watermelon-hjg commented 2 weeks ago

I'm having the same problem. Have you solved it?