mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

Error when using the 'evaluate' package with 'sacrebleu' to calculate the metric #263

Closed TristanShao closed 3 months ago

TristanShao commented 4 months ago

Original code URL: when I run a simple case (adding some model files) with 'python3 local_evaluation.py' and a sample needs the BLEU metric, I encounter this error:

  File "xxx/amazon-kdd-cup-2024-starter-kit/local_evaluation.py", line 256, in <module>
    main()
  File "xxx/amazon-kdd-cup-2024-starter-kit/local_evaluation.py", line 241, in main
    per_task_metrics = evaluate_outputs(data_df, outputs)
  File "xxx/amazon-kdd-cup-2024-starter-kit/local_evaluation.py", line 99, in evaluate_outputs
    metric_score = eval_fn(model_output, ground_truth)
  File "xxx/amazon-kdd-cup-2024-starter-kit/local_evaluation.py", line 183, in <lambda>
    "jp-bleu": lambda generated_text, reference_text: metrics.calculate_bleu_score(
  File "xxx/amazon-kdd-cup-2024-starter-kit/metrics.py", line 254, in calculate_bleu_score
    sacrebleu = evaluate.load("sacrebleu")
  File "/home/xxx/.local/lib/python3.10/site-packages/evaluate/loading.py", line 751, in load
    evaluation_cls = import_main_class(evaluation_module.module_path)
  File "/home/xxx/.local/lib/python3.10/site-packages/evaluate/loading.py", line 76, in import_main_class
    module = importlib.import_module(module_path)
  File "/opt/miniconda3/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 879, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1017, in get_code
  File "<frozen importlib._bootstrap_external>", line 947, in source_to_code
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/xxx/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--sacrebleu/009c8b5313309ea5b135d526433d5ee76508ba1554cbe88310a30f85bb57ec88/sacrebleu.py", line 16
    }
SyntaxError: closing parenthesis '}' does not match opening parenthesis '(' on line 14

It looks like sacrebleu.py has some problem. Maybe my evaluate (0.4.2) and sacrebleu (2.14) versions conflict? I don't know.
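
The file that raises the SyntaxError is the cached metric script that evaluate downloads (under ~/.cache/huggingface/modules/evaluate_modules), not the installed sacrebleu package itself, so one thing worth trying is clearing that cache so the script is re-downloaded. A minimal sketch, assuming the cache path shown in the traceback above:

import shutil
from pathlib import Path

import evaluate
import sacrebleu

# Check which versions are actually installed
print(sacrebleu.__version__, evaluate.__version__)

# Remove the cached copy of the sacrebleu metric script shown in the traceback
cache_dir = Path.home() / ".cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--sacrebleu"
shutil.rmtree(cache_dir, ignore_errors=True)

# evaluate.load should now fetch a fresh copy of the metric script
metric = evaluate.load("sacrebleu")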

Some code from that URL:

# Relevant imports and module-level state from metrics.py
import evaluate

sacrebleu = None

def calculate_bleu_score(generated_text: str, reference_text: str, is_japanese: bool = False) -> float:
    """
    Calculates the BLEU score for a generated text compared to a reference truth text. This function supports
    both general text and Japanese-specific evaluation by using the sacrebleu library.

    Parameters:
    - generated_text (str): The generated text to be evaluated.
    - reference_text (str): The reference truth text.
    - is_japanese (bool, optional): Flag to indicate whether the text is in Japanese, requiring special tokenization.

    Returns:
    - float: The BLEU score for the generated text against the reference truth, scaled to the 0 to 1 range.
    """
    global sacrebleu
    if sacrebleu is None:
        sacrebleu = evaluate.load("sacrebleu")

    # Preprocess input texts
    generated_text = generated_text.lstrip("\n").rstrip("\n").split("\n")[0]
    candidate = [generated_text]
    reference = [[reference_text]]

    # Compute BLEU score with or without Japanese-specific tokenization
    bleu_args = {"predictions": candidate, "references": reference, "lowercase": True}
    if is_japanese:
        bleu_args["tokenize"] = "ja-mecab"
    score = sacrebleu.compute(**bleu_args)["score"] / 100

    return score
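
For comparison, a rough equivalent using the sacrebleu Python API directly (a sketch, assuming sacrebleu 2.x; the function name calculate_bleu_score_direct is hypothetical, and the "ja-mecab" tokenizer needs the optional Japanese dependencies, e.g. pip install sacrebleu[ja]):

from sacrebleu.metrics import BLEU

def calculate_bleu_score_direct(generated_text: str, reference_text: str, is_japanese: bool = False) -> float:
    # Same preprocessing as above: keep only the first line of the generated text
    generated_text = generated_text.strip("\n").split("\n")[0]
    # tokenize=None falls back to sacrebleu's default tokenizer
    bleu = BLEU(lowercase=True, tokenize="ja-mecab" if is_japanese else None)
    result = bleu.corpus_score([generated_text], [[reference_text]])
    return result.score / 100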
martinpopel commented 3 months ago

It seems you are using amazon-kdd-cup-2024-starter-kit, which uses HuggingFace evaluate, which in turn uses sacrebleu. If you want to report a bug in this sacrebleu repository, you should show a replicable minimal test case using the sacrebleu API directly (i.e. not via amazon-kdd-cup-2024-starter-kit and evaluate). Otherwise, you should report the bug in one of the above-mentioned frameworks.

That said, I was not able to replicate this bug. Everything seems to work:

!pip install sacrebleu evaluate
import sacrebleu, evaluate
print(sacrebleu.__version__) # 2.4.2
print(evaluate.__version__) # 0.4.2
evaluate_sacrebleu = evaluate.load("sacrebleu")
result = evaluate_sacrebleu.compute(predictions=["John loves Mary."], references=[["John loves HugginFace."]])
print(result["score"]) # 35.35533905932737

I am closing this issue, but feel free to reopen if you identify any bugs in sacrebleu.