mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

GPT-J: evaluation.py is not deterministic #1385

Open szutenberg opened 1 year ago

szutenberg commented 1 year ago

We found that evaluation.py is not deterministic.

I narrowed it down to a small and fast reproducer using 100 examples that are already decoded.

Reproducer code:

import numpy as np
import json
import nltk
import evaluate

def postprocess_text(preds, targets):
    preds = [pred.strip() for pred in preds]
    targets = [target.strip() for target in targets]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    targets = ["\n".join(nltk.sent_tokenize(target)) for target in targets]

    return preds, targets

def main():
    metric = evaluate.load("rouge")
    nltk.download('punkt')

    with open('target_required.txt', 'r') as f:
        target_required = json.load(f)

    with open('preds_decoded_text.txt', 'r') as f:
        preds_decoded_text = json.load(f)

    preds, targets = postprocess_text(preds_decoded_text, target_required)

    result = metric.compute(predictions=preds, references=targets, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [len(pred) for pred in preds]
    result["gen_len"] = np.sum(prediction_lens)
    result["gen_num"] = len(preds)
    print("\nResults\n")
    print(result)

if __name__ == "__main__":
    main()

Results from 8 runs:

{'rouge1': 36.1576, 'rouge2': 15.144, 'rougeL': 27.6215, 'rougeLsum': 33.5262, 'gen_len': 21279, 'gen_num': 100}
{'rouge1': 36.1917, 'rouge2': 15.0866, 'rougeL': 27.5899, 'rougeLsum': 33.5717, 'gen_len': 21279, 'gen_num': 100}
{'rouge1': 36.1146, 'rouge2': 15.0713, 'rougeL': 27.533, 'rougeLsum': 33.5817, 'gen_len': 21279, 'gen_num': 100}
{'rouge1': 36.1648, 'rouge2': 15.2326, 'rougeL': 27.5165, 'rougeLsum': 33.5121, 'gen_len': 21279, 'gen_num': 100}
{'rouge1': 36.1399, 'rouge2': 15.1459, 'rougeL': 27.5729, 'rougeLsum': 33.6107, 'gen_len': 21279, 'gen_num': 100}
{'rouge1': 36.1275, 'rouge2': 15.1191, 'rougeL': 27.5854, 'rougeLsum': 33.5567, 'gen_len': 21279, 'gen_num': 100}
{'rouge1': 36.0872, 'rouge2': 15.0917, 'rougeL': 27.5943, 'rougeLsum': 33.6243, 'gen_len': 21279, 'gen_num': 100}
{'rouge1': 36.0724, 'rouge2': 15.1777, 'rougeL': 27.5256, 'rougeLsum': 33.6094, 'gen_len': 21279, 'gen_num': 100}

The differences are larger than 1% (15.2326 vs. 15.0713 for rouge2), which makes the tool problematic for robust accuracy evaluation.

Required files: preds_decoded_text.txt, target_required.txt

I ran my experiments in a docker ubuntu:latest container to make sure this is not a machine/environment issue. Preparing the environment:

apt-get update
apt-get install python3-pip
pip install -r requirements.txt

Pip freeze: pip_freeze.txt

badhri-intel commented 1 year ago

This issue is caused by some randomness in the ROUGE score code (in the evaluate repo), and I fixed it by setting the numpy random seed in the script. Please take a look here.
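
For reference, a minimal sketch of that workaround applied to the reproducer above; the seed value here is illustrative, not necessarily the one used in the fix:

# inside main(), before the metric call
np.random.seed(0)  # illustrative seed; pins the random sampling inside the ROUGE aggregator
result = metric.compute(predictions=preds, references=targets, use_stemmer=True)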

szutenberg commented 1 year ago

I treat this fix as a workaround: the results are now indeed deterministic, but I feel it's just a way of hiding the problem.

Can you explain where the non-determinism in the ROUGE score calculation comes from? Aren't these scores just averages over all examples? Do you know how they are calculated? Thanks!

badhri-intel commented 1 year ago

Ideally, they should be deterministic, as they are F1 scores over different n-grams. I'm looking at an existing issue in their repo and will update once I test the actual fix.
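
To illustrate why the per-example scores should be deterministic, here is a toy sketch of a ROUGE-1 F1 computation over unigram counts; this is a simplified illustration, not the library's actual implementation (which also applies tokenization and optional stemming):

from collections import Counter

def rouge1_f1(pred, ref):
    # unigram counts for prediction and reference
    p, r = Counter(pred.split()), Counter(ref.split())
    overlap = sum((p & r).values())  # matched unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat", "the cat sat down"))  # 0.857...

Nothing in this computation involves randomness; the run-to-run variation has to come from a later aggregation step.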

badhri-intel commented 1 year ago

I found this issue here that talks about the same problem. They enable the BootstrapAggregator by default in the code which does random sampling to compute confidence intervals which causes run-to-run variation in ROUGE scores. From what they mention in the issue, it can be disabled safely. I've tested it and setting use_aggregator=False produces deterministic results. I've created a PR for the same
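
A minimal sketch of that fix against the reproducer above; note that with use_aggregator=False the evaluate ROUGE metric returns per-example score lists rather than aggregated values (at least in the versions I'm assuming here), so they need to be averaged explicitly:

# inside main(), replacing the original compute/round lines
result = metric.compute(predictions=preds, references=targets,
                        use_stemmer=True, use_aggregator=False)
# each value is now a list of per-example scores instead of a single aggregate
result = {k: round(np.mean(v) * 100, 4) for k, v in result.items()}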