princeton-nlp / HELMET

The HELMET Benchmark
https://arxiv.org/abs/2410.02694
MIT License

ALCE citation evaluation #7

Open carriex opened 3 days ago

carriex commented 3 days ago

Thanks for the great work! I am looking at running the ALCE evaluation and noticed that the script loads an NLI model from a local path:

AUTOAIS_MODEL="/scratch/gpfs/hyen/models/t5_xxl_true_nli_mixture"

Is this the same model as google/t5_xxl_true_nli_mixture on Hugging Face?

I tried running the script with the Hugging Face model but got different citation precision / recall numbers for Meta-Llama-3.1-8B-Instruct than those reported in the spreadsheet.

Thanks!

howard-yen commented 3 days ago

Hi, thank you for your interest in our work!

You are correct: the NLI model should be google/t5_xxl_true_nli_mixture. I will update the evaluation script accordingly; thanks for catching this mistake.
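For reference, a minimal sketch of loading that Hugging Face checkpoint for the entailment check (this follows the prompt/output convention described on the model card, not necessarily how HELMET wires it internally; the xxl model is ~11B parameters, so bf16 and a large GPU are assumed):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/t5_xxl_true_nli_mixture"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def entails(premise: str, hypothesis: str) -> bool:
    # The TRUE NLI mixture model takes "premise: ... hypothesis: ..."
    # and generates "1" for entailment, "0" otherwise.
    prompt = f"premise: {premise} hypothesis: {hypothesis}"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=10)
    return tokenizer.decode(out[0], skip_special_tokens=True).strip() == "1"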

Could you share the results that you got and the arguments that you used to run the evaluation?

carriex commented 2 days ago

Thanks for getting back to me! I realized that I was not looking at the right context length when comparing my results against those in the spreadsheet. Though it does seem like there are some differences in citation_rec / citation_prec for Llama-3.1-8B-Instruct on ALCE at 128k context length.

For reference, I ran the test using this config but with Llama-3.1-8B-Instruct, and got the results below for ALCE:

{ "length": 175.03, "str_em": 15.283333333333331, "str_hit": 4.0, "rougeLsum": 19.226019459757403, "citation_rec": 0.08695652173913043, "citation_prec": 0.10526315789473684, "citation_positions": { "0": 2, "1": 2, "2": 1, "3": 1, "4": 1, "5": 1, "6": 1, "7": 1, "8": 1, "9": 1, "10": 1, "11": 1, "12": 1, "13": 1, "14": 1, "15": 1, "16": 1, "17": 1, "18": 1 } }

howard-yen commented 1 day ago

Are these the results for ASQA? They appear to be rather close to the results that we report in the paper. In general, a 1-2 point difference in absolute scores is reasonable given the nondeterministic nature of Flash Attention. However, it would be good to double-check that the arguments used are exactly the same; this is what the args in my output file look like:

{
  "config": "configs/alce.yaml",
  "tag": "v12",
  "model_name_or_path": "/scratch/gpfs/hyen/models/Meta-Llama-3.1-8B-Instruct",
  "use_vllm": false,
  "datasets": "alce_asqa_700",
  "demo_files": "prompts/asqa_revised.json",
  "test_files": "data/alce/asqa_eval_gtr_top2000.json",
  "output_dir": "output/Meta-Llama-3.1-8B-Instruct",
  "overwrite": false,
  "max_test_samples": 100,
  "num_workers": 4,
  "num_depths": 10,
  "shots": 2,
  "input_max_length": 131072,
  "do_sample": false,
  "generation_max_length": 300,
  "generation_min_length": 0,
  "temperature": 1.0,
  "top_p": 1.0,
  "stop_newline": false,
  "seed": 42,
  "no_cuda": false,
  "no_bf16": false,
  "no_torch_compile": false,
  "use_chat_template": true,
  "rope_theta": null,
  "debug": false,
  "count_tokens": false,
  "stop_new_line": false
}
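
If it helps for reproducing these numbers, the args above roughly correspond to an invocation along these lines (a sketch only; the eval.py entry point and flag names are inferred from the args dump and the repo's config-driven setup, so please check them against the README):

python eval.py \
    --config configs/alce.yaml \
    --model_name_or_path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --output_dir output/Meta-Llama-3.1-8B-Instruct \
    --input_max_length 131072 \
    --shots 2 \
    --seed 42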

For reference, our score file looks like this:

{
  "length": 167.54,
  "str_em": 16.95,
  "str_hit": 5.0,
  "rougeLsum": 19.499888451263413,
  "mauve": 28.60813027025195,
  "citation_rec": 0.0,
  "citation_prec": 0.0,
  "citation_positions": {
    "1": 1,
    "0": 1
  }
}