vectara / hallucination-leaderboard

Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents
https://vectara.com
Apache License 2.0

Reproducing HF model Summaries #80

Open Noor-Nizar opened 6 days ago

Noor-Nizar commented 6 days ago

I'm trying to reproduce the summaries generated by the HF models, namely Phi-2 and Llama 3.2-1B Instruct, since the results I'm getting when following the described prompt/pipeline are not close to the ones on the leaderboard. Comparing the summaries I generate with the ones in the HF dataset, I found that there's a large difference. One thing these models struggle with, for example, which I don't see in the HF dataset, is sentence repetition.

So my question is: what generation config was used for the Hugging Face models? I'm currently using the text-generation pipeline with do_sample=False (since I found it mentioned in another issue that a temperature of 0 was used). If code can be provided, it would also be helpful to see what gives rise to this variation in results.
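For reference, here is a minimal sketch of what I'm currently doing. It assumes the prompt wording from this repo's README; the max_new_tokens value, dtype, and device placement are my own choices, not confirmed leaderboard settings:

```python
# Minimal sketch: greedy decoding with the HF text-generation pipeline.
# Assumptions: prompt wording taken from the repo README; max_new_tokens,
# dtype, and device placement are my own choices.
import torch
from transformers import pipeline

PROMPT = (
    "You are a chat bot answering questions using data. You must stick to the "
    "answers provided solely by the text in the passage provided. You are asked "
    "the question 'Provide a concise summary of the following passage, covering "
    "the core pieces of information described.' "
)

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def summarize(passage: str) -> str:
    messages = [{"role": "user", "content": PROMPT + passage}]
    # do_sample=False gives greedy (temperature-0-like) decoding.
    out = generator(messages, do_sample=False, max_new_tokens=512)
    # For chat-style input the pipeline returns the whole conversation;
    # the last message is the model's generated summary.
    return out[0]["generated_text"][-1]["content"]
```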

Edit: I also can't reproduce the leaderboard score for Llama 3.2-1B using the generated summaries in the linked HF dataset. This is because:

1 - I don't know what threshold was used to determine whether a response is hallucinated or consistent (edit: I will use top_k = 1); a sketch of how I'm scoring is below this list.

2 - The dataset includes the omitted samples (its length is 1006, not ~850).
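This is how I'm computing the score from the dataset's summaries, as a hedged sketch: HHEM's documented predict() API with a 0.5 consistency cutoff, where the cutoff and the prior filtering to the qualifying samples are my assumptions, not confirmed leaderboard settings:

```python
# Hedged sketch of scoring: HHEM consistency scores + an assumed 0.5 cutoff.
from transformers import AutoModelForSequenceClassification

hhem = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

def hallucination_rate(pairs):
    """pairs: list of (source_article, generated_summary) tuples,
    already filtered down to the qualifying samples."""
    # predict() returns one consistency score in [0, 1] per pair.
    scores = hhem.predict(pairs)
    consistent = sum(1 for s in scores if s >= 0.5)  # 0.5 is an assumption
    return 1.0 - consistent / len(pairs)
```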