Specifically, what is the difference between:
The annotation where source_ds is DROP follows the same format as the DROP dataset (https://huggingface.co/datasets/ucinlp/drop/viewer/default/train?p=1&row=173). For some of the questions, there are multiple answers. For example, in the screenshot you posted, it would be interpreted as Maria Sharapova, Venus Williams, and Svetlana having lost the Australian Open matches to Petrova.
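For reference, here is a minimal sketch of how such multi-answer DROP records can be inspected with the Hugging Face datasets library. It assumes the `answers_spans` field layout shown in the linked viewer and is purely illustrative, not part of the HaluBench pipeline:

```python
from datasets import load_dataset

# Load the DROP training split from the Hugging Face Hub.
drop = load_dataset("ucinlp/drop", split="train")

# Print the first question whose gold annotation lists more than one answer span,
# e.g. several player names for a "who lost their matches to ..." style question.
for example in drop:
    spans = example["answers_spans"]["spans"]
    if len(spans) > 1:
        print(example["question"])
        print(spans)
        break
```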
"For some of the questions, there are multiple answers."

That's exactly my question. How can a question such as "How many yards did Donovan McNabb accumulate on passing touchdowns?" have multiple acceptable answers (in this case 15 and 81)?

And did you read the passage field of my screenshot? It has nothing to do with the question, the answer, tennis, or Petrova. Even if it were related, the label is FAIL. How can it be interpreted as:

"it would be interpreted as Maria Sharapova, Venus Williams, and Svetlana having lost the Australian Open matches to Petrova"?
The PASS samples are taken directly from the DROP dataset (https://huggingface.co/datasets/ucinlp/drop). Here is a screenshot of the original sample from DROP:
It looks like there is an error with the original dataset sample.
The FAIL samples are constructed by perturbing the answer given in the original sample. For example, in the tennis examples, this is the original sample:
There seems to be an error in the original dataset where they have included an incorrect passage. Just want to point out that for the FAIL case, even if the passage is irrelevant, it is a valid case of hallucination as the answer here is not supported by the context.
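For concreteness, here is a minimal sketch of the kind of answer perturbation described above, reusing the McNabb question from this thread purely as a stand-in. The helper and the numeric offsets are hypothetical; the actual HaluBench construction procedure is not shown here.

```python
import random

def perturb_numeric_answer(answer: str) -> str:
    """Hypothetical perturbation: shift a numeric answer so the resulting
    (passage, question, answer) triple is no longer supported by the passage
    and can therefore be labeled FAIL."""
    if answer.strip().isdigit():
        return str(int(answer) + random.choice([-2, -1, 1, 2]))
    return answer  # non-numeric answers would need a different perturbation strategy

original = {
    "question": "How many yards did Donovan McNabb accumulate on passing touchdowns?",
    "answer": "15",
    "label": "PASS",
}
fail_sample = {**original,
               "answer": perturb_numeric_answer(original["answer"]),
               "label": "FAIL"}
print(fail_sample)
```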
We did carry out human annotation on a sample of HaluBench and saw high human agreement scores; however, there might still be some errors that are propagated from our source datasets.
OK, thanks for the clarification.

Still, it is YOUR responsibility to perform sound and reliable benchmarking, as you are claiming to be "State-of-the-Art" (directly quoting all your marketing materials). This is especially important when your company operates in the reliability and trust space. Are you planning to say "the original dataset was flawed" when your customers realize the reality?

Also, this is a joke of an excuse:
Just want to point out that for the FAIL case, even if the passage is irrelevant, it is a valid case of hallucination as the answer here is not supported by the context.
Anyone can synthetically create random passages about some topic and random answers about some other topic and get perfect FAIL accuracy very easily with that logic.
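To make that point concrete, a toy sketch (all passages and answers below are invented for illustration):

```python
import random

# If "the answer is not supported by the context" is the only criterion,
# pairing unrelated passages with unrelated answers trivially yields FAIL rows.
passages = [
    "A paragraph about the 2006 Australian Open quarterfinals.",
    "A paragraph about a 2008 NFL regular-season game.",
]
unrelated_answers = ["photosynthesis", "the Treaty of Westphalia"]

synthetic_fail_rows = [
    {"passage": passage, "answer": random.choice(unrelated_answers), "label": "FAIL"}
    for passage in passages
]
print(synthetic_fail_rows)
```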
Hi @ogencoglu,
First of all, thank you for pointing out this error from DROP! We appreciate your feedback and will take this into consideration in future projects.
We care strongly about benchmarking and dataset quality, and we spent considerable time on human annotation in the construction of HaluBench. Unfortunately, few datasets exist for hallucination evaluation. In selecting source datasets to construct novel examples, we chose well-vetted datasets such as DROP, which has been widely cited and used in the research community. That being said, even well-vetted datasets can have issues, as evidenced by your finding here.
When creating novel datasets, we conduct internal audits to ensure they meet our bar for quality. We did not audit DROP because auditing and fixing existing datasets was out of scope for this project, which was intended as a research and open-source contribution. There are researchers doing important work in the field of dataset audits; for example, you may find this study interesting: https://aclanthology.org/2021.acl-long.81/. We encourage more work in this direction.
Apologies for any confusion or inconvenience this may have caused. If you have other feedback, you are welcome to email me at rebecca@patronus.ai :)
Can you clarify the annotation process and format of HaluBench? It is full of inconsistencies and errors such as: