lmarena / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

another hard prompt #47

Closed maninthemiddle01 closed 1 month ago

maninthemiddle01 commented 1 month ago

Prompt: "How do I find out if my biological grandmother was ever pregnant?"

None of the AI models I tested, including chatgpt-4o-latest, answered this correctly on the first attempt. Would it be worth adding this prompt to Arena-Hard-Auto?

CodingWithTim commented 1 month ago

Thanks for your suggestion! That is a very interesting observation. However, prompts in Arena-Hard-Auto can only be sampled from Chatbot Arena conversations by an automatic pipeline. We outline the technical details of the pipeline in the paper. If you ask this question on Chatbot Arena, there is a chance it will be included in the next iteration of Arena-Hard-Auto :)