hkust-nlp / AgentBoard

An Analytical Evaluation Board of Multi-turn LLM Agents

wrong data in tool-query #14

Closed JimSalesforce closed 1 month ago

JimSalesforce commented 2 months ago

I found an incorrect test data point in the tool-query data: entry number 53 (id 52).

{"task": "tool-query", "id": 52, "goal": "Which year in the previous 3 years had the most snowfall on December 1st? Please provide the answer in the YYYY format.", "subgoals": ["2014-11-01", "New York", {"results": [{"name": "New York", "latitude": 40.71427, "longitude": -74.00597, "country_code": "US"}, {"name": "York", "latitude": 40.86807, "longitude": -97.592, "country_code": "US"}, {"name": "Clinton", "latitude": 42.55779, "longitude": -88.86511, "country_code": "US"}]}, {"latitude": 40.699997, "longitude": -74.0, "daily_units": {"time": "iso8601", "snowfall_sum": "cm"}, "daily": {"time": ["2011-12-10"], "snowfall_sum": [0.0]}}, {"latitude": 40.699997, "longitude": -74.0, "daily_units": {"time": "iso8601", "snowfall_sum": "cm"}, "daily": {"time": ["2012-12-10"], "snowfall_sum": [0.0]}}, {"latitude": 40.699997, "longitude": -74.0, "daily_units": {"time": "iso8601", "snowfall_sum": "cm"}, "daily": {"time": ["2013-12-10"], "snowfall_sum": [4.41]}}, "2013"], "additional_info": {"answer": "2013", "init_config": {"current_date": "2014-11-01", "current_location": "New York"}, "goal_type": 0, "tool": "weather"}, "difficulty": "hard"}

The goal asks about December 1st. However, the ground-truth subgoals query "2011-12-10", "2012-12-10", and "2013-12-10" (December 10th), which looks like a hallucination produced by an LLM?
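A mismatch like this can be caught mechanically. The following is a minimal sketch, not part of AgentBoard: the helper name `mismatched_dates` is hypothetical, and it assumes that weather responses in `subgoals` are dicts carrying a `daily.time` list of ISO dates, as in the entry above (trimmed here to the relevant fields).

```python
from datetime import date


def mismatched_dates(entry, month, day):
    """Return every ISO date in the entry's weather subgoals that is not on month/day.

    Hypothetical helper: assumes weather responses are dicts with a
    "daily" block containing a "time" list, as in the reported entry.
    """
    bad = []
    for step in entry["subgoals"]:
        # Plain strings and geocoding results have no "daily" block; skip them.
        if isinstance(step, dict) and "daily" in step:
            for stamp in step["daily"].get("time", []):
                d = date.fromisoformat(stamp)
                if (d.month, d.day) != (month, day):
                    bad.append(stamp)
    return bad


# Entry id 52 from the report, reduced to the fields the check reads.
entry = {
    "id": 52,
    "subgoals": [
        "2014-11-01",
        "New York",
        {"daily": {"time": ["2011-12-10"], "snowfall_sum": [0.0]}},
        {"daily": {"time": ["2012-12-10"], "snowfall_sum": [0.0]}},
        {"daily": {"time": ["2013-12-10"], "snowfall_sum": [4.41]}},
        "2013",
    ],
}

# The goal asks about December 1st, so month=12, day=1.
print(mismatched_dates(entry, month=12, day=1))
# → ['2011-12-10', '2012-12-10', '2013-12-10']
```

All three queried dates are flagged, which matches the report: the subgoals look up December 10th rather than December 1st.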

chang-github-00 commented 2 months ago

Thanks for the bug report! We will update the data this week.

yc1999 commented 1 month ago

We have uploaded a new data.tar.gz file. Thanks a lot for making AgentBoard better!