hkust-nlp / AgentBoard

An Analytical Evaluation Board of Multi-turn LLM Agents

Fix jericho grounding accuracy #11

Open OliBomby opened 3 months ago

OliBomby commented 3 months ago

The Jericho environment does not detect invalid actions correctly for any game except 905, so the reported grounding accuracy is much higher than the actual value.

The current check looks for `[!` to detect verb errors, but every game reports verb errors differently. For example, acorncourt says `That's not a verb I recognise.`, library says `I didn't understand that sentence.`, zork1 says `I don't know the word X`, loose says `"I haven't the slightest idea what you're referring to," murmurs the egg.` or `"Huh?" The wolf raises an eyebrow.`, and snacktime says `That's not a trick you know.` Partially valid verbs get the response `I only understood you as far as wanting to X.`, and empty input gets `I beg your pardon?`
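To illustrate the idea, here is a minimal sketch of matching game output against a list of known error messages rather than a single marker. It is not the code in this PR: the names `INVALID_ACTION_PATTERNS`, `is_invalid_action`, and `grounding_accuracy` are hypothetical, the pattern list contains only the messages quoted above, and it assumes grounding accuracy is the fraction of actions the game accepts as valid.

```python
import re

# Hypothetical pattern list built only from the messages quoted above.
# A real fix may need more patterns, or per-game lists.
INVALID_ACTION_PATTERNS = [
    r"\[!",                                        # marker the current check looks for
    r"That's not a verb I recognise",              # acorncourt
    r"I didn't understand that sentence",          # library
    r"I don't know the word",                      # zork1
    r"I haven't the slightest idea what you're referring to",  # loose
    r'"Huh\?"',                                    # loose
    r"That's not a trick you know",                # snacktime
    r"I only understood you as far as wanting to", # partially valid verb
    r"I beg your pardon",                          # empty input
]

# One combined, case-insensitive regex so each observation is scanned once.
_INVALID_RE = re.compile("|".join(INVALID_ACTION_PATTERNS), re.IGNORECASE)


def is_invalid_action(observation: str) -> bool:
    """Return True if the game's response looks like a parser error."""
    return _INVALID_RE.search(observation) is not None


def grounding_accuracy(observations) -> float:
    """Fraction of responses that are not parser errors (assumed definition)."""
    observations = list(observations)
    if not observations:
        return 0.0
    valid = sum(1 for obs in observations if not is_invalid_action(obs))
    return valid / len(observations)
```

A shared list like this is simple but can misfire if a game prints one of these phrases as ordinary narration; keying the patterns by game name would be stricter.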

This PR fixes the issue and reports the actual grounding accuracy in Jericho.