lavague-ai / LaVague

Large Action Model framework to develop AI Web Agents
https://docs.lavague.ai/en/latest/
Apache License 2.0

Clean datasets for evaluation on The Wave and WebLinx #385

Closed dhuynh95 closed 5 days ago

dhuynh95 commented 6 days ago

@HiImMadness: The dataset we use for eval, The Wave 250, is broken.


Please always ensure datasets are working.
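As a quick sanity check, something like the minimal sketch below would catch a broken upload before anyone depends on it (the path is the one from the snippet further down; assumes pandas with Hugging Face filesystem support installed):

import pandas as pd

# Try to load the eval split; a broken upload fails here instead of mid-eval
path = "hf://datasets/BigAction/the-wave-250-best-retrieved-nodes/data/test-00000-of-00001.parquet"
df = pd.read_parquet(path)
assert len(df) > 0, "dataset is empty"
print(df.columns.tolist())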

Also, it would be ideal if you uploaded a separate dataset containing the nodes retrieved by our best retriever, along with metadata about that retriever to make things more reproducible. That way we can evaluate an LLM directly without having to rerun the retriever. Obviously, these examples must contain the ground-truth elements so the LLM can actually find the solution.

Something like this would be ideal:

from lavague.core.evaluator import LLMEvaluator
from lavague.contexts.openai import OpenaiContext
from lavague.core.navigation import NavigationEngine
from lavague.drivers.selenium import SeleniumDriver
import pandas as pd

# Load the pre-retrieved test split directly from the Hub
llm_test_df = pd.read_parquet("hf://datasets/BigAction/the-wave-250-best-retrieved-nodes/data/test-00000-of-00001.parquet")

# Navigation engine backed by OpenAI, driving a Selenium browser
openai_engine = NavigationEngine.from_context(OpenaiContext(), SeleniumDriver())

# Evaluate the engine on the retrieved nodes and dump results to CSV
llm_evaluator = LLMEvaluator()
openai_results = llm_evaluator.evaluate(openai_engine, llm_test_df, "openai_results.csv")

Also, optionally, it might be better to provide the full XPath as the canonical way to select the ground-truth element. I sometimes see different selector types; they can work, but they may be ambiguous or inconsistent in some scenarios.
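For illustration, a full XPath is unambiguous because it walks from the document root down to a single node, so checking a predicted element against the ground truth becomes a simple string or node comparison. A minimal Selenium sketch (the URL and XPath below are hypothetical examples):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")

# A full XPath identifies exactly one node, unlike e.g. class-based selectors
full_xpath = "/html/body/div[1]/form/input[2]"  # hypothetical ground truth
ground_truth = driver.find_element(By.XPATH, full_xpath)
print(ground_truth.tag_name)

driver.quit()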

Todo:

lyie28 commented 5 days ago

Datasets are now uploaded in classic parquet format:

- Raw dataset: https://huggingface.co/datasets/BigAction/the-meta-wave-raw
- Pre-processed dataset for retriever evaluation: https://huggingface.co/datasets/BigAction/the-meta-wave-rewritten
- Retrieved dataset for LLM evaluation: https://huggingface.co/datasets/BigAction/the-meta-wave-retrieved
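For a quick smoke test, all three can be pulled with the Hugging Face datasets library (a minimal sketch; split names and row counts are whatever the Hub reports for each repo):

from datasets import load_dataset

# Load each dataset from the Hub and print its splits and sizes
raw = load_dataset("BigAction/the-meta-wave-raw")
retriever_eval = load_dataset("BigAction/the-meta-wave-rewritten")
llm_eval = load_dataset("BigAction/the-meta-wave-retrieved")

for name, ds in [("raw", raw), ("rewritten", retriever_eval), ("retrieved", llm_eval)]:
    print(name, {split: len(ds[split]) for split in ds})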