lavague-ai / LaVague

Large Action Model framework to develop AI Web Agents
https://docs.lavague.ai/en/latest/
Apache License 2.0

Clean datasets for evaluation on The Wave and WebLinx #385

Closed dhuynh95 closed 5 days ago

dhuynh95 commented 6 days ago

@HiImMadness: The dataset we use for eval, The Wave 250, is broken.


Please always ensure datasets are working.
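As a quick sanity check, something like the minimal sketch below would catch a broken upload before anyone depends on it (the path is the one from the snippet further down; assumes pandas with Hugging Face filesystem support installed):

import pandas as pd

# Try to load the eval split; a broken upload fails here instead of mid-eval
path = "hf://datasets/BigAction/the-wave-250-best-retrieved-nodes/data/test-00000-of-00001.parquet"
df = pd.read_parquet(path)
assert len(df) > 0, "dataset is empty"
print(df.columns.tolist())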

Also, it would be ideal if you uploaded a separate dataset containing the nodes retrieved by our best retriever, along with metadata about that retriever to make things more reproducible. That way we can evaluate an LLM directly without having to rerun the retriever. Obviously, these examples must contain the ground-truth elements so the LLM can actually find the solution.

Something like this would be ideal:

from lavague.core.evaluator import LLMEvaluator
from lavague.contexts.openai import OpenaiContext
from lavague.core.navigation import NavigationEngine
from lavague.drivers.selenium import SeleniumDriver
import pandas as pd

# Load the pre-retrieved test split directly from the Hub
llm_test_df = pd.read_parquet("hf://datasets/BigAction/the-wave-250-best-retrieved-nodes/data/test-00000-of-00001.parquet")

# Navigation engine backed by OpenAI, driving a Selenium browser
openai_engine = NavigationEngine.from_context(OpenaiContext(), SeleniumDriver())

# Evaluate the engine on the retrieved nodes and dump results to CSV
llm_evaluator = LLMEvaluator()
openai_results = llm_evaluator.evaluate(openai_engine, llm_test_df, "openai_results.csv")

Also, optionally, it might be better to provide the full XPath as the canonical way to select the ground-truth element. I sometimes see different selector types; they can work, but they may be ambiguous or inconsistent in some scenarios.
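For illustration, a full XPath is unambiguous because it walks from the document root down to a single node, so checking a predicted element against the ground truth becomes a simple string or node comparison. A minimal Selenium sketch (the URL and XPath below are hypothetical examples):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")

# A full XPath identifies exactly one node, unlike e.g. class-based selectors
full_xpath = "/html/body/div[1]/form/input[2]"  # hypothetical ground truth
ground_truth = driver.find_element(By.XPATH, full_xpath)
print(ground_truth.tag_name)

driver.quit()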

Todo:

lyie28 commented 5 days ago

Datasets are now uploaded in classic parquet format:

- Raw dataset: https://huggingface.co/datasets/BigAction/the-meta-wave-raw
- Pre-processed dataset for retriever evaluation: https://huggingface.co/datasets/BigAction/the-meta-wave-rewritten
- Retrieved dataset for LLM evaluation: https://huggingface.co/datasets/BigAction/the-meta-wave-retrieved
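For a quick smoke test, all three can be pulled with the Hugging Face datasets library (a minimal sketch; split names and row counts are whatever the Hub reports for each repo):

from datasets import load_dataset

# Load each dataset from the Hub and print its splits and sizes
raw = load_dataset("BigAction/the-meta-wave-raw")
retriever_eval = load_dataset("BigAction/the-meta-wave-rewritten")
llm_eval = load_dataset("BigAction/the-meta-wave-retrieved")

for name, ds in [("raw", raw), ("rewritten", retriever_eval), ("retrieved", llm_eval)]:
    print(name, {split: len(ds[split]) for split in ds})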