I noticed that on some websites we retrieve elements that are visible but have exactly the same bounding box (i.e. they seem to overlap completely), which can introduce noise for the LLM.
Here is an example without removing duplicates:
Here is an example with unique elements based on bounding boxes:
After removing the duplicated elements, we get almost 10 times fewer tokens for the same highlighted elements!
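To illustrate the idea, here is a minimal sketch of deduplicating elements by bounding box. It is not the code from this PR: the element shape (a dict with `xpath` and `bbox` keys), the `(x, y, width, height)` tuple layout, and the name `dedupe_by_bounding_box` are all assumptions made for the example, and keeping the first element seen for each box is just one possible choice.

```python
from typing import Dict, List


def dedupe_by_bounding_box(elements: List[Dict]) -> List[Dict]:
    """Keep only the first element seen for each distinct bounding box.

    Hypothetical helper: assumes each element is a dict whose "bbox" value
    is an (x, y, width, height) sequence of numbers.
    """
    seen = set()
    unique = []
    for element in elements:
        # Tuples are hashable, so the box itself can serve as the set key.
        key = tuple(element["bbox"])
        if key in seen:
            continue  # Another element already covers exactly this box.
        seen.add(key)
        unique.append(element)
    return unique


# Made-up sample data: the first two elements share the same box.
elements = [
    {"xpath": "/html/body/div", "bbox": (10, 20, 100, 40)},
    {"xpath": "/html/body/div/a", "bbox": (10, 20, 100, 40)},
    {"xpath": "/html/body/button", "bbox": (10, 80, 100, 40)},
]
unique_elements = dedupe_by_bounding_box(elements)
```

With this sample input, the nested `a` element is dropped because its box matches the enclosing `div`, and the `button` is kept. Only exact matches are treated as duplicates here; boxes that merely intersect are left alone.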
@adeprez: note that this PR requires `extract_xpaths_from_html`, which I introduced in #573.