Adding a Visual Navigation Engine using Set of Marks to possible Navigation Engine

dhuynh95 commented 1 month ago

Recent discussion with the community highlighted the interest in having a Navigation Engine able to choose elements from visual inputs.

As a reminder from our docs, this is how we proceed today:

The user's global objective is handled by the World Model. It considers this objective along with the state of the webpage, through screenshots and HTML code, and generates the next step (a text instruction) needed to achieve this objective.
This instruction is sent to the ActionEngine, which then generates the automation code needed to perform this step and executes it.
The World Model then receives new text and image data (a new screenshot and the updated source code) reflecting the updated state of the web page. With this information, it can generate the next instruction needed to achieve the objective.
This process repeats until the objective is achieved.
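
For readers new to the project, the loop above can be sketched roughly as follows. This is only an illustration of the control flow: `Observation`, `AgentLoop` and the four callables are hypothetical names made up for this sketch, not LaVague classes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Observation:
    """Hypothetical container for the page state (screenshot + HTML)."""
    screenshot: bytes
    html: str


@dataclass
class AgentLoop:
    # All four callables are placeholders for illustration, not LaVague APIs.
    observe: Callable[[], Observation]                    # capture screenshot + HTML
    next_instruction: Callable[[str, Observation], str]   # role of the World Model
    execute: Callable[[str], None]                        # role of the Action Engine
    is_done: Callable[[str, Observation], bool]           # objective reached?
    history: List[str] = field(default_factory=list)

    def run(self, objective: str, max_steps: int = 10) -> List[str]:
        """Alternate between planning and acting until done or out of steps."""
        for _ in range(max_steps):
            obs = self.observe()
            if self.is_done(objective, obs):
                break
            instruction = self.next_instruction(objective, obs)
            self.execute(instruction)
            self.history.append(instruction)
        return self.history
```

The `max_steps` cap is an assumption added here so the sketch always terminates.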

Today, the Action Engine has 3 parts:

🚄 Navigation Engine: Generates and executes Selenium code to perform an action on a web page
🐍 Python Engine: Generates and executes code for tasks that do not involve navigating or interacting with a web page, such as extracting information
🕹️ Navigation Control: Performs frequently required navigation tasks without needing to make any extra LLM calls. So far we cover: scroll up, scroll down & wait

The current Navigation Engine, which does the bulk of the work, uses RAG on the current DOM to generate the next action.

Following the proposal, we could split this in two:

DOM Navigation Engine: the current one
Visual Navigation Engine: the proposed one, which uses Set of Marks: we highlight the interactive elements and ask an MLLM to choose the right one

@adeprez: this seems fairly straightforward. We already have code to get the interactive elements. We should just highlight all of them, take a screenshot, and then ask an MLLM to output the ID.

Can you share more details so that others can take this one on?

To do:

[ ] Provide code to get a screenshot of all interactive elements with ID
[ ] Ask MLLM to choose which one is the best
[ ] Map the ID to an action

adeprez commented 1 month ago

Currently, the drivers offer a get_possible_interactions method that returns a dictionary. In this dictionary, the keys are xpaths, and the values are lists of interaction names (CLICK / TYPE / HOVER). We can use this feature to map each xpath to a unique ID. This ID can then be added to the output image using the element's bounding box.

While the NavigationEngine currently uses xpaths, the new VisualNavigationEngine implementation will use IDs instead.
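
A minimal sketch of that mapping, assuming only the dictionary shape described above (xpath keys, lists of interaction names as values). `MarkedElement` and `assign_mark_ids` are hypothetical names, and the sample xpaths are made up for illustration.

```python
from typing import Dict, List, NamedTuple


class MarkedElement(NamedTuple):
    xpath: str
    interactions: List[str]


def assign_mark_ids(
    possible_interactions: Dict[str, List[str]],
) -> Dict[int, MarkedElement]:
    """Give each xpath an integer ID, following the dictionary order."""
    return {
        mark_id: MarkedElement(xpath, list(interactions))
        for mark_id, (xpath, interactions) in enumerate(possible_interactions.items())
    }


# Example shaped like the described output of get_possible_interactions()
# (these xpaths are invented for the example).
sample = {
    "/html/body/header/button[1]": ["CLICK", "HOVER"],
    "/html/body/main/form/input[1]": ["CLICK", "TYPE"],
}
marks = assign_mark_ids(sample)
chosen = marks[1]  # e.g. the MLLM answered "1"
print(chosen.xpath, chosen.interactions)
```

Keeping the xpath next to the ID means the visual engine can still fall back on the existing xpath-based execution once an ID has been chosen.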

dhuynh95 commented 1 month ago

I ran this code on https://orcid.org/0000-0001-6102-7846, and it highlighted the interactive elements on screen:

from selenium.webdriver.common.by import By

from lavague.core.base_driver import BaseDriver

# `driver` is assumed to be an already-initialized LaVague Selenium driver.

def highlight_element(element, driver: BaseDriver):
    # Draw a red border around the element (this overwrites any inline style)
    driver.execute_script(
        "arguments[0].setAttribute('style', arguments[1]);",
        element,
        "border: 2px solid red;",
    )

# The keys are the xpaths of the interactive elements
xpaths = driver.get_possible_interactions().keys()

for xpath in xpaths:
    element = driver.driver.find_element(By.XPATH, xpath)
    highlight_element(element, driver)

Two issues I noticed though:

[ ] Some xpaths do not allow for selection
[ ] It can take a while to highlight every element.

On the second point, I do not know whether my code can be optimized easily. In any case, if we do Set of Marks, I guess we should keep only the elements that are currently visible, since it is pointless to highlight elements that will not appear in the screenshot.
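
A rough sketch of such a visibility filter, written as pure Python so it does not depend on the driver. It assumes we can obtain each element's bounding box in viewport coordinates (for instance from `getBoundingClientRect()` in the browser); `Rect`, `is_in_viewport` and `filter_visible` are hypothetical names.

```python
from typing import Dict, Iterable, List, Tuple

# Bounding box in viewport coordinates, with keys: left, top, right, bottom
Rect = Dict[str, float]


def is_in_viewport(rect: Rect, viewport_width: float, viewport_height: float) -> bool:
    """True if the box has a non-zero area and overlaps the viewport."""
    has_area = rect["right"] > rect["left"] and rect["bottom"] > rect["top"]
    overlaps_x = rect["right"] > 0 and rect["left"] < viewport_width
    overlaps_y = rect["bottom"] > 0 and rect["top"] < viewport_height
    return has_area and overlaps_x and overlaps_y


def filter_visible(
    candidates: Iterable[Tuple[str, Rect]],
    viewport_width: float,
    viewport_height: float,
) -> List[str]:
    """Keep the xpaths whose bounding box is at least partly on screen."""
    return [
        xpath
        for xpath, rect in candidates
        if is_in_viewport(rect, viewport_width, viewport_height)
    ]
```

This only checks geometry. Elements hidden through CSS or covered by another element would need an extra check, which is not covered here.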

dhuynh95 commented 1 month ago

Update: Added numbering of elements:

from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

from lavague.core.base_driver import BaseDriver

def highlight_element(element, driver: BaseDriver):
    driver.execute_script(
        "arguments[0].setAttribute('style', arguments[1]);",
        element,
        "border: 2px solid red;",
    )

xpaths = driver.get_possible_interactions().keys()

# Elements that were found and highlighted.
# The index in this list is the ID displayed on screen.
elements = []

for xpath in xpaths:
    try:
        element = driver.driver.find_element(By.XPATH, xpath)
    except WebDriverException:
        # Some xpaths cannot be selected; skip them
        continue
    elements.append(element)
    highlight_element(element, driver)

def add_id_overlays(driver, elements):
    js_script = """
    function addIdOverlay(element, id) {
        const rect = element.getBoundingClientRect();
        const overlay = document.createElement('div');
        overlay.textContent = id;
        overlay.style.position = 'absolute';
        overlay.style.backgroundColor = 'rgba(255, 0, 0, 0.7)';
        overlay.style.color = 'white';
        overlay.style.padding = '2px 5px';
        overlay.style.borderRadius = '3px';
        overlay.style.fontSize = '12px';
        overlay.style.zIndex = '10000';
        overlay.style.pointerEvents = 'none';  // Ensure it doesn't interfere with clicks

        // Position the overlay at the top-left corner, outside the element
        let left = rect.left - 25;
        let top = rect.top - 25;

        // Adjust position if too close to the left edge
        if (rect.left < 30) {
            left = rect.right;
        }

        // Adjust position if too close to the top edge
        if (rect.top < 30) {
            top = rect.bottom;
        }

        // getBoundingClientRect() is viewport-relative, so add the scroll
        // offsets to place the absolutely positioned overlay on the page
        overlay.style.left = (left + window.scrollX) + 'px';
        overlay.style.top = (top + window.scrollY) + 'px';

        document.body.appendChild(overlay);
    }

    const elements = arguments[0];
    for (let i = 0; i < elements.length; i++) {
        addIdOverlay(elements[i], i);
    }
    """
    driver.execute_script(js_script, elements)

add_id_overlays(driver, elements)

which gave

I tried it on my ChatGPT account with the following prompt:

You are an AI assistant tasked with helping navigate web interfaces.
You are provided with screenshots with different elements already highlighted and their corresponding ID. 
Your goal is to output the ID of the right element to interact with.

Here are previous examples:
Instruction: Click on Sign In / Register

Thoughts:

The interface displays multiple elements highlighted with IDs.
I identify the "Sign in / Register" button located at the top right.
The button is highlighted with the ID "37".
This is the element the user likely wants to interact with.
Output: 37

Instruction: Show more details for the first article

Thoughts:

The interface shows two articles listed under "Works".
I identify the "Show more detail" link for the first article.
This link is highlighted with the ID "74".
This is the element to interact with according to the instruction.
Output: 74

Instruction: Collapse all sections

Thoughts:

The interface includes a "Collapse all" button.
I locate the "Collapse all" button at the top right of the Activities section.
The button is highlighted with the ID "98".
This is the element to interact with for collapsing all sections.
Output: 98

Instruction: Click on the "Works (12)" section to expand it

Thoughts:

The "Works (12)" section is displayed under the Activities header.
I see the section is highlighted with an ID and includes an expand/collapse arrow.
The section is highlighted with the ID "69".
This is the element to interact with to expand the section.
Output: 69

Instruction: Click on Show record summary

Thoughts:

The interface includes a "Show record summary" link.
I locate the link on the right side below the profile information.
The link is highlighted with the ID "42".
This is the element to interact with according to the instruction.
Output: 42

Instruction: Click on the English dropdown to change language

Thoughts:

The interface shows a language dropdown labeled "English".
I find the dropdown at the top right next to the "Sign in / Register" button.
The dropdown is highlighted with the ID "35".
This is the element to interact with for changing the language.
Output: 35

Here is the current example to classify:

Instruction: Click on https://doi.org/10.1007/s10722-024-01921-8

Thoughts:

And it kind of worked:
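
For the last to-do item (mapping the ID to an action), the reply still has to be parsed. Here is a minimal sketch, assuming the model follows the `Output: <ID>` format used in the prompt above. `parse_mark_id` is a hypothetical helper, and the replies in the usage lines are made up rather than real model output.

```python
import re
from typing import Container, Optional

# Matches lines such as `Output: 37` or `Output: "37"`
_OUTPUT_RE = re.compile(r"Output:\s*\"?(\d+)\"?", re.IGNORECASE)


def parse_mark_id(reply: str, valid_ids: Container[int]) -> Optional[int]:
    """Return the ID from the last `Output:` line, or None if missing or unknown."""
    matches = _OUTPUT_RE.findall(reply)
    if not matches:
        return None
    # Use the last match in case the model echoes the few-shot examples
    mark_id = int(matches[-1])
    return mark_id if mark_id in valid_ids else None


# Usage with invented replies, where IDs 0..19 were drawn on the screenshot
print(parse_mark_id('The link is highlighted with the ID "12".\nOutput: 12', range(20)))
print(parse_mark_id("I could not find the element.", range(20)))
```

Returning `None` for a missing or unknown ID leaves room for the caller to retry or fall back on the DOM Navigation Engine.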

lavague-ai / LaVague

Adding a Visual Navigation Engine using Set of Marks to possible Navigation Engine #436