
[Roadmap] Web Browsing #2017

afourney opened this issue 3 months ago

afourney commented 3 months ago

> [!TIP]
> Want to get involved?
>
> We'd love it if you did! Please get in contact with the people assigned to this issue, or leave a comment. See general contributing advice here too.

Background

Web browsing is quickly becoming a table-stakes capability for agentic systems. For several months, AutoGen has offered basic web browsing capabilities via the WebSurferAgent and the browser_utils module. browser_utils provides a simple text-based browsing experience similar to [Lynx](https://en.wikipedia.org/wiki/Lynx_(web_browser)), but converts pages to Markdown rather than plain text. The WebSurferAgent then maps incoming requests to operations in this text-based browser. For example, if one were to ask 'go to AutoGen's GitHub page', the WebSurferAgent would map the request to two function calls: web_search("autogen github"), and visit_page(url_of_first_search_result).

Markdown is convenient because modern HTML is very bloated, and Markdown strips most of that away, while leaving essential semantic information such as hyperlinks, titles, tables, etc. A simplified or restricted subset of HTML would likely have worked as well, but we take advantage of the fact that OpenAI's models are quite comfortable with Markdown.
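To make the conversion step concrete, here is a minimal sketch of the idea (not the actual browser_utils code), assuming the third-party `requests` and `markdownify` packages:

```python
# Minimal sketch only, not AutoGen's browser_utils implementation.
import requests
from markdownify import markdownify as md

# Fetch the raw HTML of a page.
html = requests.get("https://microsoft.github.io/autogen/").text

# Convert the bloated HTML to Markdown. Hyperlinks, headings, and tables
# survive, while most layout markup is stripped away.
markdown = md(html, heading_style="ATX")
print(markdown[:500])
```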

An important design element of the WebSurferAgent is that it simply maps natural language to browser commands, then outputs the text content of the virtual viewport as a message to other agents. In this way, it leaves all planning and interpretation to other agents in the AutoGen stack.

This arrangement is surprisingly powerful and led to our top submission on GAIA, but has some obvious limitations:

To grow AutoGen's web browsing capabilities and overcome the above-mentioned limitations, the following roadmap is proposed:

Roadmap

Enhanced Markdown browsing

Given the general simplicity and utility of the existing Markdown-based solution, and in the spirit of starting a to-do list with tasks already complete, PR #1929 proposed enhancing the Markdown browsing in AutoGen in the following ways:

- [x] Supplement the `requests` library with headless browsers powered by Selenium and/or Playwright. This allows the Markdown browsers to appear as regular web browsers (e.g., in their User-Agents) and to execute JavaScript before pages are converted to Markdown (see the sketch after this list).
- [x] Abstract web search, providing an easy way to replace Bing, and a means of operating without a Bing API key
- [x] Allow the Markdown browsers to access the local file system (providing directory listings, opening documents, etc.)
- [x] Greatly expand file format support (since we're already converting HTML to Markdown, why stop there? We can also convert pptx, docx, xlsx, pdf, etc.)
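For the first item, the following is a hedged sketch (not the code in #1929) of rendering a page in a headless Playwright browser, letting JavaScript run, and then handing the resulting HTML to the same Markdown conversion:

```python
# Sketch only, assuming the `playwright` and `markdownify` packages;
# the actual implementation lives in PR #1929.
from playwright.sync_api import sync_playwright
from markdownify import markdownify as md

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # Present a regular browser User-Agent rather than a bot-like default.
    page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    page.goto("https://microsoft.github.io/autogen/")
    page.wait_for_load_state("networkidle")  # let JavaScript finish rendering
    html = page.content()
    browser.close()

markdown = md(html, heading_style="ATX")
```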

Importantly, #1929 combines ideas and code from numerous other PRs, including #1534, #1733, #1832, #1572, and possibly others. The authors, @vijaykramesh, @signalprime, @INF800, and @gagb, are each credited here.

However, there is more to do on Markdown browsing before we can consider this wrapped up:

- [x] Tests need to be added to #1929
- [x] We need documentation, and the WebSurferAgent notebook needs to be updated
- [ ] All above-mentioned co-contributors are invited to co-author a blog post
- [ ] #1682 needs to be merged so that [read_page_and_answer](https://github.com/microsoft/autogen/blob/0a524834940defce079094591d9ed72539503981/autogen/agentchat/contrib/web_surfer.py#L269) can optimally do Q&A (this should be improved, or perhaps abstracted anyway)
- [ ] When Selenium falls back to `requests` to download files, we should take the User-Agent and cookies from the browser and pass them to `requests` (a sketch of this follows the list). #1733 already does this very nicely, and I would like to integrate that here.
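As a hedged illustration of that last item (a sketch of the idea, not the #1733 implementation), the browser's identity can be copied onto a `requests` session before downloading:

```python
# Sketch only: reuse the Selenium session's User-Agent and cookies when
# falling back to `requests` for file downloads. Assumes a local chromedriver;
# the URLs are placeholders.
import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/")

session = requests.Session()
# Copy the browser's User-Agent so the download looks like the same client.
session.headers["User-Agent"] = driver.execute_script("return navigator.userAgent")
# Copy cookies so any authenticated or consented session carries over.
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie.get("domain"))

response = session.get("https://example.com/report.pdf")
with open("report.pdf", "wb") as f:
    f.write(response.content)
driver.quit()
```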

Vision-based Interactive Browsing

As handy as it is, Markdown-based browsing will only ever get us so far. To address limitations two and three above, we need to take an interactive and multimodal approach similar to WebVoyager. Such systems generally work using Set-of-Mark prompting -- they take a screenshot of the web page, add labels and obvious bounding boxes to each interactive component, and then ask GPT-4V to select elements to interact with via their visual labels. This solves the localization and grounding problem, where vision models have trouble outputting real-world coordinates (e.g., where a mouse should be clicked).
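A minimal Set-of-Mark sketch (an illustration under assumptions, not the proposed implementation) using Playwright and Pillow: screenshot the page, draw numbered boxes over interactive elements, and send the annotated image to a vision model so it can answer with "click element 3" rather than pixel coordinates.

```python
# Sketch only: annotate a screenshot with numbered boxes over interactive
# elements (Set-of-Mark style). Element discovery here is a simple CSS query;
# the roadmap below discusses better options (e.g., the accessibility tree).
from io import BytesIO

from PIL import Image, ImageDraw
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://microsoft.github.io/autogen/")

    interactive = page.query_selector_all("a, button, input, select, textarea")
    screenshot = Image.open(BytesIO(page.screenshot()))
    draw = ImageDraw.Draw(screenshot)

    for label, element in enumerate(interactive):
        box = element.bounding_box()
        if box is None:  # hidden or detached elements have no box
            continue
        draw.rectangle(
            [box["x"], box["y"], box["x"] + box["width"], box["y"] + box["height"]],
            outline="red", width=2,
        )
        draw.text((box["x"] + 2, box["y"] + 2), str(label), fill="red")

    screenshot.save("set_of_mark.png")  # pass this image to GPT-4V
    browser.close()
```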

Here again, I want to acknowledge that @schauppi has already demonstrated an initial replication of the WebVoyager work, which is fantastic. I hope we can work together on this, but ultimately AutoGen likely needs a vision-based web surfing agent as part of its core offering.

Patterned after our existing WebSurferAgent, I propose that any MultimodalWebSurferAgent should adhere to the following design principle:

MultimodalWebSurferAgent should focus only on mapping natural language instructions to low-level browser commands (e.g., scrolling, clicking, visiting a page, etc.) and output both text and a screenshot of the browser viewport. All other planning will be left to other agents in the AutoGen stack.

Importantly, AutoGen is working to support multimodality through the agent stack, and by outputting both screenshots and page text in messages to other agents, we can then use MultimodalWebSurferAgent in many different agent configurations.
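To illustrate that output contract, here is a hedged sketch (names and structure are assumptions, not AutoGen's actual message format) of a single message carrying both the viewport text and a screenshot in the OpenAI-style multimodal content format:

```python
# Sketch only: package the surfer's observation so text-only agents can read
# the viewport text and vision-capable agents can also "see" the screenshot.
import base64


def make_surfer_message(viewport_text: str, screenshot_png: bytes) -> dict:
    encoded = base64.b64encode(screenshot_png).decode("ascii")
    return {
        "role": "assistant",
        "content": [
            {"type": "text", "text": viewport_text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encoded}"}},
        ],
    }
```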

This is where the following roadmap and task list lead us:

- [x] Create a MultimodalWebSurferAgent, similar to WebSurferAgent, but that takes a `Playwright` page or browser context instead of a Markdown browser
- [x] Support the 7 core WebVoyager function calls: Click, Input, Scroll, Wait, Back, Web Search, and Answer (one possible tool-schema sketch follows this list)
- [ ] Provide an abstraction for localizing elements in MLM responses, and provide Set-of-Mark bounding boxes and prompts as one implementation.
- [x] Related to set-of-mark prompting, explore the accessibility tree (AXTree) for focusable elements, and fully-resolved aria labels and roles (see existing proof of concept)
- [ ] Explore allowing the agent to write and run JavaScript code in the page's context (it would become a new execution environment for agents)
- [ ] Ensure compatibility with both vision-capable and text-only conversation partner agents (vision-capable agents should also receive screenshots)
- [ ] Ensure screenshots and page text snapshots are synchronized.
- [ ] Support downloads (perhaps falling back to the Markdown approach above)
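For the function-call item above, the following is one possible shape for the seven actions as OpenAI-style tool schemas. The names and parameters are illustrative assumptions, not the actual MultimodalWebSurferAgent definitions:

```python
# Illustrative sketch only: seven WebVoyager-style actions as tool schemas.
def tool(name, description, properties, required):
    """Small helper to keep the seven schemas readable."""
    return {"type": "function", "function": {
        "name": name, "description": description,
        "parameters": {"type": "object",
                       "properties": properties, "required": required}}}


WEB_SURFER_TOOLS = [
    tool("click", "Click the element with the given Set-of-Mark label.",
         {"label": {"type": "integer"}}, ["label"]),
    tool("input_text", "Type text into the labeled input field.",
         {"label": {"type": "integer"}, "text": {"type": "string"}},
         ["label", "text"]),
    tool("scroll", "Scroll the viewport up or down.",
         {"direction": {"type": "string", "enum": ["up", "down"]}},
         ["direction"]),
    tool("wait", "Wait briefly for the page to finish loading.", {}, []),
    tool("back", "Navigate back to the previous page.", {}, []),
    tool("web_search", "Run a web search and open the results page.",
         {"query": {"type": "string"}}, ["query"]),
    tool("answer", "Stop browsing and report the final answer.",
         {"text": {"type": "string"}}, ["text"]),
]
```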

gagb commented 3 months ago

Fantastic roadmap and description @afourney!

afourney commented 3 months ago

> Fantastic roadmap and description @afourney!

I'm hoping it might form the basis of a blog post later.

afourney commented 3 months ago

@BeibinLi I'd love to hear your thoughts on this part of the design proposal in particular:

Basically, any agents that MultimodalWebSurfer talks to should also be able to "see" the web page via the screenshots (if vision-capable), and direct MultimodalWebSurfer to take further actions (e.g., "Sort the table by cost.", "Scroll to the reviews section.", etc.)

gagb commented 3 months ago

> Fantastic roadmap and description @afourney!
>
> I'm hoping it might form the basis of a blog post later.

Yesssssss

BeibinLi commented 3 months ago

@afourney Yes, ideally it will work.

Do you want to use GPT-4V for MultimodalWebSurfer or for all agents? I think using GPT-4V for all agents might produce better results. See #2013.

Caveat: GPT-4V is not good enough for reading tables and similar tasks, so we will need OCR and other rule-based methods for the VisionCapability, which will then be added to the other agents that talk with the MultimodalWebSurfer.

afourney commented 3 months ago

> @afourney Yes, ideally it will work.
>
> Do you want to use GPT-4V for MultimodalWebSurfer or for all agents? I think using GPT-4V for all agents might produce better results. See #2013.
>
> Caveat: GPT-4V is not good enough for reading tables and similar tasks, so we will need OCR and other rule-based methods for the VisionCapability, which will then be added to the other agents that talk with the MultimodalWebSurfer.

I was thinking all agents. But the text content of the message would also contain the text of the webpage from the DOM (no need for OCR), so ideally any agent can consume it.

BeibinLi commented 3 months ago

@afourney Got it! Then, yes, this design would work.

skzhang1 commented 3 months ago

Great design! One difficulty is labeling each interactive component on a web page. WebVoyager seems to use a separate interactive segmentation model. Correct bounding boxes are a prerequisite for mapping natural language instructions to low-level browser commands, and I think it is hard for GPT-4V to directly label each element.

afourney commented 3 months ago

> Great design! One difficulty is labeling each interactive component on a web page. WebVoyager seems to use a separate interactive segmentation model. Correct bounding boxes are a prerequisite for mapping natural language instructions to low-level browser commands, and I think it is hard for GPT-4V to directly label each element.

Yes. Good point. I want to abstract this step so that we can substitute in different implementations. https://github.com/schauppi/MultimodalWebAgent has a good approach to this. I've also had reasonable results using the accessibility tree (AXTree) to enumerate interactive components (focusable elements, etc.). Once we know which elements are interactive, we can decorate them with labels and outlines. A sketch of the AXTree approach follows.
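Here is a hedged sketch of the AXTree idea, assuming Playwright's accessibility snapshot (not the existing proof of concept referenced above): walk the tree, keep nodes with interactive roles, and use them as the candidates for Set-of-Mark labels.

```python
# Sketch only: enumerate likely-interactive elements from the accessibility
# tree instead of relying on a separate segmentation model.
from playwright.sync_api import sync_playwright

INTERACTIVE_ROLES = {"link", "button", "textbox", "combobox",
                     "checkbox", "radio", "tab", "menuitem"}


def interactive_nodes(node, found=None):
    """Recursively collect AXTree nodes whose role suggests interactivity."""
    found = [] if found is None else found
    if node and node.get("role") in INTERACTIVE_ROLES:
        found.append({"role": node["role"], "name": node.get("name", "")})
    for child in (node or {}).get("children", []):
        interactive_nodes(child, found)
    return found


with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://microsoft.github.io/autogen/")
    tree = page.accessibility.snapshot(interesting_only=True)
    for i, node in enumerate(interactive_nodes(tree)):
        print(i, node["role"], node["name"])
    browser.close()
```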

Tylersuard commented 3 months ago

I was going to say this but for GUIs

afourney commented 3 months ago

> I was going to say this but for GUIs

General apps or GUIs would require a different mechanism to capture the window and generate events, but the principle would be very similar; we just wouldn't have the DOM for perfect segmentation and element info.

skzhang1 commented 3 months ago

@afourney got it!

Tylersuard commented 3 months ago

Actually this may not be a good idea. If we can use agents to automate web browsing, how many jobs might be eliminated?

gasse commented 3 months ago

@afourney you should check out our recently released browsergym :) It is meant to be a flexible framework built upon Playwright. It already supports most of the features you describe (AXTree, screenshots, different action spaces). Disclaimer: I am one of the authors of the library.

afourney commented 2 months ago

A quick update. PR #1929 is out of draft and, once merged, will complete many of the Markdown browsing items.

Work on the MultimodalWebSurfer is active and ongoing in the ct_webarena branch of the repo, under autogen/autogen/contrib/multimodal_web_surfer. A standalone PR will be prepared once we've stabilized some of the larger issues (e.g., synchronizing text and screenshots).