Closed kfsone closed 1 month ago
@kfsone, as you say, there are risks. What's your recommendation for mitigating them?
I agree that relying on AutoGen to write and run code to fetch URLs, parse pages, and do web searches is suboptimal -- not necessarily for security reasons, but mainly because it increases the risk of failure. Simply stated: more things need to go right for the task to succeed.
I have been thinking about creating a stateful WebSurferAgent that can conduct searches and read pages, similar to https://openai.com/research/webgpt, but I've been too busy on evaluation to get to it yet. I would be very pleased if someone else took this on.
Not specifically on the topic of safety, but related: I had to build precautions into the assistant to handle basic cases like avoiding a site that can't be scraped, or one that should be avoided because of low content quality. I'm also mitigating this with additional logic in the search function the assistant uses. On the security side, perhaps some integration with something like Open Threat Exchange or another source of threat intelligence would help.
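To make the idea concrete, here is a minimal sketch of the kind of filtering logic described above, applied to search results before the assistant sees them. The blocklist, the result-dict shape, and the snippet-length heuristic are all illustrative assumptions, not code from the actual assistant.

```python
# Hypothetical sketch: filter search results before handing them to the agent.
# BLOCKED_DOMAINS and the snippet-length check are illustrative placeholders;
# a real deployment might consult a threat-intelligence feed such as OTX here.

from urllib.parse import urlparse

BLOCKED_DOMAINS = {"example-spam-farm.com", "cannot-be-scraped.example"}

def filter_results(results: list[dict]) -> list[dict]:
    """Drop results whose domain is blocked or whose snippet looks too thin."""
    kept = []
    for r in results:
        domain = urlparse(r["url"]).netloc.lower()
        if domain in BLOCKED_DOMAINS:
            continue  # known-bad or unscrapable site
        if len(r.get("snippet", "")) < 20:
            continue  # crude low-content-quality heuristic
        kept.append(r)
    return kept
```

The same check could be run again at fetch time, since search results and followed links don't always go through the same code path.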
@rickyloynd-microsoft Obviously it's a long hose that never runs dry, but I think the most likely source of risk at the moment is unskilled/inexperienced people -- say, YouTubers, or people in (or looking to be in) government -- coming along, firing up AutoGen on a machine, and asking it to do something that requires web access. They'll likely use an LM or something and grab a model or three from Hugging Face.
They'll land on https://github.com/microsoft/autogen/blob/main/notebook/agentchat_web_info.ipynb and perhaps not understand that what it's doing is building its own web-access mechanism from code the models generate. That makes which model you use really significant, and I don't imagine many people -- even your actual target audience for AutoGen -- will intuit that.
What I'm suggesting here is really just Stage 0, adding an agent/feature/capability to autogen that provides the web-retrieval capability similar to the way you have an executor capability, and the notebook linked would be the primer for how to use that.
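As a rough illustration of that "Stage 0" idea: instead of the model generating its own HTTP code, the framework would expose a fixed, audited fetch function as a tool the agent calls. The function names and the registration calls below are a sketch only -- check the AutoGen docs for the exact tool-registration API in your version.

```python
# Sketch of a fixed web-retrieval tool, analogous to the executor capability:
# the agent invokes fetch_url by name rather than writing its own HTTP code.

import urllib.request

MAX_CHARS = 8_000  # keep responses small enough for the model's context window

def truncate(text: str, limit: int = MAX_CHARS) -> str:
    """Cap tool output so a huge page can't blow out the conversation."""
    return text if len(text) <= limit else text[:limit] + "\n[truncated]"

def fetch_url(url: str, timeout: float = 10.0) -> str:
    """Fetch a page (HTTPS only) and return its truncated body as text."""
    if not url.startswith("https://"):
        return "ERROR: only https URLs are allowed"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read(2 * MAX_CHARS).decode("utf-8", errors="replace")
    return truncate(body)

# Hypothetical registration with agents (exact API names may differ by version):
# assistant.register_for_llm(name="fetch_url", description="Fetch a web page")(fetch_url)
# user_proxy.register_for_execution(name="fetch_url")(fetch_url)
```

The point is that the fetch path is fixed and reviewable, so the choice of model no longer determines whether web access is implemented safely.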
I've had to learn some tough lessons about accidental and intentional user abuse over the decades(*), so I've developed a reasonable spidey-sense for "this stupid tiny side branch is going to explode and break the project", and letting-the-LLM-write-its-own-web-access-capability feels like a setup for one of those.
(* a small sampling: https://www.deafblind.com/dbtechies4.html 'web by mail', oh how naive I was; https://web.archive.org/web/19990208003211/http://about.warbirds.org/ allow people to create mail aliases and mailing lists for free, how could that go wrong, although it never actually got exploited while I was running it; https://web.archive.org/web/19980703072207/www.kfs.org/tools.html web accessible dns/ping/traceroute tools via cgi?)
@kfsone I'll make sure to add you as a reviewer if I see a PR for it. At the very least, we should remove "Web Search" from the example's title -- the example doesn't really perform a web search.
Is this something we still want to move forward with? It will absolutely increase the risk of failure, since it could break in so many ways. That said, agents need a way to search the internet. If we agree on what an MVP would be, I could look into building a first version.
Reliable web search is implemented in the current WebSurfer agent, and greatly expanded in the update I am preparing in #1929: https://github.com/microsoft/autogen/blob/headless_web_surfer/autogen/browser_utils/markdown_search.py
This uses a BING_API_KEY when available, or else falls back to scraping. https://github.com/microsoft/autogen/blob/852ee3375bca61fc1d0c004060439d0b4a906aad/autogen/browser_utils/mdconvert.py#L363-L426
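The key-or-scrape fallback described above might look roughly like the sketch below. The Bing endpoint and header follow the public Web Search v7 API; the result-dict shape and the scraping placeholder are illustrative assumptions, not the code from the linked files.

```python
# Sketch of the fallback pattern: use the Bing Web Search API when
# BING_API_KEY is set in the environment, otherwise fall back to scraping.

import os

def web_search(query: str) -> list[dict]:
    api_key = os.environ.get("BING_API_KEY")
    if api_key:
        return _bing_search(query, api_key)
    return _scrape_search(query)  # best-effort fallback; more fragile

def _bing_search(query: str, api_key: str) -> list[dict]:
    import requests  # third-party dependency, only needed on this path
    resp = requests.get(
        "https://api.bing.microsoft.com/v7.0/search",
        headers={"Ocp-Apim-Subscription-Key": api_key},
        params={"q": query},
        timeout=10,
    )
    resp.raise_for_status()
    pages = resp.json().get("webPages", {}).get("value", [])
    return [
        {"title": p["name"], "url": p["url"], "snippet": p.get("snippet", "")}
        for p in pages
    ]

def _scrape_search(query: str) -> list[dict]:
    # Placeholder: a real fallback parses the HTML of a search results page.
    raise NotImplementedError("no BING_API_KEY set and no scraper configured")
```

The API path is much more robust than scraping, which is presumably why it takes priority when a key is available.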
@afourney nice work. This looks fun to use! Why did you choose to use markdown?
The first WebSurfer uses Markdown, so that's why the search integration uses Markdown.
But then the question is: why use Markdown for WebSurfer in the first place? Well: it preserves the document's structure -- headings, links, lists -- in a compact textual form that LLMs handle well.
@afourney yes, that makes sense. I recently worked on a system that converted docs to Markdown and sent them to an LLM. I switched it to plain text instead of Markdown and noticed the performance went down because the LLM lost meaning, so I converted it back to Markdown!
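A toy example, using only the standard library, of the difference being described: the Markdown path keeps headings and link targets that a plain-text dump throws away. This handles only `<h1>` and `<a>` and is purely illustrative; real converters like the mdconvert module linked above cover far more.

```python
# Minimal HTML-to-Markdown sketch showing what structure survives conversion.

from html.parser import HTMLParser

class TinyMarkdown(HTMLParser):
    """Convert <h1> and <a href=...> to Markdown; pass other text through."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.out.append("# ")
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.out.append("\n\n")
        elif tag == "a":
            self.out.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        self.out.append(data)

def to_markdown(html: str) -> str:
    parser = TinyMarkdown()
    parser.feed(html)
    return "".join(parser.out)

html = '<h1>Pricing</h1>See <a href="https://example.com/plans">plans</a>.'
md = to_markdown(html)
# md == "# Pricing\n\nSee [plans](https://example.com/plans)."
```

A plain-text conversion of the same snippet would yield just "Pricing See plans." -- the heading level and the link target are gone, which is exactly the meaning the LLM was losing.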
While most LMs will allow AutoGen to self-implement something to fetch URLs, that's absolutely not guaranteed, and certainly not guaranteed to be safe. Looking at the patterns and rate of growth of models on HF, there are probably already nefariously manipulated models in use.
Given that code execution is involved, exploits are inevitable, but having a web-capable baseline agent/facility might help delay them and reduce risk exposure.