mithril-security / blind_chat

A fully in-browser privacy solution to make Conversational AI privacy-friendly
https://chat.mithrilsecurity.io/
Apache License 2.0
223 stars · 24 forks

Internet search #10

Open lyie28 opened 1 year ago

clauverjat commented 1 year ago

Integrating search into the chat while preserving user privacy is no small task. I've broken down my thoughts about the relevant challenges and how we could implement the feature.

But first a quick recap on how search is integrated into an AI chat.
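To make the recap concrete, here is a minimal sketch of the usual search-augmented pipeline: the user's query goes to a search engine, the returned pages are fetched, and the model synthesizes an answer from that context. All function names are hypothetical stubs, not BlindChat's actual API; a real implementation would call a search engine API, an HTTP fetcher, and the (local or enclave-hosted) model.

```python
# Minimal sketch of a typical search-augmented chat pipeline.
# All functions are illustrative stubs, not BlindChat's real code.

def search(query: str) -> list[str]:
    """Ask a search engine for relevant URLs (stubbed here)."""
    return [f"https://example.com/result/{i}" for i in range(3)]

def fetch(url: str) -> str:
    """Download the page content (stubbed here)."""
    return f"contents of {url}"

def llm_generate(prompt: str) -> str:
    """Stub for the (local or enclave-hosted) model call."""
    return f"[answer based on {prompt.count('contents of')} sources]"

def answer_with_search(user_query: str) -> str:
    """1. search, 2. fetch pages, 3. let the LLM synthesize an answer."""
    urls = search(user_query)
    context = "\n".join(fetch(u) for u in urls)
    # The LLM sees the user query plus the fetched context.
    prompt = f"Context:\n{context}\n\nQuestion: {user_query}"
    return llm_generate(prompt)
```

Each of the three steps is a point where user data can leak, which is what the rest of this comment walks through.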

Now, let's evaluate the privacy challenges with this approach.

Dependence on a search engine: since the user's query is sent to the search engine, users need to trust the search engine (and not only the local model or the enclave). The search feature should therefore come with a prominent warning so that users understand the privacy implications of enabling it. When users do choose to enable search, we should consider privacy-friendly search engines like DuckDuckGo in order to maximize privacy.
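One way the warning and engine choice could be enforced in code is sketched below. The function names, the exception choice, and the allowlist are all hypothetical; this only illustrates gating search behind explicit consent and an allowlist of privacy-friendly engines.

```python
# Hypothetical sketch: search is opt-in and restricted to a
# privacy-friendly allowlist. Not BlindChat's actual API.

PRIVACY_FRIENDLY_ENGINES = {"duckduckgo"}

def do_search(engine: str, query: str) -> list[str]:
    """Stubbed network call to the chosen search engine."""
    return [f"https://{engine}.example/result?q={query}"]

def search_with_consent(query: str, engine: str, user_consented: bool) -> list[str]:
    if not user_consented:
        # Surface the warning instead of silently sending data out.
        raise PermissionError(
            "Web search sends your query to a third-party search engine; "
            "enable it explicitly to proceed."
        )
    if engine not in PRIVACY_FRIENDLY_ENGINES:
        raise ValueError(f"{engine!r} is not on the privacy-friendly allowlist")
    return do_search(engine, query)
```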

Fetching web page contents: the challenge here is twofold. First, fetching the content of the returned web pages exposes us to the sites' trackers and cookies, and sadly there is little we can do about trackers embedded in the pages themselves. Second, there is metadata leakage: anyone inspecting the network traffic of the service fetching the content can see which websites are queried. Here we have essentially two options: do the search locally, or in an enclave.
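We can't neutralize trackers inside the fetched pages, but the fetcher can at least avoid amplifying them, e.g. by stripping common tracking query parameters before requesting a URL. A small sketch (the parameter list is illustrative, not exhaustive):

```python
# Sketch: strip well-known tracking query parameters from URLs
# before fetching them. The parameter set is illustrative only.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def strip_tracking_params(url: str) -> str:
    parts = urlsplit(url)
    clean_query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    )
    return urlunsplit(parts._replace(query=clean_query))
```

Note this does nothing against the metadata-leakage problem: the destination host is still visible on the wire, which is what the two options below address.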

Option 1: local implementation of the search. I don't believe it's technically feasible to do the search locally (in-browser), because the browser restricts what a page from one domain can fetch (CORS / Same-Origin Policy). If we still wanted to do it locally, we would need to ship either a browser extension or a desktop app, both of which present adoption challenges, so I don't think we should go down that road (at least for now). Besides, if the search were done locally, the loaded pages would be subject to the same metadata-leakage issues as the rest of the user's browsing activity: notably, a network administrator could infer which websites/domains were searched or accessed from the destination IPs in the network packets. Still, this concern applies to the user's entire browsing history, so users worried about this risk will need a VPN (or Tor/I2P if a VPN isn't enough).

Option 2: use an enclave. Like for the models, we could call the search engine and load pages from an enclave. This option does not have the feasibility issue of the local implementation: we could implement it and integrate it into our existing web app. However, as with local search, network-related metadata, which can often be revealing, remains exposed. It might actually be more problematic, since a malicious administrator could spy on the enclave's network interface and analyze all its traffic. If there are many users, the concept of an "anonymity set" offers some relief: a particular website cannot be linked to a particular user, since multiple users use the enclave at the same time. But I think we should go further. To address the issue, and provide greater privacy against us (the service operator), we could route traffic through a trustworthy VPN like Mullvad. They are known to take privacy seriously (Mozilla partnered with them for their VPN) and, like us, they use remote attestation to attest their servers' software stack. That way, even we could not tell which websites are queried from the traffic, which is very nice!
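The "anonymity set" idea above can be sketched very simply: the enclave batches fetches from many users and emits only a shuffled list of URLs on its network interface, so an outside observer sees which sites were contacted but cannot attribute any URL to a user. This is purely illustrative (the function and its batching granularity are hypothetical), and of course the enclave itself must still be trusted or attested:

```python
# Illustrative sketch of an anonymity set: requests from several users
# are batched and shuffled, hiding the user-to-URL mapping from anyone
# observing the enclave's outbound traffic.
import random

def mix_requests(requests, seed=None):
    """requests is a list of (user_id, url) pairs; output drops the
    user ids and shuffles the URLs before they go on the wire."""
    urls = [url for _user, url in requests]
    random.Random(seed).shuffle(urls)
    return urls
```

The larger the batch (i.e. the more concurrent users), the less an observed URL reveals about any one of them; with a single user the shuffle obviously protects nothing, which is why the VPN layer is still worth adding.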

dhuynh95 commented 1 year ago

Interesting insights @clauverjat

My feedback:

My assumption is that, given that people already use web search a lot, both professionally and personally, and that good privacy solutions exist (VPNs, as you mentioned, or DuckDuckGo), the question is more: can the use of our service with web search expose data to us, and indirectly to the other services we rely on?

If we assume we just do synthesis generation, and the content to be synthesized by the LLM reaches the user through an enclave, then there is no more exposure to us than usual.

I think the easiest way to move forward is to ask our community what they want for search, and what they want to protect from whom. But good mapping :)