mithril-security / blind_chat

A fully in-browser privacy solution to make Conversational AI privacy-friendly
https://chat.mithrilsecurity.io/
Apache License 2.0
223 stars · 24 forks

Internet search #10

Open lyie28 opened 1 year ago

clauverjat commented 1 year ago

Integrating search into the chat while preserving user privacy is no small task. I've broken down my thoughts about the relevant challenges and how we could implement the feature.

But first a quick recap on how search is integrated into an AI chat.
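To make the recap concrete, here is a minimal sketch of the usual search-augmented pipeline: the user's query goes to a search engine, the returned pages are fetched, and the model synthesizes an answer from that context. All function names are hypothetical stubs, not BlindChat's actual API; a real implementation would call a search engine API, an HTTP fetcher, and the (local or enclave-hosted) model.

```python
# Minimal sketch of a typical search-augmented chat pipeline.
# All functions are illustrative stubs, not BlindChat's real code.

def search(query: str) -> list[str]:
    """Ask a search engine for relevant URLs (stubbed here)."""
    return [f"https://example.com/result/{i}" for i in range(3)]

def fetch(url: str) -> str:
    """Download the page content (stubbed here)."""
    return f"contents of {url}"

def llm_generate(prompt: str) -> str:
    """Stub for the (local or enclave-hosted) model call."""
    return f"[answer based on {prompt.count('contents of')} sources]"

def answer_with_search(user_query: str) -> str:
    """1. search, 2. fetch pages, 3. let the LLM synthesize an answer."""
    urls = search(user_query)
    context = "\n".join(fetch(u) for u in urls)
    # The LLM sees the user query plus the fetched context.
    prompt = f"Context:\n{context}\n\nQuestion: {user_query}"
    return llm_generate(prompt)
```

Each of the three steps is a point where user data can leak, which is what the rest of this comment walks through.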

Now, let's evaluate the privacy challenges with this approach.

Dependence on a search engine: since the user's query is sent to the search engine, users need to trust the search engine (and not only the local model or the enclave). The search feature should therefore come with a prominent warning so that users understand the privacy implications of enabling it. When users do choose to enable search, we should consider privacy-friendly search engines like DuckDuckGo in order to maximize privacy.
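One way the warning and engine choice could be enforced in code is sketched below. The function names, the exception choice, and the allowlist are all hypothetical; this only illustrates gating search behind explicit consent and an allowlist of privacy-friendly engines.

```python
# Hypothetical sketch: search is opt-in and restricted to a
# privacy-friendly allowlist. Not BlindChat's actual API.

PRIVACY_FRIENDLY_ENGINES = {"duckduckgo"}

def do_search(engine: str, query: str) -> list[str]:
    """Stubbed network call to the chosen search engine."""
    return [f"https://{engine}.example/result?q={query}"]

def search_with_consent(query: str, engine: str, user_consented: bool) -> list[str]:
    if not user_consented:
        # Surface the warning instead of silently sending data out.
        raise PermissionError(
            "Web search sends your query to a third-party search engine; "
            "enable it explicitly to proceed."
        )
    if engine not in PRIVACY_FRIENDLY_ENGINES:
        raise ValueError(f"{engine!r} is not on the privacy-friendly allowlist")
    return do_search(engine, query)
```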

Fetching web page contents: the challenge here is twofold. First, fetching the content of the returned web pages exposes us to the sites' trackers and cookies, and sadly there is little we can do about trackers embedded in the pages themselves. Second, there is metadata leakage: anyone inspecting the network traffic of the service fetching the content can see which websites are queried. Here we have essentially two options: do the search locally, or in an enclave.
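We can't neutralize trackers inside the fetched pages, but the fetcher can at least avoid amplifying them, e.g. by stripping common tracking query parameters before requesting a URL. A small sketch (the parameter list is illustrative, not exhaustive):

```python
# Sketch: strip well-known tracking query parameters from URLs
# before fetching them. The parameter set is illustrative only.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def strip_tracking_params(url: str) -> str:
    parts = urlsplit(url)
    clean_query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    )
    return urlunsplit(parts._replace(query=clean_query))
```

Note this does nothing against the metadata-leakage problem: the destination host is still visible on the wire, which is what the two options below address.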

Option 1: local implementation of the search. I don't believe it's technically feasible to do the search locally (in-browser), because the browser restricts what a page from one domain can fetch (CORS / Same-Origin Policy). If we still wanted to do it locally, we would need to ship either a browser extension or a desktop app, both of which present adoption challenges, so I don't think we should go down that road (at least for now). Besides, if the search were done locally, the loaded pages would be subject to the same metadata-leakage issues as the rest of the user's browsing activity: notably, a network administrator could infer which websites/domains were searched or accessed from the destination IPs in the network packets. Still, this concern applies to the user's entire browsing history, so users worried about this risk will need a VPN (or Tor/I2P if a VPN isn't enough).

Option 2: use an enclave. Like for the models, we could call the search engine and load pages from an enclave. This option does not have the feasibility issue of the local implementation: we could implement it and integrate it into our existing web app. However, as with local search, network-related metadata, which can often be revealing, remains exposed. It might actually be more problematic, since a malicious administrator could spy on the enclave's network interface and analyze all its traffic. If there are many users, the concept of an "anonymity set" offers some relief: a particular website cannot be linked to a particular user, since multiple users use the enclave at the same time. But I think we should go further. To address the issue, and provide greater privacy against us (the service operator), we could route traffic through a trustworthy VPN like Mullvad. They are known to take privacy seriously (Mozilla partnered with them for their VPN) and, like us, they use remote attestation to attest their servers' software stack. That way, even we could not tell which websites are queried from the traffic, which is very nice!
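The "anonymity set" idea above can be sketched very simply: the enclave batches fetches from many users and emits only a shuffled list of URLs on its network interface, so an outside observer sees which sites were contacted but cannot attribute any URL to a user. This is purely illustrative (the function and its batching granularity are hypothetical), and of course the enclave itself must still be trusted or attested:

```python
# Illustrative sketch of an anonymity set: requests from several users
# are batched and shuffled, hiding the user-to-URL mapping from anyone
# observing the enclave's outbound traffic.
import random

def mix_requests(requests, seed=None):
    """requests is a list of (user_id, url) pairs; output drops the
    user ids and shuffles the URLs before they go on the wire."""
    urls = [url for _user, url in requests]
    random.Random(seed).shuffle(urls)
    return urls
```

The larger the batch (i.e. the more concurrent users), the less an observed URL reveals about any one of them; with a single user the shuffle obviously protects nothing, which is why the VPN layer is still worth adding.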

dhuynh95 commented 1 year ago

Interesting insights @clauverjat

My feedback:

My assumption is that, given that people already use web search a lot, both professionally and personally, and that good privacy solutions exist (VPNs, as you mentioned, or DuckDuckGo), the question is more: can the use of our service with web search expose data to us, and indirectly to the other services we rely on?

If we assume we just do synthesis generation, and the content to be synthesized by the LLM reaches the user through an enclave, then there is no more exposure to us than usual.

I think the easiest way to move forward is to ask our community what they want for search, and what they want to protect from whom. But good mapping :)