omnivore-app / omnivore

Omnivore is a complete, open source read-it-later solution for people who like reading.
https://omnivore.app
GNU Affero General Public License v3.0
12.63k stars 638 forks

Make Omnivore a zero-knowledge E2EE service #1658

Open jsynowiec opened 1 year ago

jsynowiec commented 1 year ago

In cloud computing, the term zero-knowledge (or occasionally no-knowledge or zero access) refers to an online service that stores, transfers or manipulates data in a way that maintains a high level of confidentiality, where the data is only accessible to the data's owner (the client), and not to the service provider. This is achieved by encrypting the raw data at the client's side or end-to-end (in case there is more than one client), without disclosing the password to the service provider.

https://en.wikipedia.org/wiki/Zero-knowledge_service
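
To make "without disclosing the password to the service provider" concrete, here is a minimal sketch of one common pattern: the client stretches the password into two separate keys, keeps the encryption key local, and only ever sends the other one to the server for login. The function names, the work factor, and the 64-byte split are illustrative assumptions. Nothing here describes how Omnivore or any of the services mentioned in this thread actually does it.

```python
import hashlib
import hmac
import secrets

PBKDF2_ROUNDS = 200_000  # illustrative work factor (assumption, tune for real use)


def derive_keys(password: str, salt: bytes) -> tuple[bytes, bytes]:
    """Stretch the password into 64 bytes and split them into two keys."""
    material = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, PBKDF2_ROUNDS, dklen=64
    )
    encryption_key = material[:32]  # never leaves the client
    auth_key = material[32:]        # sent to the server instead of the password
    return encryption_key, auth_key


def server_record(auth_key: bytes) -> bytes:
    """What a hypothetical server stores: only a hash of the auth key."""
    return hashlib.sha256(auth_key).digest()


def server_verify(stored: bytes, presented_auth_key: bytes) -> bool:
    """Constant-time check of a presented auth key against the stored hash."""
    candidate = hashlib.sha256(presented_auth_key).digest()
    return hmac.compare_digest(stored, candidate)


salt = secrets.token_bytes(16)  # per-user salt, not secret
enc_key, auth_key = derive_keys("correct horse battery staple", salt)
stored = server_record(auth_key)
```

In this sketch the server can authenticate the user with `server_verify(stored, auth_key)`, but it never receives the password or `enc_key`, so it cannot decrypt anything the client encrypted with that key.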

Because self-hosting is not easy (https://github.com/omnivore-app/omnivore/issues/25), and neither the iOS nor the macOS app currently allows pointing it at a self-hosted instance, please consider switching to a zero-knowledge model by introducing end-to-end encryption of all user data. By user data, I mean all data that is created or stored by the user, or in any way derived from that data. Among other things, this includes URLs, highlights, notes, and labels.

For reference, there are open-source platforms like Notesnook, Standard Notes, Joplin and Turtl that implement end-to-end encryption and deal with more or less the same challenges of storing, accessing, and searching text. Another one is Skiff. It includes e2ee mail, documents, and cloud storage, and its search component works pretty well. You could also check how Proton does it in the mail app or the SMTP bridge, and look at the OpenPGP.js and GopenPGP libraries that Proton uses and maintains.
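
Searching text that the server cannot read is the hard part mentioned above. One known technique is a keyed-hash ("blind") index: the client replaces every word with an HMAC under a key only it holds, and the server matches those opaque tokens. The sketch below is only an illustration of that technique under assumed names (`blind_token`, `build_index`, `search`, and the demo key). It is not a claim about how any of the projects named above implement search.

```python
import hashlib
import hmac
import re


def tokens(text: str) -> set[str]:
    """Lower-case the text and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def blind_token(index_key: bytes, word: str) -> str:
    """Keyed hash of a word; meaningless without the client's index key."""
    return hmac.new(index_key, word.encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(index_key: bytes, docs: dict[str, str]) -> dict[str, set[str]]:
    """Client side: map each blinded word to the IDs of documents containing it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for word in tokens(text):
            index.setdefault(blind_token(index_key, word), set()).add(doc_id)
    return index


def search(index: dict[str, set[str]], index_key: bytes, query: str) -> set[str]:
    """Look up a single-word query by its blinded token."""
    return index.get(blind_token(index_key, query.lower()), set())


# Demo key and notes are made up for illustration.
index_key = hashlib.sha256(b"demo index key").digest()
notes = {
    "n1": "Zero-knowledge services encrypt on the client",
    "n2": "RSS feeds are fetched on the backend",
}
index = build_index(index_key, notes)
```

A server holding only `index` sees opaque tokens and document IDs. The trade-off is that equal words always produce equal tokens, so word frequency and co-occurrence patterns still leak.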

Some additional articles why E2EE is important:

ashutoshsaboo commented 1 year ago

Huge +1 on this. Hi @jacksonh, are there any plans to implement this? This would be gold if it can be added. It would make Omnivore one of a kind and the read-it-later app that almost everyone turns to. In my personal view it's still the best of all the read-it-later apps out there, but e2ee would make it stand apart by quite a distance. It would be really, really amazing if this can be added!

Would be great to hear your thoughts on this! I understand it might require quite a few changes in the short term, but could this potentially be added to Omnivore in the medium to long term?

jacksonh commented 1 year ago

Hey, thanks for bumping this as I have been thinking about it more lately.

The biggest issue I see is a lot of the content for a read-it-later app is actually generated on the backend. These are some of the main sources of content:

All these forms of content are generated on the backend. In the case of save URL, a URL is sent to our backend, which creates a browser context, fetches the page content, and then makes it readable. In this scenario, I'm not sure how E2EE could be implemented. Even if the URL were sent to us zero knowledge, we would have to run the browser instance and fetch the plain text content.

Obviously newsletters and RSS have the same issue. The one place we could easily encrypt on the client side is when the browser extension or the iOS share extension is used.

I've been trying to think of ways we could make something zero knowledge but haven't come up with anything yet.

User generated content like highlights and notes would be easier, but I'm not sure of the value.

jsynowiec commented 1 year ago

@jacksonh also related: do not mix up client-side encryption with e2ee and zero-knowledge. They are not the same and shouldn't be used interchangeably. In e2ee, key pairs are used to securely exchange messages between a sender and a recipient, so that an intermediary cannot read the exchange. Public keys are used to encrypt information, and private keys are used to decrypt it. Each message is encrypted by the sender with the recipient's public key, and can only be decrypted with the private key on the recipient's device. Client-side encryption only means that the data is encrypted on the client. And I quoted the zero-knowledge explanation in the first post.

If backend processing is involved, one could argue about whether zero-knowledge is achievable. However, you could still design and implement this so that you treat your worker processes' context as another e2ee client. Exchange keys, then fetch and process data in a way that you only know what you are processing, but not for whom. Then have an exchange from which a "reader" client can anonymously and securely fetch the data and store it in its own context (and possibly upload it, re-encrypted, to the backend along with any other user-created content). You would split the problem into two domains: one is the e2ee exchange and storage of the data, the other is the fetching and processing of the articles. When done correctly, you can fetch the contents without knowing for whom you process them, then have them exchanged anonymously and securely using keys, so that you maintain that (somewhat) zero knowledge. It provides the additional benefit of being able to dedupe and avoid processing the same article multiple times on the server.
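
A rough sketch of the data flow described above, reduced to its two moving parts: a fetch cache keyed by URL (the dedupe) and a mailbox keyed by a random, client-chosen pickup token instead of an account. Everything here is hypothetical (the `Exchange` class, `fake_fetch`, the token scheme), and the encryption of results to a per-request client key is deliberately left out because the Python standard library has no public-key encryption. Network-level identifiers such as IP addresses are also out of scope.

```python
import hashlib
import secrets


def _digest(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


class Exchange:
    """In-memory stand-in for the anonymous fetch exchange (hypothetical design)."""

    def __init__(self, fetch):
        self._fetch = fetch    # callable: url -> readable text
        self._cache = {}       # sha256(url) -> processed text, shared by all requests
        self._mailbox = {}     # sha256(pickup token) -> sha256(url)
        self.fetch_count = 0   # how many times a page was actually fetched

    def submit(self, url: str, pickup_token: str) -> None:
        """Accept a request that carries no account ID, only a client-chosen token."""
        url_id = _digest(url)
        if url_id not in self._cache:  # dedupe: each URL is processed once
            self._cache[url_id] = self._fetch(url)
            self.fetch_count += 1
        self._mailbox[_digest(pickup_token)] = url_id

    def collect(self, pickup_token: str):
        """Return the result to whoever presents the token, then forget the link."""
        url_id = self._mailbox.pop(_digest(pickup_token), None)
        return None if url_id is None else self._cache[url_id]


def fake_fetch(url: str) -> str:
    # Placeholder for the real browser-based fetch and readability step.
    return f"readable text of {url}"


exchange = Exchange(fake_fetch)
alice_token = secrets.token_urlsafe(32)
bob_token = secrets.token_urlsafe(32)
exchange.submit("https://example.com/post", alice_token)
exchange.submit("https://example.com/post", bob_token)
alice_text = exchange.collect(alice_token)
```

In this sketch the worker knows which URL it processed, but the only thing tying a result to a reader is a one-time token that is discarded on pickup. Two readers saving the same article cause a single fetch.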

The above is just a thought experiment on my end. It's a rough idea of a trade-off I was considering for a different hobby project. It definitely has loose ends and loopholes, but maybe you could adapt it for Omnivore.

jacksonh commented 1 year ago

Thanks. What I'm not understanding here is how you can really achieve a zero knowledge system without client side encryption.

Maybe taking a step back, what is the goal with this system? Could you explain a scenario where encrypting a message to save a URL would be different from making an HTTPS call? In the E2EE scenario the Omnivore client is the sender and the Omnivore backend is the recipient, correct?

jsynowiec commented 1 year ago

That's why I wrote that if achieving zero-knowledge encryption is not possible due to how Omnivore processes data, then a viable trade-off could be adding e2ee and striving to know as close to zero as possible about the user's content. That means having all user-created content encrypted on the client device, and delivering the processed raw text in a way that does not allow the service provider to link a URI to a specific user or PII. I suggested e2ee and key exchange for the latter, as it would allow exactly that. Efficiency aside, the client device could provide a different public key for every request and, instead of session auth, you could maybe sign requests using both sides' keys so that there is no identifiable data in the session?

jacksonh commented 1 year ago

Thanks. One clarifying question: who do you define as the service provider here? There are a few potential providers I can think of:

This wouldn't be zero knowledge, but if we immediately encrypted content after fetching it on the backend, it would at least eliminate the ability to look up history. It does mean we'd have to redo our search solution and wouldn't be able to perform migrations on data.

jsynowiec commented 1 year ago

From the user's point of view, the hosting the operators use to provide the service doesn't really matter, so in this case the "service provider" for me would be Omnivore Inc or whoever decides to host the backend and offer it as a service. I don't even count the network provider, as in-flight encryption is nowadays something you ought to have, rather than boast that you have it :-)

Another topic is whether the service provider is obliged to use a specific hosting because of laws and regulations (data residency, etc.). But that's not part of this discussion.

My main point is that I, as a user, want my data to be private. Meaning, I don't want others to be able to access my content (highlights, notes, tags, etc.), or use this data for any profiling or tracking (you can profile people based on the content they read). I don't really care how it's done: whether it's stored on my local device and never leaves it (like with Obsidian), kept in a cloud service I trust, or e2e encrypted and stored on a sync server.

thiswillbeyourgithub commented 1 year ago

Personally, I would be okay with hosting it myself as a Nextcloud app. There are no alternatives AFAIK, and that would fill a great need!

E2EE would be nice, but in the end self-hosting is not easy, whereas Nextcloud is really easy to install and many providers actually offer Nextcloud server hosting (Hetzner, Framasoft, ...).

mojo-jojo-7 commented 1 year ago

This would be amazing! Any updates regarding this?