API endpoints insufficient for mirroring

djspiewak commented 9 years ago

So I've started to more seriously toy with the idea of using one of my servers as a true, publicly-accessible third party mirror for Keybase. Unfortunately, it looks like the API as it currently stands does not provide sufficient endpoints for achieving this goal. Or rather, I should say that it doesn't provide sufficient endpoints for achieving this goal with any degree of network efficiency.

The core problem is that there is no (good) way to enumerate the set of all users. It's technically possible to trick the API into this functionality via the /autocomplete endpoint, but doing so would require trial-and-error enumeration of various alphabetical prefixes, each of which entails an HTTP round-trip. Terrifying to say the least.
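
To make the cost concrete, here is a minimal sketch of that prefix enumeration. The `autocomplete` callable, its result cap (`limit`), its lexicographic ordering, and the username alphabet are all assumptions made for illustration; none of them are taken from the real endpoint. Each call stands in for one HTTP round-trip.

```python
import string

# Assumed username alphabet; the real rules may differ.
ALPHABET = string.ascii_lowercase + string.digits + "_"

def enumerate_users(autocomplete, limit, alphabet=ALPHABET):
    """Enumerate usernames by probing prefixes.

    `autocomplete(prefix)` is a hypothetical stand-in for one HTTP call.
    It is assumed to return at most `limit` usernames starting with
    `prefix`, in lexicographic order (so an exact match comes first).
    Returns the set of usernames found and the number of calls made.
    """
    found, calls = set(), 0
    stack = list(alphabet)
    while stack:
        prefix = stack.pop()
        results = autocomplete(prefix)
        calls += 1
        found.update(results)
        # A full page may be truncated, so every longer prefix must be probed.
        if len(results) >= limit:
            stack.extend(prefix + c for c in alphabet)
    return found, calls
```

Every prefix that fills a page fans out into one more request per character of the alphabet, which is where the round-trip count explodes.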

What I would rather have is some way to stream the directory of all user JSON objects from Keybase to my server in a relatively direct fashion. Presumably, this would be a very large amount of data (hence the streaming), and requests to this endpoint could and should be throttled significantly. As an additional safety measure, you could attach an API key validation to this special endpoint, making it slightly easier to block abusers.

In general though, I'm pretty sure this is a requisite first step to getting real read-only mirroring in place.

maxtaco commented 9 years ago

Thanks for taking a look at this. Have you checked out the merkle root and block endpoints?

I was thinking mirroring would work via depth-first traversal of the merkle tree. That will get you the initial state, and then you can poll for updates. Once you see there's an update, you can do a simple diff of the new tree and the old pretty easily (there will only be one leaf difference).
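
A rough sketch of that traversal and diff, under stated assumptions: the node shape (`"type"`, `"children"`, `"entries"`) and the `fetch` callable are hypothetical and do not reflect the actual response format of the merkle endpoints. The point is only that subtrees with identical hashes can be skipped, so a diff touches a single root-to-leaf path.

```python
from typing import Callable, Dict, Iterator, Optional, Tuple

# Hypothetical node shapes (not the real block format):
#   interior: {"type": "node", "children": {prefix: child_hash}}
#   leaf:     {"type": "leaf", "entries": {uid: value}}
Fetch = Callable[[str], dict]

def walk_leaves(root_hash: str, fetch: Fetch) -> Iterator[Tuple[str, str]]:
    """Depth-first traversal yielding (uid, value) for every leaf entry."""
    stack = [root_hash]
    while stack:
        node = fetch(stack.pop())
        if node["type"] == "leaf":
            yield from sorted(node["entries"].items())
        else:
            # Push in reverse so children are visited in sorted prefix order.
            for prefix in sorted(node["children"], reverse=True):
                stack.append(node["children"][prefix])

def diff_trees(old_hash: Optional[str], new_hash: str, fetch: Fetch) -> Dict[str, str]:
    """Return leaf entries that are new or changed in the new tree.

    Subtrees whose hash is unchanged are skipped without being fetched.
    If a leaf was split into an interior node, all of its entries are
    reported, which is conservative but safe for a mirror.
    """
    if old_hash == new_hash:
        return {}
    new = fetch(new_hash)
    old = fetch(old_hash) if old_hash is not None else None
    if new["type"] == "leaf":
        old_entries = old["entries"] if old and old["type"] == "leaf" else {}
        return {u: v for u, v in new["entries"].items() if old_entries.get(u) != v}
    old_children = old["children"] if old and old["type"] == "node" else {}
    changed: Dict[str, str] = {}
    for prefix, child in new["children"].items():
        changed.update(diff_trees(old_children.get(prefix), child, fetch))
    return changed
```

A mirror would call `walk_leaves` once for the initial state and `diff_trees(previous_root, current_root, fetch)` after each poll that reports a new root.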

Let me know if anything else comes up or how we can help.

djspiewak commented 9 years ago

Ah, the Merkle tree is even better than what I was envisioning! However, unless I'm missing something obvious, it doesn't seem that the leaves give any information which can then be passed to user/lookup. Specifically, the lookup endpoint doesn't allow lookups by user id, which would probably be the most useful way to tie the two together.

This of course runs face-first into exactly what a read-only mirror would accomplish. I was thinking of mirroring a few things.

The Merkle tree itself (duh)
All user objects in json form
All ASCII armored public keys
Track proofs

I was envisioning a simple mirror script that would bring all of this in as static files. Unfortunately, this immediately means that such a mirror would be unable to replicate the exact Keybase API. For example, the search functionality of user/lookup wouldn't work with this sort of mirror. Instead, you would need to do something like users/<id>.json. That's not actually a particularly bad thing, because the ids are unique and can be validated against the mainline API, but it does make it a bit less "drop in", since it creates a hard distinction between the read-only API and a "full" API. This distinction is annoying, but I think the benefits are worth the price: namely, this sort of script serving things up from a static filesystem makes it very, very easy for anyone to run their own read-only mirror.

The second problem with this is the API doesn't seem to expose tracking proofs? Or at least, I couldn't find them anywhere.

In other news, I put some thought into what sort of validation and trust benefits can be gained from this. My first thought is that, obviously, a well-behaved mirror script shouldn't take the upstream API's word for things. So, the Merkle hashing should be checked with each change, and all updated signatures should also be verified. Additionally, the mirror should publish its own version of the Merkle root, identical to the upstream API but signed using the mirror's key.
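
As a sketch of the "don't take the upstream's word for it" part, a mirror can refuse any block whose bytes do not hash to the value its parent (or the signed root) referenced. The choice of SHA-512 over the raw serialized bytes and the `fetch_raw` callable are assumptions for illustration. Signature verification and re-signing the root with the mirror's key are not covered here.

```python
import hashlib
import json

class HashMismatch(Exception):
    """Raised when a fetched block does not match the hash that named it."""

def verified_fetch(block_hash: str, fetch_raw) -> dict:
    """Fetch a block and check it against its expected hash before parsing.

    `fetch_raw(block_hash)` is a hypothetical callable returning the raw
    bytes of a block. The hash function (SHA-512, hex digest) is assumed.
    """
    raw = fetch_raw(block_hash)
    actual = hashlib.sha512(raw).hexdigest()
    if actual != block_hash:
        raise HashMismatch(f"expected {block_hash[:16]}..., got {actual[:16]}...")
    return json.loads(raw)
```

Because each interior node names its children by hash, checking every block on the way down from a verified root is enough to authenticate the whole tree.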

The more interesting element of all this is that it gives clients a secondary source to check the upstream API against. Ideally, clients should be aware of many, many read-only mirrors. Whenever they read information from the upstream API, they should randomly select a couple of read-only mirrors and fetch the same user records. In a truly paranoid mode, they would also validate the path through that server's Merkle tree, as well as the root signature. This cross-checking distributes the trust a bit. Potentially all of the read-only mirrors would need to be compromised by a malicious entity seeking to modify an entry. Additionally, even a single well-behaved mirror would be sufficient to detect tampering on the upstream server's side, reducing the trust bestowed on the upstream API.
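
A minimal sketch of that client-side cross-check, assuming each source can be modeled as a callable from user id to a JSON-like record. The function names, the `sample_size` default, and the canonicalization used for comparison are all illustrative choices. Merkle path and root signature validation (the "paranoid mode") are left out.

```python
import hashlib
import json
import random

def record_digest(record: dict) -> str:
    """Digest of a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()

def cross_check(uid, upstream, mirrors, sample_size=2, rng=None):
    """Compare the upstream record for `uid` against randomly chosen mirrors.

    `upstream` is a callable uid -> record; `mirrors` maps a mirror name to
    such a callable (both hypothetical). Returns the names of sampled
    mirrors whose record disagrees with the upstream one.
    """
    rng = rng or random.Random()
    expected = record_digest(upstream(uid))
    chosen = rng.sample(sorted(mirrors), min(sample_size, len(mirrors)))
    return [name for name in chosen if record_digest(mirrors[name](uid)) != expected]
```

A non-empty result does not say which side is wrong, only that the upstream and at least one mirror disagree and the record deserves closer inspection.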

Anyway, these are random thoughts. The products of a few afternoons of noodling. Let me know if this does (or doesn't) line up with what you were envisioning with respect to read-only mirrors.

maxtaco commented 9 years ago

I think a downstream mirror could answer user/lookup.json?username=max. It could reconstruct the username <-> uid mapping (which is static and 1-to-1) by looking at individual users' signature chains.
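
One way to sketch that reconstruction, assuming each signature chain link carries a `payload_json` string whose `body.key` object holds both `uid` and `username`. That field layout is an assumption here and should be checked against real chain data.

```python
import json

def build_user_index(sigchains):
    """Build uid -> username and username -> uid maps from signature chains.

    `sigchains` is an iterable of chains, each a list of links shaped like
    {"payload_json": "<json string>"} (an assumed shape). Raises ValueError
    if the mapping turns out not to be 1-to-1.
    """
    by_uid, by_name = {}, {}
    for chain in sigchains:
        for link in chain:
            key = json.loads(link["payload_json"])["body"]["key"]
            uid, name = key["uid"], key["username"]
            # setdefault returns the existing value if one was already recorded.
            if by_uid.setdefault(uid, name) != name:
                raise ValueError(f"uid {uid} maps to more than one username")
            if by_name.setdefault(name, uid) != uid:
                raise ValueError(f"username {name} maps to more than one uid")
    return by_uid, by_name
```

With `by_name` in hand, a mirror can answer a lookup by username by translating it to a uid and serving the record it already stores under that id.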

The API does expose tracking proofs. For example, search for "apg" in my signatures. Tracking proofs are signed statements like any other, and wind up in the user's signature chain. The downstream mirror can also recheck the external proofs as the server does (by making curl requests to reddit, twitter, etc.).

Agreed about the security benefits! Your understanding does line up with ours. Thanks for taking the time to hack on this, we greatly appreciate it.

djspiewak commented 9 years ago

> I think a downstream mirror could answer user/lookup.json?username=max. It could reconstruct the username <-> uid mapping (which is static and 1-to-1) by looking at individual users' signature chains.

It's doable with some aliasing. Keeping the mirror purely within the realm of static files (for better performance properties and ease of setup), I think a bit of symbolic linking should do the trick. It would be sensitive to invalid parameters though, since there would actually need to be a raw file on disk named lookup.json?username=max.
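
A small sketch of the symlink approach, assuming a POSIX filesystem and a layout of `users/<id>.json` for the canonical record with an alias under `user/`. The directory names and the `publish_user` helper are illustrative. Whether a given web server will serve a file whose name contains `?` without extra rewriting is not addressed here.

```python
import json
import os
from pathlib import Path

def publish_user(root: Path, uid: str, username: str, record: dict) -> Path:
    """Write users/<uid>.json and alias it as user/lookup.json?username=<name>.

    Returns the path of the alias. The layout is a hypothetical example of a
    static mirror tree, not an established convention.
    """
    users = root / "users"
    lookup = root / "user"
    users.mkdir(parents=True, exist_ok=True)
    lookup.mkdir(parents=True, exist_ok=True)

    target = users / f"{uid}.json"
    target.write_text(json.dumps(record, sort_keys=True))

    alias = lookup / f"lookup.json?username={username}"
    if alias.is_symlink() or alias.exists():
        alias.unlink()
    # Relative link so the whole tree can be moved or rsynced as a unit.
    alias.symlink_to(os.path.relpath(target, lookup))
    return alias
```

As noted above, only the exact alias name resolves: any extra or reordered query parameter misses the file entirely.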

> The API does expose tracking proofs. For example, search for "apg" in my signatures. Tracking proofs are signed statements like any other, and wind up in the user's signature chain.

Ah, the sigs/get endpoint was what I was missing. Is this one documented? I see documentation for various other sigs-related stuff, but not get. Also, I'm still missing a way of mapping user IDs back to usernames, outside of potentially parsing the sigs/get output?

> The downstream mirror can also recheck the external proofs as the server does (by making curl requests to reddit, twitter, etc.)

Indeed it should.

> Thanks for taking the time to hack on this, we greatly appreciate it.

My pleasure! I think it's important for Keybase to have external, clean room third-party mirrors. Without them, I really doubt that the most paranoid among us will ever trust the service. Not that there's anything wrong with that…

keybase / keybase-issues

API endpoints insufficient for mirroring #1266