rust-lang / cargo

The Rust package manager
https://doc.rust-lang.org/cargo
Apache License 2.0

Sparse registry indexes should be viewable atomically #10928

Open Eh2406 opened 2 years ago

Eh2406 commented 2 years ago

Problem

With git-based registries any one commit from the registry represented an atomic picture of what all packages were up to at that point in time. This had a number of useful properties:

None of these are possible with the current design of the sparse http-based registries.

Proposed Solution

It's possible to re-obtain the list of all available crates by adding a names-file that lists all crates alphabetically at the root of the index. For crates.io this file would be large, but it would probably compress well enough to be usable.

If we add to that names-file the hash of each index file, then it uniquely identifies a snapshot of the index! Of course, being mostly pseudorandom data, the hashes will compress really badly. At crates.io scale the file will be too big to live in one file. We could split it up into several files, say one per folder in the current index structure. In order to get an atomic picture we would now need a meta-file recording the hashes of the names-files. This works, but would be a lot of overhead for smaller registries. (An RFC for this idea will require describing if/how the number of hash files is configurable.)

What happens if an index file is updated after the names-file cargo was looking at? Cargo will get an index file whose hash does not match the atomic picture requested. So there needs to be some kind of cache buster that makes sure the received version of the file matches the requested version. We could put it in the name: 3/f/foo could become 3/f/foo-abcdef123456. However, this would require crates.io to have the same content at both of those locations (assuming stabilization of the current design of #2789), which is just asking for things to come out of sync. We could use a query parameter: 3/f/foo gets you the latest version no matter the hash, and 3/f/foo?hash=abcdef123456 gets you the version with the hash abcdef123456. However, this requires the backing server to be able to look up versions of a file by their hash. Crates.io is currently using S3, which does not provide this functionality; you can look up old versions of a file, but only by the versionId S3 generated. So let's double the size of our names-files by recording for each crate both a cache-busting string (crates.io can use S3's versionId) and the hash, with the semantics that if you ask for that crate with the cache buster appended, you should always receive a response with that hash.
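
To make the cache-buster idea concrete, here is a minimal sketch (my illustration, not part of the proposal) of how a client could build the request for a pinned snapshot of an index file and check the response against the names-file. It assumes the sha2 and hex crates and a hypothetical `NamesFileEntry` record:

```rust
use sha2::{Digest, Sha256};

/// One row of the hypothetical names-file: crate name, cache-busting
/// string (e.g. an S3 versionId), and the expected SHA-256 of the file.
struct NamesFileEntry {
    name: String,
    cache_buster: String,
    sha256_hex: String,
}

/// The existing index path layout: 1/, 2/, 3/{first char}/, {ab}/{cd}/.
fn index_path(name: &str) -> String {
    match name.len() {
        1 => format!("1/{name}"),
        2 => format!("2/{name}"),
        3 => format!("3/{}/{name}", &name[..1]),
        _ => format!("{}/{}/{name}", &name[..2], &name[2..4]),
    }
}

/// Build the request URL for the pinned version of an index file,
/// using the query-parameter style discussed above.
fn snapshot_url(index_root: &str, entry: &NamesFileEntry) -> String {
    format!("{index_root}/{}?{}", index_path(&entry.name), entry.cache_buster)
}

/// Check the fetched bytes against the hash recorded in the names-file.
fn matches_snapshot(entry: &NamesFileEntry, body: &[u8]) -> bool {
    hex::encode(Sha256::digest(body)) == entry.sha256_hex
}
```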

Notes

If this design is so much better than the current sparse registry proposal, why should we ever stabilize the current proposal?

Fundamentally, this design is going to be more work for servers to implement. The current sparse registry design has exactly the same contents as a git registry, and is therefore very easy for registries to upgrade to. There will be many users for whom the additional complexity is not needed, and the current sparse registry proposal is a better fit.

I did want to open this discussion to make sure we have a path to this more atomic design that is compatible with what we are stabilizing in sparse registries. If changes need to be made to sparse registries, now is the time.

tarcieri commented 1 year ago

Coming from https://github.com/withoutboats/rfcs/pull/7 I was asked how The Update Framework (TUF)'s snapshots feature might help here. For context, TUF is a framework for cryptographically authenticating software updates, and its snapshots feature provides git-like consistency across several files.

TUF has 4 roles: root, targets, snapshot, and timestamp.

Separately, TUF also supports a notion of delegated targets, where you can specify policies for how 3rd-party keys are allowed to sign specific files. That would be useful for letting end users sign crates, but it isn't necessary for using TUF with the sparse index, and it's something that can be pursued as follow-up work, e.g. as part of a Sigstore integration (see https://internals.rust-lang.org/t/pre-rfc-using-sigstore-for-signing-and-verifying-crates/18115/2)

If you're curious whether it's bad for @bors to hold three roles: not necessarily. TUF splits them up to give you the option of eventually factoring them into separate systems, which reduces the overall attack surface of each system and lets them cross-check each other. But you can assign several roles to @bors to get started, and just having the root role separate makes a compromise of @bors survivable.

Eh2406 commented 1 year ago

Does TUF have any advice/recommendations on how we get from a bunch of files to a "snapshot"? The link suggests that the snapshot should include a list of all the files, each with its hash, but this is impractical at the scale of crates.io. For git indexes we already have a Merkle tree, so the snapshot can just be the commit hash (or the tree hash) as suggested in your RFC. For sparse registries we will probably have to invent our own thing, but it might as well learn lessons from and take advice from those who went before us.

tarcieri commented 1 year ago

You'd probably need to talk to a TUF expert about that. I can point one at this issue if you'd like.

It might be possible to use delegated targets as something akin to git tree objects.

mnm678 commented 1 year ago

I'm a TUF maintainer and happy to answer any questions here.

> Does TUF have any advice/recommendations on how we get from a bunch of files to a "snapshot"? The link suggests that the snapshot should include a list of all the files, each with its hash, but this is impractical at the scale of crates.io. For git indexes we already have a Merkle tree, so the snapshot can just be the commit hash (or the tree hash) as suggested in your RFC. For sparse registries we will probably have to invent our own thing, but it might as well learn lessons from and take advice from those who went before us.

This is very similar to a proposal in TUF for this exact problem. The short version is that you can make separate "snapshots" for each file (with the name, version number, and hash), then combine them using a Merkle tree, and sign the root of that Merkle tree with the timestamp role. A sparse Merkle tree would likely make this more efficient while giving the same properties. The proposal is a draft that is mostly pending real-world implementation and validation.
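
To make the shape of such a snapshot concrete, here is a minimal sketch (my own illustration, not taken from TUF or the TAP draft) of a two-level Merkle construction over the existing ab/cd folder layout, assuming the sha2 crate. Signing the resulting root with the snapshot/timestamp role would then commit to every index file:

```rust
use sha2::{Digest, Sha256};
use std::collections::BTreeMap;

/// folder path (e.g. "ab/cd") -> index file name -> hash of that index file
type Index = BTreeMap<String, BTreeMap<String, [u8; 32]>>;

/// Hash of one folder node: the sorted (file name, file hash) pairs it contains.
fn folder_hash(files: &BTreeMap<String, [u8; 32]>) -> [u8; 32] {
    let mut h = Sha256::new();
    for (name, file_hash) in files {
        h.update(name.as_bytes());
        h.update([0u8]); // separator so "ab"+"c" hashes differently from "a"+"bc"
        h.update(file_hash);
    }
    h.finalize().into()
}

/// Root hash over the sorted (folder, folder hash) pairs; signing this
/// commits to every index file in the snapshot.
fn root_hash(index: &Index) -> [u8; 32] {
    let mut h = Sha256::new();
    for (folder, files) in index {
        h.update(folder.as_bytes());
        h.update([0u8]);
        h.update(folder_hash(files));
    }
    h.finalize().into()
}
```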

Eh2406 commented 1 year ago

That is an extremely interesting read. Thank you! If I read it correctly, it puts the contents of the files (and the files' metadata) in a Merkle tree, and then one signature on the root of that tree is sufficient to sign the entire thing. In that way it is similar to the original proposal in this issue. The proposal uses a binary Merkle tree, whereas this issue suggests a per-folder Merkle trie. I understand that a trie only makes sense for certain kinds of repository data. Are there significant advantages to a binary system over a trie system? Similarly, I had not heard of sparse Merkle trees before. Can you speak to their benefits in this context?

tarcieri commented 1 year ago

Sparse Merkle trees work well for hashmap-like key/value data, and many are designed to scale well with constant updates.

Here's a Rust implementation of one:

https://github.com/diem/diem/blob/main/storage/jellyfish-merkle/src/lib.rs

This paper describes a similar idea:

https://eprint.iacr.org/2018/955.pdf

mnm678 commented 1 year ago

> That is an extremely interesting read. Thank you! If I read it correctly, it puts the contents of the files (and the files' metadata) in a Merkle tree, and then one signature on the root of that tree is sufficient to sign the entire thing. In that way it is similar to the original proposal in this issue. The proposal uses a binary Merkle tree, whereas this issue suggests a per-folder Merkle trie. I understand that a trie only makes sense for certain kinds of repository data. Are there significant advantages to a binary system over a trie system? Similarly, I had not heard of sparse Merkle trees before. Can you speak to their benefits in this context?

The tree shape should only affect performance, and not the security properties, so splitting the tree per folder would work as long as all data is included.

As @tarcieri mentioned, the sparse Merkle tree has performance benefits for frequent updates, as fewer of the nodes have to change when a leaf is updated.

Eh2406 commented 1 year ago

I found some time today to reread the TUF specification. I've been using some terminology wrong: I don't think it's technically the size of snapshot.json but the size of targets.json that is a scaling problem. The specification says:

> Each key of the TARGETS object is a TARGETPATH.

Which suggests that if TUF were applied directly, targets.json would list every file available from the registry, each with at least one hash, a length, and "custom" data. The custom data could be what's currently stored in the index. Crates.io serves 101k package versions, which would make for a rather large targets.json file. But I must be missing something, because PyPI is at 428k. So I looked at PEP 458, which claims to have solved the scalability problem using succinct hashed bin delegations (TAP 15). But reading the specification, path_hash_prefixes has to do with delegations. My understanding is that delegations describe which roles are allowed to sign the final artifacts. I don't see how path_hash_prefixes gets you out of targets being O(number of package versions). What am I misunderstanding about the specification?

tarcieri commented 1 year ago

@Eh2406 delegated targets are each stored in their own .json files, which can be independently updated (and I believe they also support nested delegations, which would allow for a delegation hierarchy?)

Though their main use case is to delegate authority via different signing keys, I think it might also be possible to use them with the same signing keys to break up targets.json into many small files?

mnm678 commented 1 year ago

@Eh2406 TUF does the final artifact signing through targets roles. So targets.json only lists the files that this particular role is signing, and delegations to other roles. Hashed bin delegations can help with scalability when one role has a lot of files that it is signing.

trishankatdatadog commented 1 year ago

> None of these are possible with the current design of the sparse http-based registries.

Thanks for opening the discussion! To start with, could someone please point me to documentation on these sparse HTTP-based registries, so that at least I can help better think about the problem, and advise on how to use TUF here?

ehuss commented 1 year ago

The documentation is in a to-be-landed PR here: https://github.com/rust-lang/cargo/pull/11224/files#diff-5e9d17cb5bab9eebb8cb0dc5e276eea6f62ffbce72a0e763a28cf102493e2104

kornelski commented 1 year ago

A couple of counter points:

  1. the git version of the index exists. Users who want to obtain the complete atomic snapshot of the index can still do so via the git protocol. Cargo will support git forever. I think crates.io also plans to keep the git version indefinitely.

  2. For the case of publishing and correctly resolving a set of dependent crates, a globally-consistent snapshot is not necessary. Correct dependency resolution only requires preserving a relative ordering between dependencies. This is a much smaller problem, with solutions suggested in the sparse registry RFC (in short, stale files can be detected and cache-busted).

  3. Even when using git registries, Cargo doesn't always end up with a "consistent" set of dependencies based on the same git index commit. When a user adds new dependencies or increases their versions, and there's an existing Cargo.lock, Cargo only adds/updates a minimal set of dependencies, keeping other crates locked to a previous version when possible. This makes the set of dependencies a mix of resolutions done at different times, from different git commits of the index. An updated Cargo.lock may be different from what you'd get if you deleted Cargo.lock and resolved it again from scratch. So in Cargo, global consistency never existed, and that's a feature, not a bug.

kornelski commented 1 year ago

Because git had an easy way to represent a snapshot, signing the commit hash was an easy way to prove the authenticity of all the data in the commit. However, I don't think the logic necessarily goes the other way: to prove authenticity of all the data, you don't have to have a single snapshot.

For example, every individual file could be signed separately. This is still a guarantee that the file came from crates.io at some point, so it still prevents the CDN or some other network MITM from tampering with the data and publishing something that has never been published.

Files can come with a signed version or signed last-modified timestamp, so rollback attacks (using terminology from TUF) on them can still be prevented.

A global snapshot has the advantage that a freeze attack requires the attacker to freeze everything, rather than individual files, so such an attack is easier to spot. However, minimum version requirements in Cargo.toml still limit how much an attacker can withhold. If you bump a vulnerable dependency requirement to a non-vulnerable version, that dependency has to exist and can't be withheld without breaking resolution. If this attack is important to protect against, the registry could let Cargo know about the latest-known-to-exist versions of dependencies in the metadata, separately from the minimum required versions, to require Cargo to refresh other files anyway. The proposed incremental changelog extension of the sparse registry protocol would also make a freeze attack harder to pull off.

Eh2406 commented 1 year ago

Your counterpoints are, of course, all correct. Dependency resolution, which is the primary job of the index, does not require any of the properties asked for in this issue. And all of the properties requested in this issue can be obtained by falling back on the git index.

Nonetheless, some of these properties are needed for other desired functionality in cargo, under circumstances that make "just pull the git index" too slow: specifically, suggesting alternative names when a package is not found.

> The proposed incremental changelog extension of the sparse registry protocol would also make a freeze attack harder to pull off.

Yes, at some level this is an alternative API for a changelog. For performance, any changelog system needs a "snapshot" mechanism for compressing the history into a description of the current state. At the extreme, the registry could just publish snapshots, and the changes could be reconstructed by taking diffs, which is the proposal being suggested here. Thank you for reminding us to think about whether a more explicit method for tracking changes leads to a better API.

Eh2406 commented 1 year ago

I have collected some relevant data. The plain names file for the current crates.io index is 1.5MB, but only 0.5MB when gzipped. We thought this would compress well, and it did. Adding a SHA-256 hash as hex balloons the file size to 8.1MB, and 4.6MB when gzipped. As expected, it did not compress particularly well. I have not yet looked at binary formats; let me know on Zulip if there's something you want me to try. Also for reference, I looked at how many ab/cd prefixes there are: about 27K. The original proposal here has a little more than one file per prefix, so downloading the entire tree will involve a lot of requests.

This also got me thinking about one of the primary use cases: "suggesting alternative names when a package is not found". If the user typed in the package abcdefghij, we can look at the file ab/cd/abcdefghij to see if there is a package that normalizes to the same name. If we use a Merkle tree based on folders, we can look at the node ab/cd to get all the names that share that prefix. However, if we want to search for packages that are within a small Levenshtein distance of abcdefghij, we end up having to download almost all the files. If small >= 4, then I think it literally is all files. It feels like 27K is too many requests to be making while a user waits for a response on the command line.

Eh2406 commented 1 year ago

I was recommended to try https://en.wikipedia.org/wiki/Incremental_encoding on the plain names file. With a naive implementation (which is only valuable for size comparisons) it comes to 0.65MB, and 0.35MB after gzipping. That is a nontrivial size savings. Of course, defining our own pre-compression step will require careful documentation and testing.
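
A naive version of that pre-compression step looks roughly like this (a sketch for size experiments only, not the hardened implementation mentioned later in the thread). It assumes the names list is already sorted and ASCII:

```rust
/// Front-code a sorted list of crate names: each entry stores how many
/// bytes it shares with the previous name, plus the differing suffix.
/// Crate names are ASCII, so byte-wise prefixes are safe to slice on.
fn encode(names: &[String]) -> Vec<(u8, String)> {
    let mut prev = "";
    let mut out = Vec::with_capacity(names.len());
    for name in names {
        let shared = prev
            .bytes()
            .zip(name.bytes())
            .take_while(|(a, b)| a == b)
            .count()
            .min(u8::MAX as usize);
        out.push((shared as u8, name[shared..].to_string()));
        prev = name.as_str();
    }
    out
}

/// Rebuild the full names from the front-coded form.
fn decode(entries: &[(u8, String)]) -> Vec<String> {
    let mut out: Vec<String> = Vec::with_capacity(entries.len());
    for (shared, suffix) in entries {
        let mut name = out
            .last()
            .map(|prev| prev[..*shared as usize].to_string())
            .unwrap_or_default();
        name.push_str(suffix);
        out.push(name);
    }
    out
}
```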

kornelski commented 1 year ago

I believe that trying to correct typos on the client side would amplify the dangers of typosquatting. The index lacks information about popularity/reputation/relevance, so it can't pick a better crate from among candidates; it can only pick based on names alone, without knowing which ones are spam, and use of mere edit distance opens up a risk of picking easy-to-register close names (e.g. instead of correcting git2-sys to libgit2-sys, it could pick git1-sys).

Because typo handling is not a high-traffic service critical for Cargo's uptime, and crates.io has all of the necessary information available server-side to make good and safe suggestions, I think this particular case should be handled with a server-side API instead. cargo search already set a precedent for using a server-side API instead of scanning the index.

Eh2406 commented 2 months ago

This came up again in discussions today. I tracked down the code I used for trying incremental encoding and ran it again to collect new numbers. After posting, I had hardened the implementation so that the data always round-trips successfully, as tested by proptest. We're now looking at 1.2 MB, which compresses to 658 KB. This assumes we're storing a u16 version number and not a hash. Here is the code.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=b01b9475066689ea95ed92e245670b03

kornelski commented 2 months ago

Have use-cases for this emerged?

The sparse protocol has been the default for a while now. I haven't seen people asking for consistent snapshots (or maybe the git copy is sufficient?)

I've implemented a private registry and a few registry-adjacent tools, and haven't needed the cross-crate consistency myself. For normal use, cargo publish waiting for the cache to purge helps a lot. My own registry implementation is based on a key-value store, so it doesn't even have a notion of a global version.

BTW, the git index has broken crates depending on crates that don't exist any more, so even use of the git snapshot requires dealing with "dangling pointers" of crates.

Eh2406 commented 2 months ago

No new use cases have emerged. As you and I have both pointed out to many people, many times, dependency resolution does not need this issue to be solved. For now, we are recommending that users who need consistent views or an exact list of changes use the git protocol.

Where this conversation came up again is in discussions of un-trusted mirrors, specifically signing schemes that prevent a mirror/MITM from serving an old version of an index file. In TUF terminology this is called a rollback attack. The signature on each index file could be short-lived to prevent this attack, but then crates.io would need to continuously refresh the signatures for all index files, which ends up being expensive when the registry gets big. A solution to this issue would allow most files to have a long-lived signature, with only the "consistent snapshot" file needing a frequently refreshed signature. (Probably stored in a separate "timestamp" file for bandwidth reasons, but that's an implementation detail.)

The other major use case for this is tools that want to process ALL changes to the index, for example mirrors that want to proactively ingest all new publishes, or security scanners. These tools can currently reliably run off the git index, which can provide deltas between "the current state" and "the state when the tool was last run", as long as the tool has access to a file system to do the git operations on. Or they can be built on top of the just-released RSS support in crates.io, as long as they can guarantee that they always run more frequently than the RSS expiration time.

tarcieri commented 2 months ago

Since it hasn't been mentioned yet: TAP16: snapshot Merkle trees would probably be the best mechanism on the TUF side to encode this, if that's desired (though I understand there's also work on using RSA accumulators which could potentially even be more succinct)

trishankatdatadog commented 2 months ago

> Since it hasn't been mentioned yet: TAP16: snapshot Merkle trees would probably be the best mechanism on the TUF side to encode this, if that's desired (though I understand there's also work on using RSA accumulators which could potentially even be more succinct)

BTW, you can find @mnm678's investigation of Merkle trees vs RSA accumulators in Chapter 3 of her PhD thesis. If I'm not mistaken, I think she and Zachary Newman found that Merkle trees generally work better.

trishankatdatadog commented 2 months ago

BTW I'll also be working on productionizing RSTUF for PEP 458 in Q4, so let me know if you have any Qs or would like to collaborate!

Eh2406 commented 3 weeks ago

For the past 2 weeks I've been studying an intriguing but overly complicated idea. I'm going to write down what I figured out in the hope that I can stop thinking about it and move on.

Some review: in order to have a snapshot of the index, one needs a list of every file and, for each file, a hash or version number. A simple CSV of name,hash ends up being quite large and does not compress well, mostly because the hash has very high entropy. If we replace the hash with a modification number, then we need another layer of protection to make sure a MITM is handing us the correct data for that modification number. So let's make every index file come with a cryptographic signature from the appropriate TUF role validating the contents and the version number. Now our name,modification file is ~758 KiB compressed. If we apply incremental encoding on the names, we get ~600 KiB compressed. (Numbers are larger than in previous updates as the index keeps growing.)

New ideas: It's kind of a shame to tie the large "list of names" to the "modification numbers". A new name getting added to the index happens an order of magnitude less frequently than a new version being published, and the names are ~75% of the size of the file. Most downloads of a new snapshot will end up re-downloading a large amount of name data that has not changed. So let's put them into two files:

  1. The list of names incrementally encoded. This file ends up being ~400 KiB compressed.
  2. The list of modification numbers. This file ends up being ~349 KiB uncompressed and ~99.3 KiB compressed.

To check the most current modification number for a crate, you find the position of the crate name in the first file and look up the value at the corresponding spot in the second file. As @kornelski points out, most users do not need to know about a new name; they only want to check whether any of the crates they used for resolution have been held back by a MITM. For this use case they do not need the list of names; they only need the modification numbers. But the entire file-size savings came from not storing the names in the second file. What to do...
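
The two-file lookup is just an index correspondence; a minimal sketch (both file layouts are hypothetical):

```rust
// `names` is the decoded, sorted contents of file 1; `modifications` is
// file 2, where entry i is the modification number of names[i].
fn modification_number(names: &[String], modifications: &[u16], krate: &str) -> Option<u16> {
    // File 1 is sorted, so the position can be found with a binary search.
    let idx = names.binary_search_by(|n| n.as_str().cmp(krate)).ok()?;
    modifications.get(idx).copied()
}
```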

Probabilistic data structures: There are data structures that act like hash maps but do not store the keys. The state-of-the-art algorithms can guarantee that get(key) -> Option<value> on the data structure returns the correct answer for all keys that were used during construction. They may spuriously return Some(made-up value) for keys that were not used during construction. They can be constructed very quickly, in O(count of keys) space. If we used one of these data structures, most users could look up what they care about from file 2 without needing to look at file 1! (The exception being people who are doing a full mirror or scan of the index, but they can probably afford to download 0.5 MB every few minutes.) What about the spurious return values? If a user is looking up a crate that's not in the index, and file 2 incorrectly returns version number 42, then the client will ask for ab/cd/abcd-fobar/v/42 from their mirror/MITM. A non-malicious mirror will return 404, correctly telling the client the file doesn't exist. A malicious mirror can at best return a file whose signature does not match the requested path. So a user can be guaranteed that if they receive an index file, it is authentic and up-to-date. If they receive a 404, they can double-check with file 1 if they want to be extra careful.

The complexity of basic operations:

The XOR filters provide one family of state-of-the-art algorithms in this space. Big thanks to @ayazhafiz's xorf library, which allowed me to experiment with these algorithms in Rust. The very nice thing about this family of algorithms is that all of the complexity is in constructing the data structure. The data structure itself can be serialized approximately as (seeds: [u16; 3], data: Vec<u16>), and can be queried with data[hash(key, seed[0], data.len())] ^ data[hash(key, seed[1], data.len())] ^ data[hash(key, seed[2], data.len())]. Using BinaryFuse (modified from the library), it can be stored in ~368 KiB uncompressed and ~270 KiB compressed. It turns out that there are surprisingly few distinct values in the list of modification numbers; if we instead store (modifications: Vec<u16>, seeds: [u16; 3], data: Vec<u8>), the file size shrinks to ~196 KiB uncompressed, ~164 KiB compressed.
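
The query side of such a structure really is tiny; here is a sketch of just that half (my own simplification of the xor-filter layout described above, with DefaultHasher standing in for the real keyed hash, and omitting the construction step and the per-segment indexing that real binary fuse filters use):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Serialized form: three hash seeds plus the solved data array.
struct XorMap {
    seeds: [u16; 3],
    data: Vec<u16>,
}

impl XorMap {
    /// XOR of three pseudorandomly chosen slots; construction guarantees
    /// this equals the stored value for every key used to build the map.
    fn get(&self, key: &str) -> u16 {
        let len = self.data.len();
        self.data[hash(key, self.seeds[0]) % len]
            ^ self.data[hash(key, self.seeds[1]) % len]
            ^ self.data[hash(key, self.seeds[2]) % len]
    }
}

/// Stand-in keyed hash; the real construction fixes a specific hash family.
fn hash(key: &str, seed: u16) -> usize {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    key.hash(&mut h);
    h.finish() as usize
}
```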

These file sizes seem big because they are approaching optimal for O(count of keys), which is much larger than O(entropy of values). The algorithms used for constructing the XOR family involve creating a mapping so that each key has a slot it is allowed to modify and each slot is only modified by one key, so they are not amenable to additional compression related to the distribution of values. The "ribbon filter" family of algorithms poses the problem as a matrix decomposition, and might be more flexible. After some digging I found @bitwiseshiftleft's compressed_map crate. While I did not succeed at understanding or modifying the underlying algorithms, the CompressedMap from that crate does exactly what we want. It ends up being ~80 KiB uncompressed, ~81 KiB compressed.

Whether any of this is useful or not I'm not sure. But hopefully I can be done thinking about it.

Edit: I was really hoping that by writing it up I could be done thinking about it, but it didn't work... I think the algorithm is using "arithmetic coding" (or some homegrown alternative), but I can explain it better using Huffman encoding.

Take a step back. The size of a probabilistic map is (some constant C) * (the number of keys we want to guarantee we get the correct answer for) * (the size of each value), if we ignore taking advantage of "accidentally getting the correct answer". At first blush it seems like we're stuck. The size of each value is basically fixed, as is the number of keys we care about getting the correct answer for. We could deal with different bits of the value in different structures, but our formula is linear, so that doesn't get us anything: it takes the same amount of space to have N structures that each encode 1 bit as it does to have 1 structure that encodes N bits.

That's where Huffman encoding comes in. We encode the values according to the Huffman algorithm. Then we process things one bit at a time. The first data structure encodes the first bit and needs to be correct for every key, so it takes quite a bit of space; it avoids being a disastrous amount because it's only one bit wide. We can do this again for the next bit, but this time we can skip any keys whose Huffman encoding has already reached a leaf node. We don't care what value is returned by the structure for that bit of that key, because a reader processing that key will already have found a value. With every additional bit we have fewer keys to process and therefore a smaller data structure.
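
To pin down the layered idea, here is a sketch of the read side only. Plain HashMaps stand in for the 1-bit probabilistic maps so the control flow is easy to follow (a real layer would always return some bit, possibly a spurious one, for keys it was not built for):

```rust
use std::collections::HashMap;

/// A binary Huffman tree over the possible values.
enum Huffman<V> {
    Leaf(V),
    Node(Box<Huffman<V>>, Box<Huffman<V>>),
}

/// layers[i] answers "what is bit i of this key's Huffman code?" and only
/// needs to be correct for keys whose code is longer than i bits.
fn lookup<V: Clone>(tree: &Huffman<V>, layers: &[HashMap<String, bool>], key: &str) -> Option<V> {
    let mut node = tree;
    for layer in layers {
        // A reader that has reached a leaf stops here; later layers never
        // need to hold correct data for this key.
        if let Huffman::Leaf(v) = node {
            return Some(v.clone());
        }
        let bit = *layer.get(key)?;
        node = match node {
            Huffman::Node(zero, one) => if bit { &**one } else { &**zero },
            Huffman::Leaf(_) => unreachable!(),
        };
    }
    match node {
        Huffman::Leaf(v) => Some(v.clone()),
        Huffman::Node(..) => None,
    }
}
```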

tarcieri commented 3 weeks ago

> These file sizes seem big because they are approaching optimal for O(count of keys), which is much larger than O(entropy of values). The algorithms used for constructing the XOR family involve creating a mapping so that each key has a slot it is allowed to modify and each slot is only modified by one key, so they are not amenable to additional compression related to the distribution of values. The "ribbon filter" family of algorithms poses the problem as a matrix decomposition, and might be more flexible. After some digging I found @bitwiseshiftleft's compressed_map crate. While I did not succeed at understanding or modifying the underlying algorithms, the CompressedMap from that crate does exactly what we want. It ends up being ~80 KiB uncompressed, ~81 KiB compressed.

Interesting! Nice to see @bitwiseshiftleft is writing Rust, too!

@Eh2406 If you're generally interested in efficiently updatable verifiable map-like data structures, it would also be worth taking a look at Jellyfish Merkle Trees. The Rust implementation was originally developed by Libra, but there are several forks, for example: https://docs.rs/jmt/

bitwiseshiftleft commented 3 weeks ago

Oh hello @tarcieri, nice to see you too.

So, as not an expert in the Rust registry system, I don't really know whether compressed_map or similar is suitable here, or if you'd rather just use Merkle trees. Merkle usually works well if you want short proofs and efficient updates, which seems appropriate here, whereas compressed_map is good for downloading some info on the state of the whole system if you don't need to know what keys are in the map. It's also only research-grade, and not updatable (none of these matrix things are) except by using it to compress deltas.

My compressed_map crate can be thought of as more or less the same idea as a Huffman code plus a BinaryFuse-like matrix map, much as @Eh2406 suggests. It's not quite a Huffman code because the possibility of getting the correct answer "by accident" changes the cost model, but it's the same idea. The matrix map is much more complicated to solve than BinaryFuse (which came out later), and also slower, but it gets a few % better compression.

This was my first (and so far only) significant Rust project so that is part of why the code is difficult to read. Also I never really finished the paper on it...

trishankatdatadog commented 2 weeks ago

> In order to have a snapshot of the index, one needs a list of every file and, for each file, a hash or version number.

Sorry for the late comment, but you don't need to list hashes unless you are worried about malicious mirrors (see Section 5.6 in this paper). If there are no mirrors and only ever one canonical package index, then you can safely save space by omitting the hashes.

Edit: It is possible I misunderstood what you meant by "snapshot". If you mean the TUF snapshot metadata, then my comment above holds true. If not, and you mean what is called the TUF targets metadata, then please ignore πŸ™‚

Eh2406 commented 1 week ago

I am uncertain whether I got the TUF terminology correct. In either case, malicious mirrors are exactly why this issue has come up again. Currently it requires a large amount of trust and coordination for a new official mirror to be set up; we only go through this effort for the official CDNs. If we had TUF-like protections against malicious mirrors, then the project would have a much easier time endorsing whatever other mirrors people find useful. (We are particularly excited about someone building mirrors for common CI providers.)