rust-secure-code / cargo-supply-chain

Gather author, contributor and publisher data on crates in your dependency graph.
Apache License 2.0

Cache "live" results from crates.io #48

Open Shnatsel opened 3 years ago

Shnatsel commented 3 years ago

When downloading data via the crates.io API, we could cache it for later reuse. This would help if the user wants to run both the crates and publishers commands for their crate, or to adjust the cargo-metadata parameters (e.g. target platform) and re-run.

The timestamp of when the data was downloaded should be preserved; the cached data should be used only if the --cache-max-age configuration allows it.

If there are any cache entries with a timestamp from the future, they should be discarded.
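The freshness rule described above could be sketched roughly as follows. This is only an illustration, not code from the project: the function name `cache_entry_is_usable` and its parameters are hypothetical, and `max_age` stands in for whatever value `--cache-max-age` resolves to.

```rust
use std::time::{Duration, SystemTime};

/// Decides whether a cache entry downloaded at `downloaded_at` may be reused.
/// All names here are hypothetical; `max_age` stands in for `--cache-max-age`.
fn cache_entry_is_usable(
    downloaded_at: SystemTime,
    now: SystemTime,
    max_age: Duration,
) -> bool {
    match now.duration_since(downloaded_at) {
        // The entry is from the past (or exactly now): reuse it if it is young enough.
        Ok(age) => age <= max_age,
        // `duration_since` returns an error when `downloaded_at` is later than `now`,
        // i.e. the timestamp is from the future, so the entry is discarded.
        Err(_) => false,
    }
}
```

Relying on the `Err` case of `SystemTime::duration_since` means the "timestamp from the future" check needs no separate comparison.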

tarcieri commented 3 years ago

Is the data you need cached locally somewhere already, e.g. either in the crates.io index itself (which can be consumed using the crates_index crate), or via the crate file cache located in ~/.cargo/registry/cache?

Shnatsel commented 3 years ago

Good question! Unfortunately it is not. We need the data about the crates.io publishers, which is present in neither of those places.

tarcieri commented 3 years ago

Aah, unfortunate. Perhaps it'd be worth opening an upstream issue to include that information in the index?

Shnatsel commented 3 years ago

I don't think it's a good idea to include it in the index, actually. This info is not needed for most uses - that's why it's not in the index! It is included in the daily database dumps, but they are currently served as a monolithic ~250 MB .tar.gz archive even though we need only 10 MB (uncompressed) from it. Splitting that into a separate file would achieve a 100x reduction in traffic for the update subcommand; this is discussed in more detail in #45.

Shnatsel commented 3 years ago

If we choose to use a granular cache, it makes sense to store it on disk as JSON, since it's basically a map and we already depend on serde_json for parsing JSON from the crates.io API.

And we already have the cache directory created for storing the crates.io dump.
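To make the "basically a map" idea concrete, here is a minimal in-memory sketch of what a granular cache could look like. Everything in it is an assumption for illustration: the type names `CacheEntry` and `PublisherCache`, the field names, and the choice of Unix-epoch seconds for the timestamp are all hypothetical. Writing the map to disk as JSON is left out so the example depends only on the standard library.

```rust
use std::collections::HashMap;

/// One cached API result for a single crate. Field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct CacheEntry {
    /// Download time, in seconds since the Unix epoch.
    fetched_at: u64,
    /// Publisher logins returned for the crate.
    publishers: Vec<String>,
}

/// The granular cache: a map from crate name to its cached entry.
#[derive(Debug, Default)]
struct PublisherCache {
    entries: HashMap<String, CacheEntry>,
}

impl PublisherCache {
    /// Stores (or overwrites) the entry for `crate_name`, stamped with `now`.
    fn insert(&mut self, crate_name: &str, publishers: Vec<String>, now: u64) {
        let entry = CacheEntry { fetched_at: now, publishers };
        self.entries.insert(crate_name.to_string(), entry);
    }

    /// Returns the cached publishers only if the entry exists, is not stamped
    /// in the future, and is no older than `max_age_secs`.
    fn get(&self, crate_name: &str, now: u64, max_age_secs: u64) -> Option<&[String]> {
        let entry = self.entries.get(crate_name)?;
        // `checked_sub` yields `None` when `fetched_at > now` (future timestamp).
        let age = now.checked_sub(entry.fetched_at)?;
        if age <= max_age_secs {
            Some(entry.publishers.as_slice())
        } else {
            None
        }
    }
}
```

Keeping a timestamp per entry, rather than one for the whole file, lets a single stale crate be refreshed without invalidating the rest of the map.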

tarcieri commented 3 years ago

I'm not sure they ever made a conscious decision whether or not to include it in the index. It's a feature that was added to crates.io quite a while after the index was created. It's also (somewhat) low-cardinality data that would compress well.

I think the nice part about having it in the index is the index provides a timestamped/append-only(-ish) cryptographic(-ish, with the unfortunate problem of SHA-1 collisions) log, so including audit info would commit to that, as opposed to it potentially being retroactively modified by an attacker in the event of a crates.io compromise.

Shnatsel commented 2 years ago

https://crates.io/crates/structsy sounds like a better way to store data on disk than JSON files.