Open Shnatsel opened 3 years ago
Is the data you need already cached locally somewhere, e.g. either in the crates.io index itself (which can be consumed using the crates_index crate), or in the crate file cache located in ~/.cargo/registry/cache?
Good question! Unfortunately it is not. We need the data about the crates.io publishers, which is present in neither of those places.
Aah, unfortunate. Perhaps it'd be worth opening an upstream issue to include that information in the index?
I don't think it's a good idea to include it in the index, actually. This info is not needed for most uses - that's why it's not in the index!
It is included in the daily database dumps, but those are currently served as a monolithic ~250 MB .tar.gz archive even though we need only 10 MB (uncompressed) from it. Splitting that data into a separate file would achieve a 100x reduction in traffic for the update subcommand; this is discussed in more detail in #45.
If we choose to use a granular cache, it makes sense to store it on disk as JSON, since it's basically a map and we already depend on serde_json because we need to parse JSON from the crates.io API.
And we already have the cache directory created for storing the crates.io dump.
I'm not sure they ever made a conscious decision whether or not to include it in the index. It's a feature that was added to crates.io quite a while after the index was created. It's also (somewhat) low-cardinality data that would compress well.
I think the nice part about having it in the index is the index provides a timestamped/append-only(-ish) cryptographic(-ish, with the unfortunate problem of SHA-1 collisions) log, so including audit info would commit to that, as opposed to it potentially being retroactively modified by an attacker in the event of a crates.io compromise.
https://crates.io/crates/structsy sounds like a better way to store data on disk than JSON files.
When downloading data via the crates.io API, we could cache it for later reuse. This would help if the user wants to view both the crates and publishers commands for their crate, or adjust the cargo-metadata parameters (e.g. target platform).
The timestamp of when the data was downloaded should be preserved; the cached data should be used only if the --cache-max-age configuration allows it.
If there are any cache entries with a timestamp from the future, they should be discarded.
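The freshness rule described above could look roughly like the sketch below. It is only an illustration under assumptions: the names CacheStatus and check_entry are hypothetical and not taken from the tool's code, and the max_age argument stands in for whatever value --cache-max-age is parsed into. It uses only the standard library.

```rust
use std::time::{Duration, SystemTime};

/// Outcome of checking one cache entry against the current time.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq, Eq)]
pub enum CacheStatus {
    /// Downloaded no longer ago than the maximum age; safe to reuse.
    Fresh,
    /// Older than the maximum age; must be re-downloaded before use.
    Stale,
    /// Timestamp is later than "now"; the entry should be discarded.
    FromFuture,
}

/// Classify a cache entry by the time it was downloaded.
/// `max_age` is assumed to come from the --cache-max-age setting.
pub fn check_entry(fetched_at: SystemTime, now: SystemTime, max_age: Duration) -> CacheStatus {
    // `duration_since` returns Err when `fetched_at` is later than `now`,
    // which is exactly the "timestamp from the future" case.
    match now.duration_since(fetched_at) {
        Ok(age) if age <= max_age => CacheStatus::Fresh,
        Ok(_) => CacheStatus::Stale,
        Err(_) => CacheStatus::FromFuture,
    }
}
```

A caller would pass SystemTime::now() as now, reuse entries that come back Fresh, re-fetch Stale ones, and delete FromFuture ones. Taking now as a parameter rather than reading the clock inside the function keeps the check easy to test.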