rust-secure-code / cargo-supply-chain

Gather author, contributor and publisher data on crates in your dependency graph.
Apache License 2.0

Crate fetching is slow #45

Open · Owez opened this issue 3 years ago

Owez commented 3 years ago

I'm doing a test run using the crates command, and as I write this it's still fetching crates. It seems to take quite a long time, even for a crate with relatively few dependencies such as cargo-supply-chain itself (only 79 deps listed), as seen here:

[screenshot: cargo supply-chain crates fetching crates one at a time]

It would be more useful and functional if it fetched these in parallel, especially given the high latency of the crates.io API.

HeroicKatora commented 3 years ago

This is by design. There is no high latency; it's just being cordial to crates.io, which suggests a 2-second delay between individual requests, and there is no endpoint for performing multiple queries in one request. I suppose we could clarify this in the output some more, but 'This will take roughly 2 seconds per crate due to API rate limits' seemed clear enough.
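For illustration, the "2 seconds between individual requests" policy amounts to a simple pacing loop. This is a minimal sketch (not the actual cargo-supply-chain implementation; the `RateLimiter` type and its methods are hypothetical):

```rust
use std::time::{Duration, Instant};

/// Hypothetical pacer: enforces a minimum interval between outgoing requests.
struct RateLimiter {
    min_interval: Duration,
    last_call: Option<Instant>,
}

impl RateLimiter {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_call: None }
    }

    /// Returns how long the caller should sleep before sending the next
    /// request, and records the scheduled send time.
    fn wait_time(&mut self) -> Duration {
        let now = Instant::now();
        let wait = match self.last_call {
            Some(prev) => self.min_interval.saturating_sub(now - prev),
            None => Duration::ZERO, // first request goes out immediately
        };
        self.last_call = Some(now + wait);
        wait
    }
}

fn main() {
    let mut limiter = RateLimiter::new(Duration::from_secs(2));
    // First request: no delay. Second request, issued immediately after:
    // the pacer asks for (nearly) the full 2-second interval.
    assert_eq!(limiter.wait_time(), Duration::ZERO);
    assert!(limiter.wait_time() > Duration::from_millis(1900));
    println!("rate limiter ok");
}
```

A caller would `std::thread::sleep(limiter.wait_time())` before each API request, which is why N crates take roughly 2·N seconds to fetch.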

@Shnatsel Would you vote against a flag that would allow overriding this limit? I'm cautious and probably wouldn't add it but it might be 'fine' if we also enforced that one sets a custom user agent in these situations. (Similar to how nmap allows overriding the default rate to ridiculously high rates if you actually want to barrage your local network but you need two flags and know what you're doing). And we should ask the crates team if doing it once for testing would be of concern to them, or if they could offer a bulk endpoint.

Shnatsel commented 3 years ago

We would need to clear a higher request rate with the crates.io team. Right now the documentation explicitly mentions one request per second for crawlers but says nothing about other uses. I believe docs.rs uses much higher request rates. I asked in the crates.io Discord once but received no reply.

Right now you can speed up this process by running cargo supply-chain update. After that all operations will be performed immediately, but may be out of date by up to 48 hours.
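The "may be out of date by up to 48 hours" policy boils down to a staleness check on the cached data. A minimal sketch, assuming a `cache_is_fresh` helper (the name and 48-hour cutoff are taken from the comment above; the function itself is hypothetical):

```rust
use std::time::{Duration, SystemTime};

/// Maximum acceptable age of the cached data dump: 48 hours.
const MAX_AGE: Duration = Duration::from_secs(48 * 60 * 60);

/// Hypothetical check: is a cached dump with the given modification
/// time still fresh enough to use without re-downloading?
fn cache_is_fresh(modified: SystemTime, now: SystemTime) -> bool {
    match now.duration_since(modified) {
        Ok(age) => age <= MAX_AGE,
        Err(_) => true, // mtime in the future: treat as fresh
    }
}

fn main() {
    let now = SystemTime::now();
    // A dump refreshed an hour ago is usable; one from three days ago is not.
    assert!(cache_is_fresh(now - Duration::from_secs(3600), now));
    assert!(!cache_is_fresh(now - Duration::from_secs(3 * 24 * 3600), now));
    println!("cache freshness check ok");
}
```

In practice the modification time would come from the downloaded dump file's metadata (e.g. `std::fs::metadata(path)?.modified()?`).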

Shnatsel commented 3 years ago

@Owez could you post the output of time cargo supply-chain crates so we could see if it's abnormally high? 80 * 2 = 160 seconds is the expected time to fetch the live data.

Owez commented 3 years ago

0.22s user 0.04s system 0% cpu 2:37.31 total

My timing seems normal, and it does feel like a ~2-second delay per crate. Without a bulk API, the delay unfortunately seems borderline unusable, especially for a crate with a larger dependency count. As a balanced workaround for now, could results be cached automatically (separately from the update command) and deleted if older than 24 hours, rather than downloading the whole crates.io index?

Shnatsel commented 3 years ago

Such granular caching would add complexity, and I am not eager to add it. Although I fear we will have to go down that road sooner or later, if only because people will likely want to view both crates and publishers for the same crate, and it would make no sense to re-download the data.

I see two ways we could solve this without going for a crate-granular caching system:

  1. We could try to contact the crates.io team again and ask what the rate limits for this tool should be, since they only have rate limits specified for crawlers (by which we abide).
  2. The file we need is only ~10 MB, but it is stuffed into a ~250 MB monolithic archive, and we have to download the entire 250 MB. Splitting up the archive would reduce the required amount of traffic for the update command 25x-100x.
HeroicKatora commented 3 years ago

We could estimate the time for fetching them one-by-one and the time for downloading the complete dump, and adaptively choose whether to refresh the cache. I'm slightly concerned about how this would affect scripts that want to rely on either of the two behaviors.
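Such an adaptive choice could be a simple cost comparison. A sketch under assumed numbers (the ~2 s/crate rate limit and ~250 MB dump size come from the thread; the bandwidth figure and all names here are illustrative, not a real design):

```rust
/// The two ways cargo-supply-chain can obtain crate metadata, per the
/// discussion above. (This enum and the cost model are hypothetical.)
#[derive(Debug, PartialEq)]
enum Strategy {
    PerCrate,  // one API request per crate, ~2 s each
    FullDump,  // download the whole crates.io data dump
}

/// Pick whichever option is estimated to finish first.
fn choose_strategy(crate_count: u64, dump_bytes: u64, bytes_per_sec: u64) -> Strategy {
    let per_crate_secs = crate_count * 2; // 2 s/crate rate limit
    let dump_secs = dump_bytes / bytes_per_sec;
    if per_crate_secs <= dump_secs {
        Strategy::PerCrate
    } else {
        Strategy::FullDump
    }
}

fn main() {
    let dump = 250 * 1024 * 1024; // ~250 MB archive
    let fast_link = 10 * 1024 * 1024; // assume 10 MB/s -> dump in ~25 s
    // 79 deps * 2 s = 158 s, so on a fast link the dump wins:
    assert_eq!(choose_strategy(79, dump, fast_link), Strategy::FullDump);
    // For a handful of crates, per-crate fetching is cheaper:
    assert_eq!(choose_strategy(5, dump, fast_link), Strategy::PerCrate);
    println!("strategy selection ok");
}
```

The scripting concern is visible here: the chosen strategy depends on measured bandwidth, so a script could observe either behavior on different machines.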

tarcieri commented 3 years ago

> The file we need is only ~10 MB, but it is stuffed into a ~250 MB monolithic archive, and we have to download the entire 250 MB. Splitting up the archive would reduce the required amount of traffic for the update command 25x-100x.

Could an ETL job work here? If so, you can probably use GitHub Actions for this in a completely automated manner by periodically running a job that commits to a git repo like this:

https://github.com/RustSec/advisory-db/blob/master/.github/workflows/publish-web.yml

That could either be fetched via git (where the periodically running job commits deltas from the previous state) or served over HTTP via GitHub Pages.

HeroicKatora commented 3 years ago

I don't think we want to keep old files around, as that is just duplicate content. However, it does become more attractive after thinking about it for a while. In particular, we could use a GitHub repo, or just a git branch, that has its version history regularly squashed. This would mitigate the growth and give GitHub a chance to remove old files (otherwise we'd already be at a few gigabytes after a year). This would also allow us to use freely available CDNs for GitHub files.
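One common way to implement "regularly squashed history" is an orphan branch: commit the latest snapshot onto a fresh root so old snapshots become unreachable and can be garbage-collected. A self-contained sketch in a throwaway repo (file names and commit messages are made up for the demo):

```shell
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email "bot@example.com"
git config user.name "bot"

# Simulate two periodic data refreshes committed to the branch.
echo v1 > dump.csv; git add dump.csv; git commit -qm "update 1"
echo v2 > dump.csv; git add dump.csv; git commit -qm "update 2"

# Squash: start an orphan branch (no parent history) that carries only
# the current working tree, so earlier snapshots are dropped.
git checkout -q --orphan squashed
git commit -qm "squashed data snapshot"

git rev-list --count squashed   # prints 1: history is a single commit
```

In the automated setup the job would then force-push `squashed` over the published branch, and GitHub would eventually prune the unreachable old blobs.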

tarcieri commented 3 years ago

GitHub Pages will give you a CDN for free.

I was suggesting that if you regenerated a text file (or, ideally, multiple text files) you wouldn't need to leave "old files" around, as the repo would be updated with only the parts that changed.

If you don't want/care about that, you could just force push every time, with GitHub Pages (or even just https://raw.githubusercontent.com/) serving the content over HTTP(s).