rust-secure-code / cargo-supply-chain

Gather author, contributor and publisher data on crates in your dependency graph.
Apache License 2.0

Faster crates.io fetching? #8

Closed repi closed 3 years ago

repi commented 3 years ago

Awesome project, been wanting a tool with this type of functionality for a while and really glad I ran into it!

We have a fairly large project with, ahem, 548 crate dependencies, so the 2 second delay between fetching data for each crate from crates.io does add up quite a bit!

Fetching publisher info from crates.io
This will take roughly 2 seconds per crate due to API rate limits
Fetching data for "addr2line" (0/548)
Fetching data for "adler" (1/548)

Are there any paths to speeding this up?

Shnatsel commented 3 years ago

According to https://crates.io/data-access and https://crates.io/policies#crawlers, crates.io requires crawlers to request no more than 1 page per second. There is no API access policy for non-crawler usage, so I prefer to err on the side of caution.
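To make the cost of that limit concrete, here is a minimal sketch of a sequential throttle and the waiting time it implies. It is not the tool's actual code: `Throttle` and `min_total_wait` are hypothetical names introduced for illustration, and only the standard library is used.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: keeps consecutive requests at least `min_interval` apart.
pub struct Throttle {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl Throttle {
    pub fn new(min_interval: Duration) -> Self {
        Throttle { min_interval, last_request: None }
    }

    /// Blocks until the next request is allowed, then records the request time.
    /// The first call returns immediately.
    pub fn wait(&mut self) {
        if let Some(last) = self.last_request {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                thread::sleep(self.min_interval - elapsed);
            }
        }
        self.last_request = Some(Instant::now());
    }
}

/// Lower bound on the time spent waiting for `crate_count` sequential
/// requests. The first request is not delayed, hence the `- 1`.
pub fn min_total_wait(crate_count: u32, min_interval: Duration) -> Duration {
    min_interval * crate_count.saturating_sub(1)
}
```

With the numbers from this thread, `min_total_wait(548, Duration::from_secs(2))` comes to 1094 seconds, a little over 18 minutes of waiting before any network time is counted. At the 1 request per second allowed for crawlers it would still be about 9 minutes.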

The crates.io API is undocumented, so there might be batched queries I'm not aware of. So far I've reverse-engineered the web UI to write this tool.

It is possible to load the data from the crates.io database dump. The relevant files are only 3.5 MB total when uncompressed. However, they are inside a 250 MB archive, which complicates access. And they are only updated once a day.

I'm going to talk to the crates.io team about this. Is operating on data that's updated daily good enough for your use cases, or do you need live data?
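Since the dump only changes once a day, a local copy only needs to be re-downloaded when it is older than that. A minimal sketch of such a freshness check follows. The function and constant names are hypothetical and this is not taken from the tool itself.

```rust
use std::time::{Duration, SystemTime};

/// Assumption for illustration: treat a copy as fresh for 24 hours,
/// matching the "updated once a day" cadence mentioned above.
pub const DUMP_REFRESH_INTERVAL: Duration = Duration::from_secs(24 * 60 * 60);

/// Returns true if a dump downloaded at `downloaded_at` should be
/// fetched again, given the current time `now`.
pub fn cache_is_stale(downloaded_at: SystemTime, now: SystemTime, max_age: Duration) -> bool {
    match now.duration_since(downloaded_at) {
        Ok(age) => age > max_age,
        // The download timestamp is in the future (e.g. the clock was
        // changed); treat the copy as stale to be safe.
        Err(_) => true,
    }
}
```

Passing `now` in as a parameter, rather than calling `SystemTime::now()` inside, keeps the check easy to test.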

repi commented 3 years ago

Thanks! I think daily data would work just fine for our usage, which is running the CLI manually to review and look over dependencies.

If one later automated this in CI as a security review step, potentially through our cargo-deny, with specific allow/deny lists of individual and group publishers, then having data that is correct at the time of the run could be more important. But then one could also just run in the current rate-limited mode.
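The allow/deny idea could look roughly like the sketch below. Everything here is hypothetical: neither cargo-supply-chain nor cargo-deny is known to expose these types, and the publisher names are plain strings chosen for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical result of checking one crate's publishers.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Every publisher is on the allow list.
    Allowed,
    /// At least one publisher is on the deny list (first match reported).
    Denied(String),
    /// No denied publisher, but at least one is on neither list.
    Unknown(String),
}

/// Checks a crate's publishers against the lists. Deny entries take
/// precedence over allow entries. A crate with no publishers at all is
/// reported as `Allowed`, which a real policy may want to handle differently.
pub fn check_publishers(
    publishers: &[&str],
    allow: &HashSet<&str>,
    deny: &HashSet<&str>,
) -> Verdict {
    for p in publishers {
        if deny.contains(p) {
            return Verdict::Denied((*p).to_string());
        }
    }
    for p in publishers {
        if !allow.contains(p) {
            return Verdict::Unknown((*p).to_string());
        }
    }
    Verdict::Allowed
}
```

In CI, a `Denied` verdict would presumably fail the build, while `Unknown` could either fail or just warn, depending on how strict the review policy is meant to be.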

Shnatsel commented 3 years ago

I've raised the topic in the crates.io team Discord, you're welcome to join the conversation. See https://www.rust-lang.org/governance/teams/crates-io

Automation would likely require structured output as well. The tool is only a week old so we didn't get there yet :smile:

repi commented 3 years ago

Hah that is perfectly fine, just glad you built the tool and I found it :)

Shnatsel commented 3 years ago

#12 might be of use - it enables the use of crates.io database dumps.

Shnatsel commented 3 years ago

I never heard back from the crates.io team about whether the scraping limits apply to cargo-supply-chain or not, but we have a fairly mature infrastructure for using database dumps now, so I'm going to go ahead and close this.

Shnatsel commented 3 years ago

Also, I would be interested to hear what use cases Embark has for the tool, to understand what kind of facilities would be interesting to users.

There's a gazillion things we could do, from structured output to a cargo-deny/cargo-crev whitelist/blacklist model to notifications about changes, and numerous other features. But I don't want to sink effort into any of those until there is a clear use case.

repi commented 3 years ago

Nice, I'll test the current database dump support.

Not fully sure yet what our use cases will or can be, but now that it is faster with the dumps we should be able to experiment more. So it makes perfect sense not to go too deep into any other specific implementation or optimization until there is more clarity around this. Thanks!

Shnatsel commented 3 years ago

@repi the latest cargo supply-chain supports JSON output, so you can implement custom logic on top of it and/or integrate with cargo deny. No crate yet - just the CLI, but that could possibly be changed.

I'm still interested in hearing about the use cases you may have for the tool - we might want to support some of them in cargo supply-chain itself.