repi opened this issue (closed 3 years ago)

Awesome project! I've been wanting a tool with this type of functionality for a while, and I'm really glad I ran into it.

We have a fairly large project with, ahem, 548 crate dependencies, so the 2-second delay between fetching data from crates.io for each crate really adds up: over 18 minutes for a full run. Are there any paths to speeding this up?
According to https://crates.io/data-access and https://crates.io/policies#crawlers, crates.io requires crawlers to request no more than 1 page per second. There is no API access policy for non-crawler usage, so I prefer to err on the side of caution.
The crates.io API is undocumented, so there might be batched queries I'm not aware of. So far I've written this tool by reverse-engineering the web UI.
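For reference, here is a minimal sketch of what a policy-compliant query loop could look like. It assumes the `ureq` HTTP client and the `/api/v1/crates/{name}/owners` endpoint the web UI appears to use; none of this is necessarily how cargo-supply-chain is implemented.

```rust
use std::thread::sleep;
use std::time::Duration;

// Fetch the raw owners JSON for one crate from the (undocumented) API.
// The endpoint path is an assumption based on what the web UI requests.
fn fetch_owners(crate_name: &str) -> Result<String, Box<dyn std::error::Error>> {
    let url = format!("https://crates.io/api/v1/crates/{}/owners", crate_name);
    let body = ureq::get(&url)
        // The crawler policy asks crawlers to identify themselves.
        .set("User-Agent", "my-supply-chain-tool (contact@example.com)")
        .call()?
        .into_string()?;
    Ok(body)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    for name in ["serde", "rand", "anyhow"] {
        let owners_json = fetch_owners(name)?;
        println!("{name}: {owners_json}");
        sleep(Duration::from_secs(1)); // stay at or below one request per second
    }
    Ok(())
}
```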
It is possible to load the data from the crates.io database dump. The relevant files are only 3.5 MB total when uncompressed. However, they are inside a 250 MB archive, which complicates access, and the dump is only updated once a day.
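To make the access problem concrete, here is a rough sketch of streaming just the needed tables out of the archive rather than unpacking all of it. It assumes the `flate2` and `tar` crates and a `<date>/data/<table>.csv` layout inside the dump; the exact table names are my guess, not verified against the dump.

```rust
use std::fs::File;
use std::io;
use flate2::read::GzDecoder;
use tar::Archive;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Tables that plausibly carry publisher data (names are an assumption).
    let wanted = ["crates.csv", "users.csv", "teams.csv", "crate_owners.csv"];

    // Assumes the archive was downloaded beforehand from
    // https://static.crates.io/db-dump.tar.gz
    let dump = File::open("db-dump.tar.gz")?;
    let mut archive = Archive::new(GzDecoder::new(dump));

    // Stream through the ~250 MB archive, writing out only the few MB we need.
    for entry in archive.entries()? {
        let mut entry = entry?;
        let path = entry.path()?.into_owned();
        if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
            if wanted.contains(&name) {
                let mut out = File::create(name)?;
                io::copy(&mut entry, &mut out)?;
            }
        }
    }
    Ok(())
}
```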
I'm going to talk to the crates.io team about this. Is operating on data that's updated daily good enough for your use cases, or do you need live data?
Thanks! I think daily data would work just fine for our usage: running the CLI manually to review and look over dependencies.
If one were to later automate this in CI as a security review step, potentially through our cargo-deny setup with specific allow/deny lists of individual and team publishers, then having correct data from the time it is run could be more important. But then one could also just run in the current rate-limited mode.
I've raised the topic in the crates.io team Discord; you're welcome to join the conversation. See https://www.rust-lang.org/governance/teams/crates-io
Automation would likely require structured output as well. The tool is only a week old, so we haven't gotten there yet :smile:
Hah, that is perfectly fine; just glad you built the tool and that I found it :)
I never heard back from the crates.io team about whether the scraping limits apply to cargo-supply-chain or not, but we now have fairly mature infrastructure for using database dumps, so I'm going to go ahead and close this.
Also, I would be interested to hear what use cases Embark has for the tool, to understand what kind of facilities would be interesting to users.
There are a gazillion things we could do, from structured output to a cargo-deny/cargo-crev whitelist/blacklist model to notifications about changes and numerous other features. But I don't want to sink effort into any of those until there is a clear use case.
Nice, I'll test the current database dump support.
Not fully sure yet what our use cases will or can be, but now that it is faster with the dumps we should be able to experiment more. So it makes perfect sense not to go too deep into any other specific implementation or optimization until one has more clarity around this. Thanks!
@repi the latest `cargo supply-chain` supports JSON output, so you can implement custom logic on top of it and/or integrate with `cargo deny`. No crate yet, just the CLI, but that could possibly be changed.
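For illustration, here is a sketch of the kind of custom CI logic this enables, along the lines of the allow/deny list idea above. The tool's JSON schema isn't documented here, so the field names (`crates`, `publishers`, `login`) and the `cargo supply-chain json` invocation in the comment are assumptions to adapt, with `serde_json` doing the parsing.

```rust
use std::collections::HashSet;
use std::fs;
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical allowlist of trusted crates.io logins.
    let allowlist: HashSet<&str> = ["alice", "bob"].into_iter().collect();

    // Produced beforehand with something like:
    //   cargo supply-chain json > supply-chain.json
    let data: Value = serde_json::from_str(&fs::read_to_string("supply-chain.json")?)?;

    // Field names below are placeholders for whatever the real schema uses.
    let mut unknown = Vec::new();
    if let Some(crates) = data["crates"].as_array() {
        for krate in crates {
            if let Some(publishers) = krate["publishers"].as_array() {
                for p in publishers {
                    if let Some(login) = p["login"].as_str() {
                        if !allowlist.contains(login) {
                            unknown.push(format!("{} (crate {})", login, krate["name"]));
                        }
                    }
                }
            }
        }
    }

    if unknown.is_empty() {
        Ok(())
    } else {
        for entry in &unknown {
            eprintln!("publisher not on allowlist: {}", entry);
        }
        std::process::exit(1); // fail the CI step
    }
}
```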
I'm still interested in hearing about the use cases you may have for the tool; we might want to support some of them in `cargo supply-chain` itself.