Open AMDmi3 opened 4 years ago
Thanks for the feature request. I think one thing missing from this discussion is how up-to-date this dump would need to be to satisfy your use case. Would it need to be updated every month? week? day? hour? less?
The more up to date the better, obviously, so I'd prefer an hourly dump in order to notify users of new releases as soon as possible, given that Repology itself has a lag of around an hour. But if that's not possible for some reason, daily or something in between would be better than nothing.
There is now https://pypicache.repology.org/
How could that be integrated into the warehouse API? Uploading daily and hourly historical archives of metadata in https://jsonlines.org/ format, covering the packages uploaded during each period, doesn't sound too complicated. A separate process could then provide real-time streaming and sync for hourly updates (akin to a blockchain).
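As a rough illustration of the hourly JSON Lines archives suggested above, here is a minimal sketch. It is not how warehouse works: the record fields (`name`, `version`, `upload_time`), the archive naming scheme and the function names are all assumptions made up for the example.

```python
import gzip
import json
from collections import defaultdict
from datetime import datetime
from pathlib import Path


def hourly_archive_name(upload_time: str) -> str:
    """Map an ISO 8601 timestamp to an archive file name (hypothetical scheme)."""
    ts = datetime.fromisoformat(upload_time)
    return ts.strftime("pypi-%Y-%m-%dT%H.jsonl.gz")


def write_hourly_archives(records, out_dir):
    """Group records by upload hour; write one gzip-compressed JSON Lines file per hour."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    buckets = defaultdict(list)
    for record in records:
        # "upload_time" is an assumed field name, not a confirmed warehouse field.
        buckets[hourly_archive_name(record["upload_time"])].append(record)

    written = []
    for name, items in sorted(buckets.items()):
        path = out_dir / name
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            for item in items:
                # JSON Lines: exactly one JSON document per line.
                fh.write(json.dumps(item, sort_keys=True) + "\n")
        written.append(path)
    return written


def read_archive(path):
    """Read a gzip-compressed JSON Lines archive back into a list of records."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

A consumer would only need to fetch the archives for the hours it has not seen yet and append them to its local copy, which is what makes the line-per-record format convenient for incremental sync.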
Where does warehouse maintain its "cron jobs"?
Yes, I am a researcher in the field of software engineering. Such a data dump file is very useful for us. Even if it is updated annually, I very much hope that such data can be provided.
What's the problem this feature will solve?
As discussed before in #347, #1478 and #7403, there's a need for a metadata dump of all packages as a single file. Currently the only way to obtain this data is through the Google BigQuery datasets, which is not an option for anyone who has no Google account, does not want to disclose personal data to Google to create one, cannot create one because of verification problems, and/or does not want to impose the same limitations on their users (e.g. someone who wants to distribute an application that works with PyPI metadata out of the box, without requiring the user's Google account credentials).
Describe the solution you'd like
A single file (probably compressed JSON) containing a dump of metadata for all PyPI packages. Probably in the same format as the JSON API returns, just with the data for all packages aggregated into a single array.
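To make the requested format concrete, here is a minimal sketch of writing and consuming such a dump. It assumes each array element has the shape of a per-project JSON API response with `info.name` and `info.version`; the function names and the file layout are hypothetical and only illustrate the idea.

```python
import gzip
import json


def write_dump(packages, path):
    """Write an iterable of per-package metadata dicts as one gzip-compressed JSON array."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write("[")
        for i, pkg in enumerate(packages):
            if i:
                fh.write(",")
            fh.write("\n")
            # Stream one element at a time so the whole dump never sits in memory.
            json.dump(pkg, fh)
        fh.write("\n]\n")


def latest_versions(path):
    """Load the dump and map each project name to its current version."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        dump = json.load(fh)
    # Assumes every element carries "info" with "name" and "version" keys.
    return {pkg["info"]["name"]: pkg["info"]["version"] for pkg in dump}
```

With a file like this, a consumer such as Repology could obtain the latest version of every project from a single download instead of one API request per project.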
Additional context
For instance, I need such a dump for Repology. After the PyPI simple index was removed in ~2017, I have no way to get the latest versions of all modules from PyPI at once. Using BigQuery is not possible for the reasons described above, and iterating over all projects via the API is too slow, inconsistent and unreliable for the purpose.