chadwhitacre opened 7 years ago
- https://packaging.python.org/mirrors/
- https://pypi.org/
- https://pypi.python.org/mirrors
- https://www.pypi-mirrors.org/
- https://pypi.python.org/pypi/bandersnatch
- https://bitbucket.org/pypa/bandersnatch/
- https://github.com/openstack-infra/pypi-mirror
- https://github.com/pypa/warehouse/
- http://doc.devpi.net/latest/
Ultimately what we want is to pass in a `requirements.txt` and/or a `setup.py` and get back a structure representing the dependencies.
Hmm ... https://graphcommons.com/
Jackpot!
Well, jackpot: https://github.com/anvaka/pm#individual-visualizations. Indexers for 13 ecosystems!
As of https://mail.python.org/pipermail/distutils-sig/2015-January/025683.html we don't have to worry about mutability in PyPI. That means we never need to update info once we have it. It's possible to delete packages but I think we don't want to do that. We want to keep old info around.
And we only need one-week granularity. If we update once a day we'll be well inside our loop.
Where the mutability comes in is that dependencies are subject to a range. If I depend on `Foo >= 1.0`, then when Foo 1.1 comes out my dependency chain will need to be updated.
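For what it's worth, checking whether a new version lands inside a recorded range is cheap with the `packaging` library; a minimal sketch:

```python
# Minimal sketch: does a new upstream release invalidate a recorded
# dependency edge? Uses the `packaging` library (pip install packaging).
from packaging.specifiers import SpecifierSet
from packaging.version import Version

dep_range = SpecifierSet(">=1.0")

# When Foo 1.1 comes out, any edge whose range admits it needs recomputing.
if Version("1.1") in dep_range:
    print("Foo 1.1 satisfies >=1.0; recompute dependency chains on Foo")
```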
What's the data structure we want?
We want to support taking in a list of files of type `text/x-python` (for `setup.py`) and/or `text/plain` (for `requirements.txt`), and returning a single flattened list of dependencies with this info:

- `text/plain`: filename or URL; and line number
- `text/x-python`: filename, URL, or package; parameter (`*_require{s}`); and index in the argument

Take care to handle different files with the same name in the upload.
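A rough sketch of what one record in that flattened list might look like (the field names here are mine, not settled):

```python
# Hypothetical sketch of one flattened dependency record.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Dep:
    name: str                          # the depended-on package
    version_range: str                 # e.g. ">=1.0"
    source: str                        # filename, URL, or package it came from
    line_number: Optional[int] = None  # text/plain: line in requirements.txt
    parameter: Optional[str] = None    # text/x-python: install_requires, etc.
    index: Optional[int] = None        # text/x-python: position in that argument
```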
Two more hops:
Briefly spoke on the phone with @whit537 and summarized the basic guidance for PyPI data analytics that require traversing dependency trees.
Note that right now metadata available via PyPI is limited and may be for some time, as PEP 426 is indefinitely deferred. The deferment section of that PEP points to other PEPs addressing these topics.
In order to crawl PyPI for dependency links, you'll need general metadata for "indexing" as well as the package files themselves to obtain dependency information via `setuptools`/`distutils`.
Recommended approach:
- The XML-RPC `list_packages()` call to retrieve a list of packages registered with PyPI.
- `https://pypi.org/pypi/<package_name>/json` to retrieve metadata and releases for individual packages (https://pypi.org/pypi/requests/json for example).
- `https://pypi.org/pypi/<package_name>/<release_identifier>/json` for individual releases (https://pypi.org/pypi/requests/2.11.1/json for example).

All of the above endpoints and tools, with the exception of the XML-RPC, are designed to minimize impact on the PyPI backend infrastructure, as they are easily cached in our CDN.
A table for `releases`, unique on (`package_manager_id`, `package_id`, `version`), also has `path`, `license`, and `osi_license` columns.
A `releases.deps` column stores upstream dependencies, which are self-references to specific `releases`, along with a `version_range` and `required_by` for each. Resolving a set of dependencies is then a matter of merging the `deps` chains for the input set. The process used to precompute `deps` should be usable on-the-fly for queries.
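For concreteness, a minimal sketch of that table in SQLite (storing `deps` as a JSON column is just one option; a join table would work too):

```python
# Sketch of the releases table described above, in SQLite for concreteness.
import sqlite3

db = sqlite3.connect("gdr.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS releases (
    id                 INTEGER PRIMARY KEY,
    package_manager_id INTEGER NOT NULL,
    package_id         INTEGER NOT NULL,
    version            TEXT NOT NULL,
    path               TEXT,
    license            TEXT,
    osi_license        TEXT,
    -- [{release_id, version_range, required_by}, ...] as JSON
    deps               TEXT,
    UNIQUE (package_manager_id, package_id, version)
);
""")
```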
PyPI gives us a changelog that includes `new release` events. With that, we should be able to recompute `deps` for the subset of affected packages; we'll need a reverse mapping (`release_id`, `required_by`) for that. Basically, when a new release comes out we want to:

1. compute `deps` for that release, and
2. update affected `deps`.

We'll need to keep a table of `packages`, and do the reverse mapping based on that. Something like:
```python
for package in might_depend_on(new_release):
    for release in package.releases:
        release.update_deps(new_release)
```
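Fleshed out slightly against the SQLite sketch above (the `reverse_deps` table and all names here are assumptions):

```python
# Sketch: the (release_id, required_by) reverse mapping, and a
# might_depend_on() built on it. Reuses `db` from the schema sketch.
db.executescript("""
CREATE TABLE IF NOT EXISTS reverse_deps (
    release_id  INTEGER NOT NULL,  -- the depended-on release
    required_by INTEGER NOT NULL   -- the release that depends on it
);
""")

def might_depend_on(package_id):
    """Packages having a release that depends on any release of package_id."""
    rows = db.execute("""
        SELECT DISTINCT r.package_id
          FROM reverse_deps rd
          JOIN releases r ON r.id = rd.required_by
         WHERE rd.release_id IN
               (SELECT id FROM releases WHERE package_id = ?)
    """, (package_id,))
    return [pid for (pid,) in rows]
```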
[edit—what he said :]
So for ETL we can look at bandersnatch ... before the call I had been thinking we'd roll our own (I'd already started based on https://github.com/gratipay/gdr.rocks/issues/2#issuecomment-254511187) that would look like this:
- watch for `new release` events in the changelog XML-RPC

Another point @ewdurbin made on the phone is that some projects vendor in their dependencies (e.g., that's how Requests uses urllib3), and an approach that looks only at `setup.py` and `requirements.txt` won't pick that up.
I've downloaded and run a bit of bandersnatch. I am finding tarballs. I think we should be able to get what we need from that, without having to resort to the JSON API (bandersnatch does fetch JSON under the hood, but afaict it throws it away). The name, version, and license are in the `PKG-INFO` (is `PKG-INFO` guaranteed to exist and have those keys?). With the name we can compute the URL. `osi_license` will be something we compute based on `license`. Dependency info we've already said we need to extract from the tarballs.
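A minimal sketch of pulling those fields out of an sdist, assuming a `.tar.gz` with a `PKG-INFO` inside (it's RFC 822-style, so the stdlib email parser handles it):

```python
# Sketch: read name/version/license from an sdist's PKG-INFO.
import tarfile
from email.parser import Parser

def pkg_info(path):
    with tarfile.open(path, "r:gz") as tar:
        member = next(m for m in tar.getmembers() if m.name.endswith("PKG-INFO"))
        raw = tar.extractfile(member).read().decode("utf-8", "replace")
    meta = Parser().parsestr(raw)
    return meta["Name"], meta["Version"], meta["License"]
```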
One issue with bandersnatch is that it doesn't download tarballs that aren't hosted on PyPI itself (e.g., packages that only point to an external download URL).
Another is that we don't actually need to keep the tarballs around after we process them. Doing so would cost about $50/mo at Digital Ocean. Will we be able to easily convince bandersnatch not to redownload things we've already downloaded and then deleted?
D'oh! :-/
If we can delete old tarballs without tripping up bandersnatch, then we should be able to run a bandersnatch process, and a second process to consume tarballs: ETL them and then throw them away. This second process can run cronishly, offset from bandersnatch, and simply walk the tree looking for tarballs.
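Something like this for the consumer, where `process_tarball()` is a hypothetical ETL step and `web/packages` is bandersnatch's default layout under the mirror directory:

```python
# Sketch: walk the mirror tree, ETL each archive, then delete it.
import os

MIRROR_ROOT = "/mnt/pypi/web/packages"

def process_tarball(path):
    """Hypothetical ETL step: extract metadata, write rows to the db."""

for dirpath, dirnames, filenames in os.walk(MIRROR_ROOT):
    for filename in filenames:
        if filename.endswith((".tar.gz", ".zip", ".whl")):
            path = os.path.join(dirpath, filename)
            process_tarball(path)  # extract what we need...
            os.unlink(path)        # ...then don't pay to store it
```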
I've moved http://gdr.rocks/ over to NYC1 and am attaching a 500 GB volume.
Derp. Volumes are only resizable up.
```
cd /mnt/pypi/
virtualenv .
bin/pip install bandersnatch
bin/bandersnatch -c conf mirror   # first run writes a default conf
vim conf
```

With these edits to `conf`:

```
directory = /mnt/pypi
delete-packages = false
```

Then kick it off in the background:

```
nohup bin/bandersnatch -c conf mirror &
```
`grep 'Storing index page' nohup.out` indicates that it's processed about 1% of records so far. That puts us at about eight hours to finish.
Okay! Let's do some local testing wrt snatching tarballs out from under bandersnatch. Also: ETL.
From reading through `mirror.py`, it looks like we should be able to satisfy bandersnatch with a `status` file that records the serial number we consider ourselves synced through. What is a serial?
Here's what it looks like when I `echo 2229089 > status` and rerun:
```
[gdr]$ bandersnatch -c conf mirror
2016-10-18 16:08:17,248 INFO: bandersnatch/1.11 (CPython 2.7.11-final0, Darwin 14.5.0 x86_64)
2016-10-18 16:08:17,248 INFO: Removing inconsistent todo list.
2016-10-18 16:08:17,249 INFO: Syncing with https://pypi.python.org.
2016-10-18 16:08:17,250 INFO: Current mirror serial: 2229089
2016-10-18 16:08:17,250 INFO: Syncing based on changelog.
```
The weird thing is that on the first run through, it processes packages in alphabetical order by name, not in numeric order by serial. It only writes `status` after a successful sync. On subsequent runs, it uses the changelog RPC. But what is it doing with the serial in that case?
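For reference, the serial-based changelog calls look like this (a sketch against PyPI's XML-RPC API):

```python
# Sketch: poll the changelog by serial, the same RPCs bandersnatch builds on.
import xmlrpc.client

client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
latest = client.changelog_last_serial()

# Entries are (name, version, timestamp, action, serial); "new release"
# shows up among the actions.
for name, version, ts, action, serial in client.changelog_since_serial(2229089):
    print(serial, name, version, action)
```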
It actually sorts alphabetically in either case.
Already in the Ms. Maybe I calculated wrong?
How does it differentiate new releases from old when syncing based on changelog?
> How does it differentiate new releases from old when syncing based on changelog?
Hrm. As I read it, it doesn't. :-/
The set of packages to sync is put on the queue. Workers call `package.sync`. It loads the JSON for the package as a whole, and then iterates over all releases. If a release file doesn't exist or doesn't pass a checksum, it redownloads it. Shucks! :-/
Gonna move on to extraction for now. Will have to come back to that later (probably after MVP).
I've downloaded 229 releases locally to play with.
Yeah, it's gonna be eight hours. ☺️
```
2016-10-18 21:09:48,606 INFO: Storing index page: Zwiki
2016-10-18 21:09:49,062 INFO: Storing index page: a
```

Been 1.5 hours so far, or 18.75%. Disk is 14% full, and the logfile is on its way to 31 MB / 18.75% = 165 MB.
I'm not sure we care about `requirements.txt` inside packages. If that's just there for the developers of the projects to use while developing, then end-users don't really have those as a direct dependency. Insofar as they are true requirements of the package, they'll be sourced into `setup.py`. I think we just focus on `setup.py`.
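Extracting dependencies from a `setup.py` means executing it, though. One hedged approach is to stub out `setup()` and capture its arguments (`extract_install_requires` is a hypothetical helper; a `from distutils.core import setup` would dodge the patch):

```python
# Hypothetical sketch: run a setup.py with setuptools.setup() stubbed out
# and capture install_requires. Assumes the setup.py is trusted enough to run.
import setuptools
from unittest import mock

def extract_install_requires(setup_py_path):
    captured = {}
    def fake_setup(**kwargs):
        captured.update(kwargs)
    with mock.patch.object(setuptools, "setup", fake_setup):
        source = open(setup_py_path).read()
        exec(compile(source, setup_py_path, "exec"),
             {"__name__": "__main__", "__file__": setup_py_path})
    return captured.get("install_requires", [])
```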
Hrm. We only need one tarball per release, but we want exactly one per release. PyPI/bandersnatch doesn't organize tarballs on the filesystem in a way that makes it easy to accomplish this.
I guess we can handle that if we ETL in a tight loop instead of three big loops. When we first crack a tarball we can see if the name and version are already in our database before processing it further.
Because bandersnatch supposedly guarantees that we do have all of the tarballs.
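The (name, version) check itself is trivial given the UNIQUE constraint in the schema sketch above (`already_seen` is a hypothetical helper):

```python
# Sketch: skip an archive whose (package, version) we've already recorded.
def already_seen(db, package_manager_id, package_id, version):
    row = db.execute(
        """SELECT 1 FROM releases
            WHERE package_manager_id = ? AND package_id = ? AND version = ?""",
        (package_manager_id, package_id, version),
    ).fetchone()
    return row is not None
```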
We'll have to unpack wheels as well as zips and tgzs. Anything else, I wonder? Eggs?
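Wheels and eggs are zip files, so a dispatcher only needs two stdlib modules; a sketch:

```python
# Sketch: open any of the archive formats we expect to see on PyPI.
import tarfile
import zipfile

def open_archive(path):
    if path.endswith((".whl", ".egg", ".zip")):
        return zipfile.ZipFile(path)
    if path.endswith((".tar.gz", ".tgz", ".tar.bz2")):
        return tarfile.open(path)  # mode "r" auto-detects compression
    raise ValueError(f"don't know how to unpack {path}")
```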
Half-way done! 39% full disk. 130 MB logfile.
Done! 359 GB, with a 255 MB logfile.
Eff. I re-ran `bandersnatch -c conf mirror` and then Ctrl-C'd it, and now I think I may have lost a `status` file or something else that would prevent a full resync. :-/
Okay, seems like it's not quite that bad:
```
root@gdr:/mnt/pypi# mv nohup.out initial.log
root@gdr:/mnt/pypi# cp todo todo.bak
root@gdr:/mnt/pypi# head nohup.out
2016-10-19 11:07:41,396 INFO: bandersnatch/1.11 (CPython 2.7.12-final0, Linux 4.4.0-42-generic x86_64)
2016-10-19 11:07:41,397 INFO: Status file missing. Starting over.
2016-10-19 11:07:41,397 INFO: Syncing with https://pypi.python.org.
2016-10-19 11:07:41,397 INFO: Current mirror serial: 0
2016-10-19 11:07:41,397 INFO: Resuming interrupted sync from local todo list.
2016-10-19 11:07:41,400 INFO: Trying to reach serial: 2408615
2016-10-19 11:07:41,400 INFO: 856 packages to sync.
2016-10-19 11:07:41,415 INFO: Syncing package: 2 (serial 1386393)
2016-10-19 11:07:41,416 DEBUG: Getting /pypi/2/json (serial 1386393)
2016-10-19 11:07:41,421 INFO: Syncing package: AnywhereLibrary (serial 1060652)
```
Okay! It finished. `status` is 2408615.
Running again ...
```
2016-10-19 11:24:08,149 INFO: bandersnatch/1.11 (CPython 2.7.12-final0, Linux 4.4.0-42-generic x86_64)
2016-10-19 11:24:08,160 INFO: Syncing with https://pypi.python.org.
2016-10-19 11:24:08,160 INFO: Current mirror serial: 2408615
2016-10-19 11:24:08,160 INFO: Syncing based on changelog.
2016-10-19 11:24:08,517 INFO: Trying to reach serial: 2410168
2016-10-19 11:24:08,518 INFO: 385 packages to sync.
2016-10-19 11:24:08,534 INFO: Syncing package: 1-.-8OO-.-681-.-7208_AVAST_Antivirus_Technical_Support_Phone_Number_by_Avast (serial 2409146)
2016-10-19 11:24:08,535 DEBUG: Getting /pypi/1-.-8OO-.-681-.-7208_AVAST_Antivirus_Technical_Support_Phone_Number_by_Avast/json (serial 2409146)
2016-10-19 11:24:08,538 INFO: Syncing package: 1-.-8OO-.-681-.-7208_AVIRA_Antivirus_Technical_Support_Phone_Number_by_Avira (serial 2409147)
2016-10-19 11:24:08,538 DEBUG: Getting /pypi/1-.-8OO-.-681-.-7208_AVIRA_Antivirus_Technical_Support_Phone_Number_by_Avira/json (serial 2409147)
```
Done. Status is now at 2410168.
Re-running.
```
2016-10-19 11:32:19,111 INFO: bandersnatch/1.11 (CPython 2.7.12-final0, Linux 4.4.0-42-generic x86_64)
2016-10-19 11:32:19,120 INFO: Syncing with https://pypi.python.org.
2016-10-19 11:32:19,124 INFO: Current mirror serial: 2410168
2016-10-19 11:32:19,124 INFO: Syncing based on changelog.
2016-10-19 11:32:19,225 INFO: Trying to reach serial: 2410197
2016-10-19 11:32:19,226 INFO: 9 packages to sync.
2016-10-19 11:32:19,227 INFO: Syncing package: Kuyruk (serial 2410182)
2016-10-19 11:32:19,227 DEBUG: Getting /pypi/Kuyruk/json (serial 2410182)
2016-10-19 11:32:19,229 INFO: Syncing package: component_builder (serial 2410197)
2016-10-19 11:32:19,229 DEBUG: Getting /pypi/component_builder/json (serial 2410197)
```
Resolving dependencies by running `pip install -r requirements.txt` and then sniffing the results is a reeeaalllyy inefficient way to go about it. Better is going to be loading each package index into a database. Let's do this once, starting with PyPI.

First step is to basically download PyPI.