chadwhitacre opened 7 years ago
- https://packaging.python.org/mirrors/
- https://pypi.org/
- https://pypi.python.org/mirrors
- https://www.pypi-mirrors.org/
- https://pypi.python.org/pypi/bandersnatch
- https://bitbucket.org/pypa/bandersnatch/
- https://github.com/openstack-infra/pypi-mirror
- https://github.com/pypa/warehouse/
- http://doc.devpi.net/latest/
Ultimately what we want is to pass in a `requirements.txt` and/or a `setup.py` and get back a structure representing the dependencies.
Hmm ... https://graphcommons.com/
Jackpot!
Well, jackpot: https://github.com/anvaka/pm#individual-visualizations. Indexers for 13 ecosystems!
As of https://mail.python.org/pipermail/distutils-sig/2015-January/025683.html we don't have to worry about mutability in PyPI. That means we never need to update info once we have it. It's possible to delete packages but I think we don't want to do that. We want to keep old info around.
And we only need one-week granularity. If we update once a day we'll be well inside our loop.
Where the mutability comes in is that dependencies are subject to a range. If I depend on `Foo >= 1.0`, then when Foo 1.1 comes out my dependency chain will need to be updated.
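For what it's worth, checking whether a new version lands inside a recorded range is cheap with the `packaging` library; a minimal sketch:

```python
# Minimal sketch: does a new upstream release invalidate a recorded
# dependency edge? Uses the `packaging` library (pip install packaging).
from packaging.specifiers import SpecifierSet
from packaging.version import Version

dep_range = SpecifierSet(">=1.0")

# When Foo 1.1 comes out, any edge whose range admits it needs recomputing.
if Version("1.1") in dep_range:
    print("Foo 1.1 satisfies >=1.0; recompute dependency chains on Foo")
```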
What's the data structure we want?
We want to support taking in a list of files of type `text/x-python` (for `setup.py`) and/or `text/plain` (for `requirements.txt`), and returning a single flattened list of dependencies with this info:

- `text/plain`: filename or URL; and line number
- `text/x-python`: filename, URL, or package; parameter (`*_require{s}`); and index in the argument

Take care to handle different files with the same name in the upload.
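A rough sketch of what one record in that flattened list might look like (the field names here are mine, not settled):

```python
# Hypothetical sketch of one flattened dependency record.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Dep:
    name: str                          # the depended-on package
    version_range: str                 # e.g. ">=1.0"
    source: str                        # filename, URL, or package it came from
    line_number: Optional[int] = None  # text/plain: line in requirements.txt
    parameter: Optional[str] = None    # text/x-python: install_requires, etc.
    index: Optional[int] = None        # text/x-python: position in that argument
```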
Two more hops:
Briefly spoke on the phone with @whit537 and summarized the basic guidance for PyPI data analytics that require traversing dependency trees.
Note that right now metadata available via PyPI is limited and may be for some time, as PEP 426 is indefinitely deferred. The deferment section of that PEP points to other PEPs addressing these topics.
In order to crawl PyPI for dependency links, you'll need general metadata for "indexing" as well as the package files themselves to obtain dependency information via `setuptools`/`distutils`.
Recommended approach:
- The XML-RPC `list_packages()` call to retrieve a list of packages registered with PyPI.
- `https://pypi.org/pypi/<package_name>/json` to retrieve metadata and releases for individual packages (https://pypi.org/pypi/requests/json for example).
- `https://pypi.org/pypi/<package_name>/<release_identifier>/json` for individual releases (https://pypi.org/pypi/requests/2.11.1/json for example).

All of the above endpoints and tools, with the exception of the XML-RPC, are designed to minimize impact on the PyPI backend infrastructure, as they are easily cached in our CDN.
A table for `releases`, unique on (`package_manager_id`, `package_id`, `version`), also has `path`, `license`, and `osi_license` columns.
A `releases.deps` column stores upstream dependencies, which are self-references to specific `releases`, along with a `version_range` and `required_by` for each. Resolving a set of dependencies is then a matter of merging the `deps` chains for the input set. The process used to precompute `deps` should be usable on-the-fly for queries.
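For concreteness, a minimal sketch of that table in SQLite (storing `deps` as a JSON column is just one option; a join table would work too):

```python
# Sketch of the releases table described above, in SQLite for concreteness.
import sqlite3

db = sqlite3.connect("gdr.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS releases (
    id                 INTEGER PRIMARY KEY,
    package_manager_id INTEGER NOT NULL,
    package_id         INTEGER NOT NULL,
    version            TEXT NOT NULL,
    path               TEXT,
    license            TEXT,
    osi_license        TEXT,
    -- [{release_id, version_range, required_by}, ...] as JSON
    deps               TEXT,
    UNIQUE (package_manager_id, package_id, version)
);
""")
```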
PyPI gives us a changelog that includes `new release` events. With that, we should be able to recompute `deps` for the subset of affected packages; we'll need a reverse mapping (`release_id`, `required_by`) for that. Basically, when a new release comes out we want to:

1. compute `deps` for that release, and
2. update affected `deps`.

We'll need to keep a table of `packages`, and do the reverse mapping based on that. Something like:
```python
for package in might_depend_on(new_release):
    for release in package.releases:
        release.update_deps(new_release)
```
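Fleshed out slightly against the SQLite sketch above (the `reverse_deps` table and all names here are assumptions):

```python
# Sketch: the (release_id, required_by) reverse mapping, and a
# might_depend_on() built on it. Reuses `db` from the schema sketch.
db.executescript("""
CREATE TABLE IF NOT EXISTS reverse_deps (
    release_id  INTEGER NOT NULL,  -- the depended-on release
    required_by INTEGER NOT NULL   -- the release that depends on it
);
""")

def might_depend_on(package_id):
    """Packages having a release that depends on any release of package_id."""
    rows = db.execute("""
        SELECT DISTINCT r.package_id
          FROM reverse_deps rd
          JOIN releases r ON r.id = rd.required_by
         WHERE rd.release_id IN
               (SELECT id FROM releases WHERE package_id = ?)
    """, (package_id,))
    return [pid for (pid,) in rows]
```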
[edit—what he said :]
So for ETL we can look at bandersnatch ... before the call I had been thinking we'd roll our own (I'd already started based on https://github.com/gratipay/gdr.rocks/issues/2#issuecomment-254511187) that would look like this:
- watch for `new release` events in the changelog XML-RPC

Another point @ewdurbin made on the phone is that some projects vendor in their dependencies (e.g., that's how Requests uses urllib3), and an approach that looks only at `setup.py` and `requirements.txt` won't pick that up.
I've downloaded and run a bit of bandersnatch. I am finding tarballs. I think we should be able to get what we need from that, without having to resort to the JSON API (bandersnatch does fetch JSON under the hood, but afaict it throws it away). The name, version, and license are in the `PKG-INFO` (is `PKG-INFO` guaranteed to exist and have those keys?). With the name we can compute the URL. `osi_license` will be something we compute based on `license`. Dependency info we've already said we need to extract from the tarballs.
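A minimal sketch of pulling those fields out of an sdist, assuming a `.tar.gz` with a `PKG-INFO` inside (it's RFC 822-style, so the stdlib email parser handles it):

```python
# Sketch: read name/version/license from an sdist's PKG-INFO.
import tarfile
from email.parser import Parser

def pkg_info(path):
    with tarfile.open(path, "r:gz") as tar:
        member = next(m for m in tar.getmembers() if m.name.endswith("PKG-INFO"))
        raw = tar.extractfile(member).read().decode("utf-8", "replace")
    meta = Parser().parsestr(raw)
    return meta["Name"], meta["Version"], meta["License"]
```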
One issue with bandersnatch is that it doesn't download tarballs that aren't hosted on PyPI itself (e.g., packages that only point to an external download URL).
Another is that we don't actually need to keep the tarballs around after we process them. Doing so would cost about $50/mo at Digital Ocean. Will we be able to easily convince bandersnatch not to redownload things we've already downloaded and then deleted?
D'oh! :-/
If we can delete old tarballs without tripping up bandersnatch, then we should be able to run a bandersnatch process, and a second process to consume tarballs: ETL them and then throw them away. This second process can run cronishly, offset from bandersnatch, and simply walk the tree looking for tarballs.
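Something like this for the consumer, where `process_tarball()` is a hypothetical ETL step and `web/packages` is bandersnatch's default layout under the mirror directory:

```python
# Sketch: walk the mirror tree, ETL each archive, then delete it.
import os

MIRROR_ROOT = "/mnt/pypi/web/packages"

def process_tarball(path):
    """Hypothetical ETL step: extract metadata, write rows to the db."""

for dirpath, dirnames, filenames in os.walk(MIRROR_ROOT):
    for filename in filenames:
        if filename.endswith((".tar.gz", ".zip", ".whl")):
            path = os.path.join(dirpath, filename)
            process_tarball(path)  # extract what we need...
            os.unlink(path)        # ...then don't pay to store it
```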
I've moved http://gdr.rocks/ over to NYC1 and am attaching a 500 GB volume.
Derp. Volumes are only resizable up.
```
cd /mnt/pypi/
virtualenv .
bin/pip install bandersnatch
bin/bandersnatch -c conf mirror   # first run writes a default conf
vim conf
```

With these edits to `conf`:

```
directory = /mnt/pypi
delete-packages = false
```

Then kick it off in the background:

```
nohup bin/bandersnatch -c conf mirror &
```
`grep 'Storing index page' nohup.out` indicates that it's processed about 1% of records so far. That puts us at about eight hours to finish.
Okay! Let's do some local testing wrt snatching tarballs out from under bandersnatch. Also: ETL.
From reading through `mirror.py`, it looks like we should be able to satisfy bandersnatch with a `status` file that records the serial number we consider ourselves synced through. What is a serial?
Here's what it looks like when I `echo 2229089 > status` and rerun:
```
[gdr]$ bandersnatch -c conf mirror
2016-10-18 16:08:17,248 INFO: bandersnatch/1.11 (CPython 2.7.11-final0, Darwin 14.5.0 x86_64)
2016-10-18 16:08:17,248 INFO: Removing inconsistent todo list.
2016-10-18 16:08:17,249 INFO: Syncing with https://pypi.python.org.
2016-10-18 16:08:17,250 INFO: Current mirror serial: 2229089
2016-10-18 16:08:17,250 INFO: Syncing based on changelog.
```
The weird thing is that on the first run through, it processes packages in alphabetical order by name, not in numeric order by serial. It only writes `status` after a successful sync. On subsequent runs, it uses the changelog RPC. But what is it doing with the serial in that case?
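For reference, the serial-based changelog calls look like this (a sketch against PyPI's XML-RPC API):

```python
# Sketch: poll the changelog by serial, the same RPCs bandersnatch builds on.
import xmlrpc.client

client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
latest = client.changelog_last_serial()

# Entries are (name, version, timestamp, action, serial); "new release"
# shows up among the actions.
for name, version, ts, action, serial in client.changelog_since_serial(2229089):
    print(serial, name, version, action)
```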
It actually sorts alphabetically in either case.
Already in the Ms. Maybe I calculated wrong?
How does it differentiate new releases from old when syncing based on changelog?
> How does it differentiate new releases from old when syncing based on changelog?
Hrm. As I read it, it doesn't. :-/
The set of packages to sync is put on the queue. Workers call `package.sync`. It loads the JSON for the package as a whole, and then iterates over all releases. If a release file doesn't exist or doesn't pass a checksum, it redownloads it. Shucks! :-/
Gonna move on to extraction for now. Will have to come back to that later (probably after MVP).
I've downloaded 229 releases locally to play with.
Yeah, it's gonna be eight hours. ☺️
```
2016-10-18 21:09:48,606 INFO: Storing index page: Zwiki
2016-10-18 21:09:49,062 INFO: Storing index page: a
```

Been 1.5 hours so far, or 18.75%. Disk is 14% full, and the logfile is on its way to 31 MB / 18.75% = 165 MB.
I'm not sure we care about `requirements.txt` inside packages. If that's just there for the developers of the projects to use while developing, then end-users don't really have those as a direct dependency. Insofar as they are true requirements of the package, they'll be sourced into `setup.py`. I think we just focus on `setup.py`.
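Extracting dependencies from a `setup.py` means executing it, though. One hedged approach is to stub out `setup()` and capture its arguments (`extract_install_requires` is a hypothetical helper; a `from distutils.core import setup` would dodge the patch):

```python
# Hypothetical sketch: run a setup.py with setuptools.setup() stubbed out
# and capture install_requires. Assumes the setup.py is trusted enough to run.
import setuptools
from unittest import mock

def extract_install_requires(setup_py_path):
    captured = {}
    def fake_setup(**kwargs):
        captured.update(kwargs)
    with mock.patch.object(setuptools, "setup", fake_setup):
        source = open(setup_py_path).read()
        exec(compile(source, setup_py_path, "exec"),
             {"__name__": "__main__", "__file__": setup_py_path})
    return captured.get("install_requires", [])
```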
Hrm. We only need one tarball per release, but we want exactly one per release. PyPI/bandersnatch doesn't organize tarballs on the filesystem in a way that makes it easy to accomplish this.
I guess we can handle that if we ETL in a tight loop instead of three big loops. When we first crack a tarball we can see if the name and version are already in our database before processing it further.
Because bandersnatch supposedly guarantees that we do have all of the tarballs.
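The (name, version) check itself is trivial given the UNIQUE constraint in the schema sketch above (`already_seen` is a hypothetical helper):

```python
# Sketch: skip an archive whose (package, version) we've already recorded.
def already_seen(db, package_manager_id, package_id, version):
    row = db.execute(
        """SELECT 1 FROM releases
            WHERE package_manager_id = ? AND package_id = ? AND version = ?""",
        (package_manager_id, package_id, version),
    ).fetchone()
    return row is not None
```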
We'll have to unpack wheels as well as zips and tgzs. Anything else, I wonder? Eggs?
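Wheels and eggs are zip files, so a dispatcher only needs two stdlib modules; a sketch:

```python
# Sketch: open any of the archive formats we expect to see on PyPI.
import tarfile
import zipfile

def open_archive(path):
    if path.endswith((".whl", ".egg", ".zip")):
        return zipfile.ZipFile(path)
    if path.endswith((".tar.gz", ".tgz", ".tar.bz2")):
        return tarfile.open(path)  # mode "r" auto-detects compression
    raise ValueError(f"don't know how to unpack {path}")
```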
Half-way done! 39% full disk. 130 MB logfile.
Done! 359 GB, with a 255 MB logfile.
Eff. I re-ran `bandersnatch -c conf mirror` and then Ctrl-C'd it, and now I think I may have lost a `status` file or something else that would prevent a full resync. :-/
Okay, seems like it's not quite that bad:
```
root@gdr:/mnt/pypi# mv nohup.out initial.log
root@gdr:/mnt/pypi# cp todo todo.bak
root@gdr:/mnt/pypi# head nohup.out
2016-10-19 11:07:41,396 INFO: bandersnatch/1.11 (CPython 2.7.12-final0, Linux 4.4.0-42-generic x86_64)
2016-10-19 11:07:41,397 INFO: Status file missing. Starting over.
2016-10-19 11:07:41,397 INFO: Syncing with https://pypi.python.org.
2016-10-19 11:07:41,397 INFO: Current mirror serial: 0
2016-10-19 11:07:41,397 INFO: Resuming interrupted sync from local todo list.
2016-10-19 11:07:41,400 INFO: Trying to reach serial: 2408615
2016-10-19 11:07:41,400 INFO: 856 packages to sync.
2016-10-19 11:07:41,415 INFO: Syncing package: 2 (serial 1386393)
2016-10-19 11:07:41,416 DEBUG: Getting /pypi/2/json (serial 1386393)
2016-10-19 11:07:41,421 INFO: Syncing package: AnywhereLibrary (serial 1060652)
```
Okay! It finished. `status` is 2408615.
Running again ...
```
2016-10-19 11:24:08,149 INFO: bandersnatch/1.11 (CPython 2.7.12-final0, Linux 4.4.0-42-generic x86_64)
2016-10-19 11:24:08,160 INFO: Syncing with https://pypi.python.org.
2016-10-19 11:24:08,160 INFO: Current mirror serial: 2408615
2016-10-19 11:24:08,160 INFO: Syncing based on changelog.
2016-10-19 11:24:08,517 INFO: Trying to reach serial: 2410168
2016-10-19 11:24:08,518 INFO: 385 packages to sync.
2016-10-19 11:24:08,534 INFO: Syncing package: 1-.-8OO-.-681-.-7208_AVAST_Antivirus_Technical_Support_Phone_Number_by_Avast (serial 2409146)
2016-10-19 11:24:08,535 DEBUG: Getting /pypi/1-.-8OO-.-681-.-7208_AVAST_Antivirus_Technical_Support_Phone_Number_by_Avast/json (serial 2409146)
2016-10-19 11:24:08,538 INFO: Syncing package: 1-.-8OO-.-681-.-7208_AVIRA_Antivirus_Technical_Support_Phone_Number_by_Avira (serial 2409147)
2016-10-19 11:24:08,538 DEBUG: Getting /pypi/1-.-8OO-.-681-.-7208_AVIRA_Antivirus_Technical_Support_Phone_Number_by_Avira/json (serial 2409147)
```
Done. Status is now at 2410168.
Re-running.
```
2016-10-19 11:32:19,111 INFO: bandersnatch/1.11 (CPython 2.7.12-final0, Linux 4.4.0-42-generic x86_64)
2016-10-19 11:32:19,120 INFO: Syncing with https://pypi.python.org.
2016-10-19 11:32:19,124 INFO: Current mirror serial: 2410168
2016-10-19 11:32:19,124 INFO: Syncing based on changelog.
2016-10-19 11:32:19,225 INFO: Trying to reach serial: 2410197
2016-10-19 11:32:19,226 INFO: 9 packages to sync.
2016-10-19 11:32:19,227 INFO: Syncing package: Kuyruk (serial 2410182)
2016-10-19 11:32:19,227 DEBUG: Getting /pypi/Kuyruk/json (serial 2410182)
2016-10-19 11:32:19,229 INFO: Syncing package: component_builder (serial 2410197)
2016-10-19 11:32:19,229 DEBUG: Getting /pypi/component_builder/json (serial 2410197)
```
Resolving dependencies by running `pip install -r requirements.txt` and then sniffing the results is a reeeaalllyy inefficient way to go about it. Better is going to be loading each package index into a database. Let's do this once, starting with PyPI.

First step is to basically download PyPI.