gratipay / gdr.rocks

GDR = Gratipay Dependency Resolver
http://gdr.rocks/

Load PyPI dependency info into a database #2

Open · chadwhitacre opened this issue 8 years ago

chadwhitacre commented 8 years ago

Resolving dependencies by running pip install -r requirements.txt and then sniffing the results is a reeeaalllyy inefficient way to go about it. Better is going to be loading each package index into a database. Let's do this once, starting with PyPI.
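Something like this is the target shape, as a rough sketch (table and column names here are placeholders, not a settled design):

import psycopg2

# Rough sketch of the target schema; all names are placeholders.
DDL = """
CREATE TABLE IF NOT EXISTS packages (
    name text PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS releases (
    package text REFERENCES packages(name),
    version text,
    PRIMARY KEY (package, version)
);
CREATE TABLE IF NOT EXISTS dependencies (
    package     text,
    version     text,
    requirement text,  -- e.g. 'requests>=2.0', straight from install_requires
    FOREIGN KEY (package, version) REFERENCES releases
);
"""

with psycopg2.connect("dbname=gdr") as conn:  # connection string is a placeholder
    with conn.cursor() as cur:
        cur.execute(DDL)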

First step is to basically download PyPI.

chadwhitacre commented 8 years ago

New status is 2410197.

chadwhitacre commented 8 years ago

Re-ran, updated three packages, status is 2410203.

chadwhitacre commented 8 years ago

Okay!

chadwhitacre commented 8 years ago

So it looks like there will be on the order of 1,000s of packages to update if we run bandersnatch once a day. Call it 10% of the total, so we could probably manage with 50 GB of disk for $5/mo.

chadwhitacre commented 8 years ago

Looking at file types (h/t):

root@gdr:/mnt/pypi# find web/packages -type f | perl -ne 'print "$1\n" if m/\.([^.\/]+)$/' > extensions &
root@gdr:/mnt/pypi# cat extensions | sort | uniq -c | sort -n
      5 deb
      6 dmg
     25 tgz
     60 rpm
     64 msi
    353 bz2
   1464 exe
   5239 egg
   6556 zip
   9905 whl
  50351 gz

chadwhitacre commented 8 years ago

What if the only tarball we have for a release is an MSI?

chadwhitacre commented 8 years ago

Or exe, more likely.

chadwhitacre commented 8 years ago

I guess let's focus on gz, whl, zip, egg, bz2, and tgz.
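In the tarball walker that's just a suffix whitelist; a trivial sketch (the names here are mine, not the repo's):

# Suffixes we'll process; everything else (exe, msi, rpm, deb, dmg) is skipped.
HANDLED = ('.gz', '.whl', '.zip', '.egg', '.bz2', '.tgz')

def is_handled(filename):
    return filename.lower().endswith(HANDLED)  # endswith accepts a tuple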

chadwhitacre commented 8 years ago

> So it looks like there will be on the order of 1,000s of packages to update if we run bandersnatch once a day. Call it 10% of the total, so we could probably manage with 50 GB of disk for $5/mo.

I guess it's a time/space trade-off. A freshly synced mirror is at most an hour behind upstream. If we update every 30 minutes then it should be 100s or maybe even 10s of packages per run, and we can probably manage with 5 GB or maybe even 1 GB.
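Rough arithmetic behind those numbers (the ~500 GB full-mirror size is an assumption, used only to make the 10% figure concrete):

full_mirror_gb = 500   # assumed size of a full PyPI mirror
daily_churn = 0.10     # ~10% of packages touched per day, per the estimate above

daily_gb = full_mirror_gb * daily_churn  # ~50 GB staged per daily run
per_30min_gb = daily_gb / 48             # ~1 GB per 30-minute run
print(daily_gb, round(per_30min_gb, 2))  # 50.0 1.04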

chadwhitacre commented 8 years ago

If it's under 1 GB then we can keep it on the droplet and not use a separate volume, though if we're going to run a database at Digital Ocean to back gdr.rocks then we should maybe store it on a volume for good decoupling.

chadwhitacre commented 8 years ago

Managed Postgres at DO starts at $19/mo.

chadwhitacre commented 8 years ago

Alright, let's keep this lightweight. One $5/mo droplet, local Postgres. The only reason we are mirroring PyPI is to extract dependency info, which we can't get from metadata. We don't need to store all metadata, because PyPI itself gives us a JSON API, which we can even hit from the client side if we want to (I checked: we have Access-Control-Allow-Origin: *). That should be sufficient to populate the /on/pypi/* pages.
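For reference, hitting that JSON API is a single GET, and the CORS header can be checked on the same response (a sketch; the endpoint nowadays lives at pypi.org, and requests here is just an example package):

import json
from urllib.request import urlopen

# Fetch release metadata for one package from PyPI's JSON API.
with urlopen("https://pypi.org/pypi/requests/json") as resp:
    cors = resp.headers.get("Access-Control-Allow-Origin")  # expect '*'
    meta = json.load(resp)

print(cors, meta["info"]["version"])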

chadwhitacre commented 8 years ago

Okay! I think I've figured out incremental updates. Bandersnatch needs a generation file or it'll start from scratch. That refers to the bandersnatch schema version, basically. 5 is the latest. And then it needs a status file with the serial number we want to start from (i.e., the last seen, ... hmm—on which side is it inclusive?). Then it needs a configuration file. That's all it needs to do an incremental update! We can rm -rf web (the directory it downloads into). We can throw away the todo file. As long as we have a conf, generation, and status, bandersnatch will happily sync based on the changelog.

Now, it will over-download, but if we process frequently enough, we should be okay. It looks like if we process every 30 minutes then we'll have well less than 100 packages to update. Packages generally have well less than 100 release files, though when Requests or Django pushes a new release we'll have a lot of old ones to download. I guess we want to tune the cron to run frequently enough to keep the modal batch size small, while still giving us enough time to complete processing for the occasional larger batch. Logging ftw.
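A sketch of that bootstrap, under the assumptions above (the paths and the serial value are illustrative):

import pathlib
import subprocess

mirror = pathlib.Path("/mnt/pypi")  # assumed mirror root

# The bandersnatch schema version; 5 is the latest (see above).
(mirror / "generation").write_text("5")

# The changelog serial to resume from (illustrative value from earlier).
(mirror / "status").write_text("2410203")

# With conf, generation, and status in place, this syncs incrementally.
subprocess.check_call(["bandersnatch", "-c", str(mirror / "conf"), "mirror"])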

chadwhitacre commented 8 years ago

> That should be sufficient to populate the /on/pypi/* pages.

On the other hand, the JSON is heavy (100s of kB), and the description field is a mess. We might want to do our own README analysis while we've got the tarballs cracked. Hmm ...

chadwhitacre commented 8 years ago

How about we grab READMEs while we're in there, as well as long_description from setup.py. That way we'll at least have them if we want to do something with them later. What if there are multiple README files? README.rst, README.md, ...
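As a sketch, collecting every top-level README from an sdist sidesteps the multiple-README question by just keeping all of them (long_description still means running setup.py, which is the sandbox question below):

import tarfile

def extract_readmes(path):
    # Collect every top-level README* from an sdist tarball.
    readmes = {}
    with tarfile.open(path) as tar:
        for member in tar.getmembers():
            parts = member.name.split("/")
            # sdists unpack to <name>-<version>/, so top level is depth two
            if member.isfile() and len(parts) == 2 \
                    and parts[1].upper().startswith("README"):
                readmes[parts[1]] = tar.extractfile(member).read()
    return readmes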

chadwhitacre commented 8 years ago

https://docs.python.org/2/distutils/setupscript.html#additional-meta-data

chadwhitacre commented 8 years ago

https://docs.python.org/2/distutils/packageindex.html

chadwhitacre commented 8 years ago

Since we're going to be importing untrusted setup.py modules, we probably still want the Docker sandbox.

chadwhitacre commented 8 years ago

But we'd have it in the tarchomper process instead of in the web app.
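Something like this, as a sketch (the image name and resource cap are placeholders; the point is no network and a read-only mount):

import subprocess

def run_setup_sandboxed(sdist_dir):
    # Query an untrusted setup.py inside a locked-down container.
    subprocess.check_call([
        "docker", "run", "--rm",
        "--network", "none",           # no phoning home
        "--memory", "256m",            # placeholder resource cap
        "-v", sdist_dir + ":/src:ro",  # source mounted read-only
        "-w", "/src",
        "python:2.7",                  # placeholder image
        "python", "setup.py", "--name", "--version",
    ])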

chadwhitacre commented 8 years ago

Extension finder died mid-write. :]

root@gdr:/mnt/pypi# cat extensions | sort | uniq -c | sort -n
      1 g
      1 ZIP
     24 deb
     39 dmg
    187 tgz
    417 rpm
    424 msi
   2717 bz2
  11684 exe
  40647 egg
  50619 zip
  77144 whl
 391041 gz

chadwhitacre commented 8 years ago

Okay! So! tarchomper! 🎯

chadwhitacre commented 8 years ago

  1. cp status status.bak to save our place in case the process crashes
  2. rm -rf web todo to start from a clean slate
  3. bandersnatch -c conf mirror to fetch all tarballs for packages where there have been changes
  4. walk the tree for the new tarballs
  5. for each tarball, open it up and extract the info we need
  6. spit out SQL (COPY?)
  7. run the SQL
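As a single sketch, with extract_info and run_copy standing in as placeholders for steps 5 through 7:

import pathlib
import shutil
import subprocess

MIRROR = pathlib.Path("/mnt/pypi")  # assumed mirror root

def chomp():
    # 1. save our place in case the process crashes
    shutil.copy(MIRROR / "status", MIRROR / "status.bak")
    # 2. start from a clean slate
    shutil.rmtree(MIRROR / "web", ignore_errors=True)
    (MIRROR / "todo").unlink(missing_ok=True)
    # 3. fetch all tarballs for packages where there have been changes
    subprocess.check_call(["bandersnatch", "-c", str(MIRROR / "conf"), "mirror"])
    # 4 + 5. walk the tree; open each new tarball and extract what we need
    rows = []
    for path in (MIRROR / "web" / "packages").rglob("*"):
        if path.is_file():
            rows.extend(extract_info(path))  # placeholder for step 5
    # 6 + 7. spit out SQL and run it
    run_copy(rows)  # placeholder: COPY the rows into Postgres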
chadwhitacre commented 8 years ago

Nomenclature update:

chadwhitacre commented 8 years ago

> (Note: projects listed in setup_requires will NOT be automatically installed on the system where the setup script is being run. They are simply downloaded to the ./.eggs directory if they’re not locally available already. If you want them to be installed, as well as being available when the setup script is run, you should add them to install_requires and setup_requires.)

http://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords

Okay, let's not worry about setup_requires. We don't need tests_require either, since that's a dependency of the project's own test suite, not of the project's users.

On the other hand, we should include extras_require, but only if the extras are in use by the downstream package/application. Hmm ... optionality.
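The optionality is already encoded in the requirement syntax: a downstream requirement only activates the extras it names. A quick illustration with pkg_resources:

from pkg_resources import Requirement

# extras_require entries only apply when the downstream requirement
# names the extra explicitly.
plain = Requirement.parse("requests>=2.0")
secure = Requirement.parse("requests[security]>=2.0")

print(plain.extras)   # ()
print(secure.extras)  # ('security',)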

chadwhitacre commented 8 years ago

Blorg. Tests are failing after installing bandersnatch, because it install_requires some pytest plugins. I guess the workaround is to manually uninstall these. We'll have to teach Travis to do the same.

chadwhitacre commented 8 years ago

PR in #5.

chadwhitacre commented 8 years ago

In light of the shift of focus at https://github.com/gratipay/gratipay.com/pull/4135#issuecomment-254988315, I've removed the droplet, volume, and floating IP from Digital Ocean to avoid incurring additional cost.