chadwhitacre opened this issue 8 years ago
New status is 2410197.
Re-ran, updated three packages, status is 2410203.
Okay!
So it looks like there will be order of magnitude 1,000s of packages to update if we run bandersnatch once a day. Call it 10% of the total, so we could probably manage with 50 GB of disk for $5/mo.
Looking at file types (h/t):
root@gdr:/mnt/pypi# find web/packages -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' > extensions &
root@gdr:/mnt/pypi# cat extensions | sort | uniq -c | sort -n
5 deb
6 dmg
25 tgz
60 rpm
64 msi
353 bz2
1464 exe
5239 egg
6556 zip
9905 whl
50351 gz
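For reference, here is a rough Python equivalent of that extension tally (a sketch only; the mirror path is assumed to match the shell session above, and the "last extension" behavior mirrors the perl regex):

from collections import Counter
from pathlib import Path

# Tally the final extension of every file under the mirror, like the
# find | perl | sort | uniq -c pipeline above (path assumed from that session).
root = Path("/mnt/pypi/web/packages")
counts = Counter(p.suffix.lstrip(".") for p in root.rglob("*") if p.is_file() and p.suffix)

for ext, n in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{n:>7} {ext}")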
What if the only tarball we have for a release is an MSI?
Or exe, more likely.
I guess let's focus on gz, whl, zip, egg, bz2, and tgz.
> So it looks like there will be order of magnitude 1,000s of packages to update if we run bandersnatch once a day. Call it 10% of the total, so we could probably manage with 50 GB of disk for $5/mo.
I guess it's a time/space trade-off. Fresh mirrors are within an hour. If we update every 30 minutes then it should be 100s or maybe even 10s of packages, and we can probably manage with 5 GB or maybe even 1 GB.
If it's under 1 GB then we can keep it on the droplet and not use a separate volume, though if we're going to run a database at Digital Ocean to back gdr.rocks then we should maybe store it on a volume for good decoupling.
Managed Postgres at DO starts at $19/mo.
Alright, let's keep this lightweight. One $5/mo droplet, local Postgres. The only reason we are mirroring PyPI is to extract dependency info, which we can't get from metadata. We don't need to store all metadata, because PyPI itself gives us a JSON API, which we can even hit from the client side if we want to (I checked: we have Access-Control-Allow-Origin: *). That should be sufficient to populate the /on/pypi/* pages.
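For example, pulling a project's metadata from the JSON API is only a few lines (a minimal stdlib sketch; the pypi.org endpoint shown is the current one, and requests is just an example project):

import json
from urllib.request import urlopen

# Fetch PyPI's JSON metadata for one project. This is the same endpoint the
# client side could hit, since it is served with Access-Control-Allow-Origin: *.
with urlopen("https://pypi.org/pypi/requests/json") as resp:
    meta = json.load(resp)

print(meta["info"]["name"], meta["info"]["version"])
print(len(meta["releases"]), "releases")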
Okay! I think I've figured out incremental updates. Bandersnatch needs a generation file or it'll start from scratch. That refers to the bandersnatch schema version, basically; 5 is the latest. And then it needs a status file with the serial number we want to start from (i.e., the last seen, ... hmm, on which side is it inclusive?). Then it needs a configuration file. That's all it needs to do an incremental update! We can rm -rf web (the directory it downloads into). We can throw away the todo file. As long as we have a conf, generation, and status, bandersnatch will happily sync based on the changelog.
Now, it will over-download, but if we process frequently enough, we should be okay. It looks like if we process every 30 minutes then we'll have well less than 100 packages to update. Packages generally have well less than 100 release files, though when Requests or Django pushes a new release we'll have a lot of old ones to download. I guess we want to tune the cron to run frequently enough to keep the modal batch size small, while still giving us enough time to complete processing for the occasional larger batch. Logging ftw.
> That should be sufficient to populate the /on/pypi/* pages.
On the other hand, the JSON is heavy (100s of kB), and the description field is a mess. We might want to do our own README analysis while we've got the tarballs cracked. Hmm ...
How about we grab READMEs while we're in there, as well as long_description from setup. That way we'll at least have them if we want to do something with them later. What if there are multiple README files? README.rst, README.md, ...
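A sketch of the README-grabbing part, just pulling any README* members out of an sdist tarball (extracting long_description safely is a different matter, since that means running setup.py; see the sandbox note next):

import tarfile

def read_readmes(sdist_path):
    """Return {member_name: text} for any README* files in an sdist tarball."""
    readmes = {}
    with tarfile.open(sdist_path, "r:*") as tar:
        for member in tar.getmembers():
            basename = member.name.rsplit("/", 1)[-1]
            if member.isfile() and basename.upper().startswith("README"):
                fileobj = tar.extractfile(member)
                if fileobj is not None:
                    readmes[member.name] = fileobj.read().decode("utf-8", "replace")
    return readmes

# e.g., read_readmes("web/packages/.../requests-2.11.1.tar.gz")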
Since we're going to be importing untrusted setup.py modules we probably still want the Docker sandbox. But we'd have it in the tarchomper process instead of in the web app.
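Something along these lines, say (a hedged sketch only: the python:3 image and the extract_deps.py extractor script are hypothetical, but the docker run flags are standard):

import subprocess

def extract_in_sandbox(sdist_dir):
    """Run a (hypothetical) setup.py-inspecting script inside a throwaway
    container with no network, so untrusted setup.py code can't phone home."""
    subprocess.run(
        [
            "docker", "run", "--rm", "--network", "none",
            "-v", f"{sdist_dir}:/work",            # mount the unpacked sdist
            "python:3",                            # assumed base image
            "python", "/work/extract_deps.py",     # hypothetical extractor script
        ],
        check=True,
        timeout=60,                                # don't let a hostile setup.py hang us
    )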
Extension finder died mid-write. :]
root@gdr:/mnt/pypi# cat extensions | sort | uniq -c | sort -n
1 g
1 ZIP
24 deb
39 dmg
187 tgz
417 rpm
424 msi
2717 bz2
11684 exe
40647 egg
50619 zip
77144 whl
391041 gz
Okay! So! tarchomper! 🎯
1. cp status status.bak to save our place in case the process crashes
2. rm -rf web todo to start from a clean slate
3. bandersnatch -c conf mirror to fetch all tarballs for packages where there have been changes
4. COPY?
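A rough sketch of that cycle in Python (tarchomper itself doesn't exist yet; the directory layout and the bandersnatch -c conf mirror invocation are the ones from this thread, and process_new_file is a hypothetical hook for the COPY step):

import shutil
import subprocess
from pathlib import Path

MIRROR = Path("/mnt/pypi")  # assumed mirror directory from the shell session above

def process_new_file(path):
    # Hypothetical hook for the "COPY?" step: extract dependency info and
    # load it into Postgres. Left as a stub here.
    print("new release file:", path)

def chomp():
    # 1. Save our place in case the process crashes.
    shutil.copy(MIRROR / "status", MIRROR / "status.bak")

    # 2. Start from a clean slate: drop the download dir and the todo file.
    shutil.rmtree(MIRROR / "web", ignore_errors=True)
    (MIRROR / "todo").unlink(missing_ok=True)      # Python 3.8+

    # 3. Fetch all release files for packages that changed since `status`.
    subprocess.run(["bandersnatch", "-c", str(MIRROR / "conf"), "mirror"], check=True)

    # 4. Hand each freshly downloaded release file to the (hypothetical) loader.
    for path in (MIRROR / "web" / "packages").rglob("*"):
        if path.is_file():
            process_new_file(path)

if __name__ == "__main__":
    chomp()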
Nomenclature update:

- package: name, e.g., requests
- release: (name, version), e.g., (requests, 2.11.1)
- release file: e.g., requests-2.11.1.tar.gz
(Note: projects listed in setup_requires will NOT be automatically installed on the system where the setup script is being run. They are simply downloaded to the ./.eggs directory if they’re not locally available already. If you want them to be installed, as well as being available when the setup script is run, you should add them to install_requires and setup_requires.)
http://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords
Okay, let's not worry about setup_requires. We don't need test_requires either, since that's a dependency of the project itself, not the project's users.

On the other hand we should include extras_require only if the extras are in use by the downstream package/application. Hmm ... optionality.
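For concreteness, here is a hypothetical setup.py showing which keys we'd record and which we'd skip (the package and dependency names are made up; only the setuptools keywords are real):

from setuptools import setup

# Hypothetical package metadata, just to illustrate which keys we'd record.
setup(
    name="foo",
    version="1.0",
    install_requires=[                 # always a real runtime dependency -> record it
        "requests>=2.11",
    ],
    extras_require={                   # optional; only counts if a downstream user
        "security": ["pyOpenSSL"],     # actually depends on foo[security]
    },
    setup_requires=["pytest-runner"],  # build-time only -> skip, per above
    tests_require=["pytest"],          # test-time only -> skip, per above
)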
Blorg. Tests are failing after installing bandersnatch, because it install_requires some pytest plugins. I guess the workaround is to manually uninstall these. We'll have to teach Travis to do the same.
PR in #5.
In light of the shift of focus at https://github.com/gratipay/gratipay.com/pull/4135#issuecomment-254988315, I've removed the droplet, volume, and floating IP from Digital Ocean to avoid incurring additional cost.
Resolving dependencies by running pip install -r requirements.txt and then sniffing the results is a reeeaalllyy inefficient way to go about it. Better is going to be loading each package index into a database. Let's do this once, starting with PyPI.

First step is to basically download PyPI.