Backwards-compatible un-hashed package paths

ewdurbin commented 8 years ago

Originally reported by: Johannes Löthberg (Bitbucket: kyrias, GitHub: kyrias)

Hello,

For Arch Linux packages we very often use PyPI URLs to download the tarballs of software. The old process for updating was generally just changing the package version and updating the hash, and it would Just Work™. But with the new hashed package paths we need to go to the PyPI website manually and copy the URL to be able to update. This gets really annoying when having to update a lot of python packages.

Is there any chance that you could add a new URL path that lets you get all files without having to specify the hash? I understand why it was added, but it's useless for us since we check the checksum of the file ourselves, and just makes the process more annoying.

Bitbucket: https://bitbucket.org/pypa/pypi/issue/438

ewdurbin commented 8 years ago

Original comment by Donald Stufft (Bitbucket: dstufft, GitHub: dstufft):

@msarahan That was a bug, thanks for finding it, fix is at https://github.com/pypa/warehouse/pull/1132

ewdurbin commented 8 years ago

Original comment by Sean Farley (Bitbucket: seanfarley, GitHub: seanfarley):

I can't believe this is going through. I promise the number of busted packages and users affected will far outnumber whatever you see now.

ewdurbin commented 8 years ago

Original comment by Michael Sarahan (Bitbucket: msarahan, GitHub: msarahan):

Is there a new equivalent for XML-RPC, as documented at https://wiki.python.org/moin/PyPIXmlRpc? I can't get it to work with just changing pypi.python.org to pypi.io.

Traceback at https://gist.github.com/msarahan/8b8e06f1ef1a5823d09b4cefc7465412

ewdurbin commented 8 years ago

Original comment by Wesley Workman (Bitbucket: workmanw, GitHub: workmanw):

Oh, yeap. So it has. I checked yesterday and it was still busted. Today I just checked the original URL (https://pypi.python.org/packages/source/s/setuptools/setuptools-20.10.1.zip) instead of checking the script for changes. My mistake.

ewdurbin commented 8 years ago

Original comment by Donald Stufft (Bitbucket: dstufft, GitHub: dstufft):

ez_setup.py is already fixed.

ewdurbin commented 8 years ago

Original comment by Wesley Workman (Bitbucket: workmanw, GitHub: workmanw):

This also effects ez_setup.py and AFAIK a lot of people using buildout (see https://bootstrap.pypa.io/ez_setup.py [search for DEFAULT_URL]). We've had to fork the repo in order to restore our build capabilities. I'm just wondering if there is a timeline for getting this fix deployed.

ewdurbin commented 8 years ago

Original comment by Nico Kadel-Garcia (Bitbucket: nkadel, GitHub: nkadel):

Thanks for this. What is the timetable on that switch to PyPI 2.0? If it's not immediate, I'd call that fix "pending", not "resolved", if this ticket system supports it.

ewdurbin commented 8 years ago

Original comment by Donald Stufft (Bitbucket: dstufft, GitHub: dstufft):

Ok, I've deployed a fix to Warehouse (pypi.io // test.pypi.io) which is going to be PyPI 2.0 that will redirect the old URLs to the new URLs. I plan on switching legacy PyPI out and putting Warehouse in in the near future. I don't particularly feel like diving into legacy pypi's code base to muck around adding this to that, but if someone feels motivated to do it I can review a PR and deploy it.

For completeness sake, here's what this looks liked on Warehouse before the change:

$ curl -I https://pypi.io/packages/source/p/packaging/packaging-16.7.tar.gz
HTTP/1.1 404 Not Found
Content-Type: application/xml
Cache-Control: max-age=60, public
Fastly-Debug-Digest: 4b3d8ffedff15053cfc2124a3a46380df94276999bc75d4e4ec0ed2a1f54d629
Content-Length: 311
Accept-Ranges: bytes
Date: Sun, 24 Apr 2016 17:30:51 GMT
Age: 0
Connection: keep-alive
X-Served-By: cache-sea1923-SEA, cache-jfk1030-JFK
X-Cache: MISS, MISS
X-Cache-Hits: 0, 0
X-Timer: S1461519051.442718,VS0,VE99
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Frame-Options: deny
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Permitted-Cross-Domain-Policies: none

and here's what it looks like after the change:

$ curl -L -I https://pypi.io/packages/source/p/packaging/packaging-16.7.tar.gz
HTTP/1.1 301 Moved Permanently
Location: https://pypi.io/packages/28/ad/4e6601d14b11bb300719a8bb6247f6ef5861467a692523c978a4e9e3981a/packaging-16.7.tar.gz
Cache-Control: max-age=31536000, public
Content-Security-Policy: base-uri 'self'; block-all-mixed-content; connect-src 'self'; default-src 'none'; font-src 'self' fonts.gstatic.com; form-action 'self'; frame-ancestors 'none'; frame-src 'none'; img-src 'self' https://warehouse-camo.herokuapp.com/ https://secure.gravatar.com; referrer origin-when-cross-origin; reflected-xss block; script-src 'self'; style-src 'self' fonts.googleapis.com
Content-Type: text/html; charset=UTF-8
Fastly-Debug-Digest: e66f5cd5e7397d3d5c9616e0d1a3cfddfbe1660071e1bce2becc00643959f1ea
Content-Length: 295
Accept-Ranges: bytes
Date: Sun, 24 Apr 2016 18:36:39 GMT
Age: 100
Connection: keep-alive
X-Served-By: cache-iad2145-IAD, cache-jfk1030-JFK
X-Cache: MISS, HIT
X-Cache-Hits: 0, 1
X-Timer: S1461522999.470717,VS0,VE4
Vary: Accept-Encoding
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Frame-Options: deny
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Permitted-Cross-Domain-Policies: none

HTTP/1.1 200 OK
Last-Modified: Sat, 23 Apr 2016 22:16:49 GMT
ETag: "5bfeb52de8dee2fcc95a003b0ebe9011"
Content-Type: application/octet-stream
Cache-Control: max-age=31557600, public
Fastly-Debug-Digest: d315e9785068d13b023171d0245c7639d68a83c6ba1cecc5c7a57d6e90afc5a4
Content-Length: 44454
Accept-Ranges: bytes
Date: Sun, 24 Apr 2016 18:36:39 GMT
Age: 12979
Connection: keep-alive
X-Served-By: cache-sea1926-SEA, cache-jfk1030-JFK
X-Cache: HIT, HIT
X-Cache-Hits: 1, 1
X-Timer: S1461522999.487357,VS0,VE4
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Frame-Options: deny
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Permitted-Cross-Domain-Policies: none

ewdurbin commented 8 years ago

Original comment by Roman Bogorodskiy (Bitbucket: novel, GitHub: novel):

The same for FreeBSD ports as well.

Also, I'm wondering if re-uploads could be handled this by URL scheme something like this:

Original (0th) upload: /packages/{python version}/{name[0]}/{name}/{filename}
All the following (1..N) uploads: /packages/{python version}/{name[0]}/{name}/{upload_number}/{filename}

Anyway, IMHO, tarball rerolling should not be encouraged because it creates an additional burden for users and packages as they need to somehow make sure that no harmful changes were added while re-rolling, so package maintainers should probably try to avoid that practice and bump version instead.

ewdurbin commented 8 years ago

Original comment by Edd Barrett (Bitbucket: vext01, GitHub: vext01):

Same issue for OpenBSD ports.

ewdurbin commented 8 years ago

Original comment by Nico Kadel-Garcia (Bitbucket: nkadel, GitHub: nkadel):

There is an XKCD cartoon about this sort of workflow problem: https://xkcd.com/1172/ .But it really seems like this solution is making the man stand funny to make the suit look nice, described at (http://www.realnothings.com/famous%20jokes/suit.htm

I'm glad it's not breaking pip based installations, The problem for chef developers is a similar one to RPM developers, and I happen to be both for various open source toolkits. If we want or need to simply grab an updated, or obsolete, version of the tarball, we need some sort of simple command line way to grab that tarball, and 'wget' has been a much more scriptable and reliable way to get it, until now. This unexpected change is also breaking thousands of "Source" URLs in many RPM based operating systems. They're not normally used at build time, but revising them in the development system or in the .spec file for stable python module build tools is a lot of work. That includes over 200 SRPMs that I pursonally maintain, and roughly 1500 in the current Fedora development release.

It's a bigger problem than you may have realized.

ewdurbin commented 8 years ago

Original comment by Donald Stufft (Bitbucket: dstufft, GitHub: dstufft):

From my previous message:

I personally see affected users every couple of weeks without attempting to look for them (often times they're reaching out to me personally).

ewdurbin commented 8 years ago

Original comment by Chris Warrick (Bitbucket: Kwpolska, GitHub: Kwpolska):

How often did you actually experience the problems described (edit: that lead to the change)? Or is it a purely theoretical problem?

Because the breakage is real, and it should be fixed quickly.

ewdurbin commented 8 years ago

Original comment by Donald Stufft (Bitbucket: dstufft, GitHub: dstufft):

It doesn't appear to affect py2pack at all, as it looks like it uses one of the documented PyPI APIs to determine what URLs are available instead of just guessing based on an URL structure. I believe that Chef is also not affected since AFAIK they also use the documented APIs afaik. It's also not breaking wget (not sure how we could break wget in general), it's only breaking the ability to type in wget without looking up the URL ahead of time... the problem is you already can't do that in the general case because the URLs are case sensitive so you don't know if someone uploaded say django-1.0.tar.gz or Django-1.0.tar.gz or DjAnGo-1.0.tar.gz.

The problem it's solving is not particularly limited, I personally see affected users every couple of weeks without attempting to look for them (often times they're reaching out to me personally). I can only assume that the total affected users is larger than that since I doubt that every single person affected happens to idle in IRC channels that I idle in or know enough to reach out to me.

ewdurbin commented 8 years ago

Original comment by Sean Farley (Bitbucket: seanfarley, GitHub: seanfarley):

This seems like massive overkill for a limited problem.

Yeah, no joke. I also see this as more annoying than it's worth.

ewdurbin commented 8 years ago

Original comment by Nico Kadel-Garcia (Bitbucket: nkadel, GitHub: nkadel):

This seems like massive overkill for a limited problem. The resetting of a common, stable URL structure will break update workflow for many build and update environments, many of which are written in bash, Ruby, or obsolete but stable versions of Python that have never required such complexity. It's particularly likely to break chef and py2pack tools that I use for RPM based python module management, to ensure consistent versions and module control for my environments rather than simply running "pip install" and getting who knows what versions of dependencies on dependencies that are likely to break stable environments.

The benefit of this obfuscation of package locations seems very small, and also likely to interfere with developers who want simply download local copies of the tarballs for different releases for side-by-side code review. 'wget' works fine for this right now. Please don't break such useful tools.

ewdurbin commented 8 years ago

Original comment by Donald Stufft (Bitbucket: dstufft, GitHub: dstufft):

In addition to the above, I'm going to see about the possibility of running a translator service ourselves.

ewdurbin commented 8 years ago

Original comment by Donald Stufft (Bitbucket: dstufft, GitHub: dstufft):

Hi!

So here's a copy/paste that I sent to someone else about this issue:

So, previously PyPI used URLs like : /packages/{python version}/{name[0]}/{name}/{filename}

Now it uses: /packages/{hash[:2]}/{hash[2:4]}/{hash[4:]}/{filename} Where hash is blake2b(file_content, digest_size=32).hexdigest().lower()

There are a few reasons for this:

We generally do not allow people to delete a file and re-upload the same version again. However the old lay out generally means that we can't do that even if we wanted to because HTTP clients will use the URL as the key for a cache and thus it can never change (other than to be deleted).
The file system is not transactional and isn't part of the database, which means we get put in a funny pickle where we have to decide if we persist the change to the file system prior to committing the transaction or after committing. Both ways have their ups and downs and neither solves all of the issues. In general, on upload we try to save the file prior to committing because once it's been committed downstream users will expect it to exist and if we haven't saved the file to disk yet it may not yet exist yet (and if saving fails, it may never exist).

However, this raises a problem. We're currently using Amazon S3 to save files which is an eventually consistent data store. When writing a brand new file it will be (in the S3 region we're using) available immediately after writing a new file, however for writing a file that has already existed it can take some time for it to be consistent (reportedly being able to take up to hours for this to occur). This leaves us in a sticky situation where someone can run this:
```
setup.py sdist upload
```
And have PyPI accept the upload, write it to S3 and then fail to commit the upload. Then when the user re-runs that we'll write the file to S3 again (however it will have changed contents because setup.py sdist is not deterministic) and then commit the database, succeeding this time. If this happens then in the time period between when the database commits and when Amazon S3 has yet to update the file to the latest version (possibly taking hours) everyone is going to fail downloading/installing that file because the hash we're getting from Amazon S3 isn't going to match the hash that we have recorded in the PyPI database. To make this even more painful, we utilize download caching of the files pretty heavily and to do that we make the assumption that the contents at the URL will never change. So not only will it be broken in that window before Amazon S3 has become consistent, it will be persistently broken for anyone who attempted to install it until they go out of their way to delete their cache. By making the URL determined by the contents of the file, we make it so repeating the same upload with different contents will by definition end up with a different URL side stepping the entire problem.
When a file gets deleted from PyPI we have to delete it from the backing store too because the URL is predictable and people attempt to short circuit the Simple Repository API and we want a file deletion to, by default, mean that people don't discover that version. However, this flies in the face of people who use the simple repository API to resolve a version (or the Web UI) who then want to resolved URL into something with the expectation it will not change or go away. This change allows us to simply stop deleting files, so that if someone bakes a file URL into something it can continue to work into perpetuity without people accidentally installing that through simple URL building in the end user software.

Now even though the specific location of the file has not been considered part of our "API" nonetheless people have over time baked in assumptions about that URL scheme in various things, and obviously this change will break those things. So then how should someone deal with this change?

Well, the simplest (though perhaps not the least effort) is to remove whatever assumptions have been made and replace them with the new URL structure. This will fix things today, but it may or may not be the case that tomorrow the URL structure changes again.

Another option is to discover the final URL using a method similar to what pip does. The protocol is documented in PEP 503, but generally what you need to do is look at /simple// and see what links are available there. That will tell you all of the files that currently exist for that project.

Yet another option is to run a sort of "translator" service that can consume the PyPI JSON API and will output the URLs in whatever format best suites you. An example of this is pypi.debian.net (which I don't know where the code base for it is, but the proof of concept I wrote for it is at https://github.com/dstufft/pypi-debian). These translators are fairly simple, they take an URL, pull the project and filename out of it and then use the JSON API to figure out the "real" URL and then just simply redirects to that.

ewdurbin commented 8 years ago

Original comment by Davide Liessi (Bitbucket: dliessi, GitHub: dliessi):

The same for MacPorts. Very annoying.

ewdurbin commented 8 years ago

Original comment by Sander Hoentjen (Bitbucket: tjikkun, GitHub: tjikkun):

As a Fedora maintainer I am also currently experiencing this.

ewdurbin commented 8 years ago

Original comment by kwilcox (Bitbucket: kwilcox, GitHub: kwilcox):

I am in a similar situation as @kyrias with my package requirements... it is now impossible to determine the tarball download link using the package name. I was also just swapping out the version number and MD5 to update packages and this is no longer possible.

toddrme2178 commented 6 years ago

I am getting a 404 when I try to download a wheel from the following address: https://files.pythonhosted.org/packages/py2.py3/n/numericalunits/numericalunits-1.17-py2.py3-none-any.whl, but the corresponding hash-based address still works: https://pypi.python.org/packages/89/2a/950938408b4eb49649802e49646c37a7caa57364e54dc2d832a71923475d/numericalunits-1.17-py2.py3-none-any.whl

alhirzel commented 5 years ago

I believe something about this interface recently became case-sensitive (noticed when packaging OMPython).

pypi / legacy

Backwards-compatible un-hashed package paths #438