pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

stale gzip encoded /simple/ index on pypi.org #5494

Open coryb opened 5 years ago

coryb commented 5 years ago

Describe the bug

The https://pypi.org/simple/ index is stale when fetched with gzip encoding.

Expected behavior

The x-pypi-last-serial header should be the same, or close, when fetching the /simple/ index with and without gzip compression.

To Reproduce

Fetch serial for non-compressed index:

$ curl -qs -D- https://pypi.org/simple/ | grep last-serial
x-pypi-last-serial: 4872850

Fetch serial for compressed index:

$ curl -qs -D- -H "Accept-Encoding: gzip" https://pypi.org/simple/ | grep last-serial
x-pypi-last-serial: 4869648

Currently the difference is ~3200 generations.
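The two fetches above can be turned into a repeatable check. This is only a sketch: the helper names extract_serial, serial_lag, and fetch_serial are made up for illustration, and it assumes curl and a POSIX awk are available. It parses the x-pypi-last-serial header out of a response and reports how far apart two serials are.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of PyPI or warehouse.

# Read HTTP response headers on stdin and print the x-pypi-last-serial value.
extract_serial() {
    tr -d '\r' | awk -F': *' 'tolower($1) == "x-pypi-last-serial" { print $2; exit }'
}

# Print the absolute difference between two serial numbers.
serial_lag() {
    if [ "$1" -ge "$2" ]; then
        echo $(( $1 - $2 ))
    else
        echo $(( $2 - $1 ))
    fi
}

# Fetch the serial for one Accept-Encoding value (requires network access).
fetch_serial() {
    curl -qs -D- -o /dev/null -H "Accept-Encoding: $1" https://pypi.org/simple/ | extract_serial
}

# Example usage (not run automatically):
#   serial_lag "$(fetch_serial identity)" "$(fetch_serial gzip)"
```

With the two serials reported above, serial_lag 4872850 4869648 prints 3202.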

My Platform

Noticed this with an internal corporate proxy that requests gzip'd data by default for performance reasons. We can request uncompressed data, but that would just waste your network bandwidth.

Additional context

coryb commented 5 years ago

Some more curious data points. Using Accept-Encoding: deflate returns the most recent serial, but the server does not actually compress the content; the response is raw:

$ curl -qs -D- -H "Accept-Encoding: deflate" -o /dev/null https://pypi.org/simple/ | grep -E serial\|length
x-pypi-last-serial: 4880578
content-length: 9112165

The identity request is actually older than the uncompressed-deflate request:

$ curl -qs -D- -H "Accept-Encoding: identity" -o /dev/null https://pypi.org/simple/ | grep -E serial\|length
x-pypi-last-serial: 4877832
content-length: 9108586

The gzip request returns the oldest serial number, but the response is actually compressed:

$ curl -qs -D- -H "Accept-Encoding: gzip" -o /dev/null https://pypi.org/simple/ | grep -E serial\|length
x-pypi-last-serial: 4874125
content-length: 1400982
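
The three requests above could be collected into one side-by-side report. The following is a rough sketch under the same assumptions (curl and a POSIX awk); summarize_headers and compare_encodings are hypothetical names, not existing tools.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of PyPI or warehouse.

# Read HTTP response headers on stdin and print "<serial> <content-length>".
summarize_headers() {
    tr -d '\r' | awk -F': *' '
        tolower($1) == "x-pypi-last-serial" { serial = $2 }
        tolower($1) == "content-length"     { len = $2 }
        END { print serial, len }
    '
}

# Print one line per Accept-Encoding value: "<encoding> <serial> <content-length>".
# Requires network access.
compare_encodings() {
    for enc in "$@"; do
        printf '%s %s\n' "$enc" \
            "$(curl -qs -D- -o /dev/null -H "Accept-Encoding: $enc" https://pypi.org/simple/ | summarize_headers)"
    done
}

# Example usage (not run automatically):
#   compare_encodings deflate identity gzip
```

A content-length close to 9 MB on a line indicates the body was sent uncompressed, as in the deflate and identity results above.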

bmoyles commented 5 years ago

After poking around, it seems like this might be the same as #4892