pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0
3.54k stars 955 forks source link

ETag on artifacts seems inconsistent #15387

Open thatch opened 7 months ago

thatch commented 7 months ago

Describe the bug I'm trying to issue Range requests for some bytes out of pypi wheels. The ETag response header is not constant, and I think it should be.

The immediate application is some code to extract metadata (yes, I know it's being backfilled), but would like to use this same code as a stopgap to enable targeted malware scanning of large wheels without downloading the whole thing.

I need to figure out if this is something that can be fixed, or if I need to work around changing etags by sniffing the Cache-Control: immutable or something.

Expected behavior Requesting ranges from the beginning and end of wheels returns the same etag and content-type.

To Reproduce

$ curl -I https://files.pythonhosted.org/packages/88/8a/76a32ba459b4c376cc3780dca0f600bbbb63b3610249a068f7eb20991ee3/pandas-2.2.0-cp310-cp310-macosx_10_9_x86_64.whl 
...
etag: "40d570a6b0f1aa3c1de02d4bf7b8e3ae-2"
content-type: binary/octet-stream

$ curl -I -H "Range: bytes=0-1" https://files.pythonhosted.org/packages/88/8a/76a32ba459b4c376cc3780dca0f600bbbb63b3610249a068f7eb20991ee3/pandas-2.2.0-cp310-cp310-macosx_10_9_x86_64.whl
...
etag: "40d570a6b0f1aa3c1de02d4bf7b8e3ae-2"
content-type: binary/octet-stream

$ curl -I -H "Range: bytes=10000000-10000001" https://files.pythonhosted.org/packages/88/8a/76a32ba459b4c376cc3780dca0f600bbbb63b3610249a068f7eb20991ee3/pandas-2.2.0-cp310-cp310-macosx_10_9_x86_64.whl
etag: "a0ee78f286e2d5f37a25879378e0c316"
content-type: application/octet-stream

The last-modified dates also differ by a few seconds, and the server and x-amz-server-side-encryption response headers are only present when requesting the beginning of the file. The cutoff appears to be at 3145728 bytes (3 * 2**20).

I'm assuming this is either bad cached content (if the etag has definitively changed) or a misbehavior in s3 (the docs are unclear on whether it can change over time, and my read of the mozilla writeup on etags is that it shouldn't depend on the range). Happy to work around it if you have ideas for doing so safely. The caching layer doesn't appear to support If-Range requests so I went down the road of verifying them myself.

ewdurbin commented 7 months ago

I cannot find docs on the expected contents of our B2 backends ETag, but S3 is pretty explicit

ETag The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below: ...

In this case, I did confirm that the stored objects are identical at least.

It is clear in your case that some responses are coming from S3 and some from B2 which leads to the ETag mismatch. Due to how our caches are populated there is a time where objects (or segments of objects) may be pulled from S3 before B2 has the object.

It doesn't appear that there is any header which carries a common value we could use to ensure consistency of ETag across resources.

Should we just stop serving the header to avoid confusion?

dstufft commented 7 months ago

I think we can set our own ETag in vcl_fetch, probably right here https://github.com/pypi/infra/blob/main/terraform/file-hosting/vcl/files.vcl#L154

We would just need a value to set that ETag too. We have a hash in the URL we could extract and use, or even sillier:

set beresp.http.ETag = digest.hash_md5(req.url);

Or we could add a value here: https://github.com/pypi/warehouse/blob/main/warehouse/forklift/legacy.py#L1455-L1459 which would be available as beresp.http.x-pypi-whatever (I think? I don't see where the pypi- prefix is being added, but we're referencing it here: https://github.com/pypi/infra/blob/main/terraform/file-hosting/vcl/files.vcl#L589).

dstufft commented 7 months ago

I also think it's fine to remove ETag, we have cache headers that say to cache these responses for effectively forever, so I don't think we should be getting revalidate checks very often.