Open thatch opened 7 months ago
I cannot find docs on the expected contents of our B2 backends ETag, but S3 is pretty explicit
ETag The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below: ...
In this case, I did confirm that the stored objects are identical at least.
It is clear in your case that some responses are coming from S3 and some from B2 which leads to the ETag mismatch. Due to how our caches are populated there is a time where objects (or segments of objects) may be pulled from S3 before B2 has the object.
It doesn't appear that there is any header which carries a common value we could use to ensure consistency of ETag across resources.
Should we just stop serving the header to avoid confusion?
I think we can set our own ETag
in vcl_fetch
, probably right here https://github.com/pypi/infra/blob/main/terraform/file-hosting/vcl/files.vcl#L154
We would just need a value to set that ETag
too. We have a hash in the URL we could extract and use, or even sillier:
set beresp.http.ETag = digest.hash_md5(req.url);
Or we could add a value here: https://github.com/pypi/warehouse/blob/main/warehouse/forklift/legacy.py#L1455-L1459 which would be available as beresp.http.x-pypi-whatever
(I think? I don't see where the pypi-
prefix is being added, but we're referencing it here: https://github.com/pypi/infra/blob/main/terraform/file-hosting/vcl/files.vcl#L589).
I also think it's fine to remove ETag
, we have cache headers that say to cache these responses for effectively forever, so I don't think we should be getting revalidate checks very often.
Describe the bug I'm trying to issue Range requests for some bytes out of pypi wheels. The ETag response header is not constant, and I think it should be.
The immediate application is some code to extract metadata (yes, I know it's being backfilled), but would like to use this same code as a stopgap to enable targeted malware scanning of large wheels without downloading the whole thing.
I need to figure out if this is something that can be fixed, or if I need to work around changing etags by sniffing the
Cache-Control: immutable
or something.Expected behavior Requesting ranges from the beginning and end of wheels returns the same etag and content-type.
To Reproduce
The
last-modified
dates also differ by a few seconds, and theserver
andx-amz-server-side-encryption
response headers are only present when requesting the beginning of the file. The cutoff appears to be at 3145728 bytes (3 * 2**20
).I'm assuming this is either bad cached content (if the etag has definitively changed) or a misbehavior in s3 (the docs are unclear on whether it can change over time, and my read of the mozilla writeup on etags is that it shouldn't depend on the range). Happy to work around it if you have ideas for doing so safely. The caching layer doesn't appear to support
If-Range
requests so I went down the road of verifying them myself.