pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0
3.54k stars 955 forks source link

Support Negative Offset Range Requests on files.pythonhosted.org #12823

Closed EmmaJaneBonestell closed 1 year ago

EmmaJaneBonestell commented 1 year ago

Describe the bug While absolute range headers (e.g. bytes=0-64) work properly, negative values in a range header return a 501 not implemented error (with a nonstandard message stating "Unsupported client range", even though range requests are otherwise supported.)

Expected behavior Negative values are supposed to return a content-range offset from the end of the fille to the end. I.e. bytes=-64 should give content-range={content-length - 64}-{content-length}.

To Reproduce Any range request may illustrate this. Here's a curl command.

curl -i -H "Range: bytes=-64" "https://files.pythonhosted.org/packages/09/bd/2410905c76ee14c62baf69e3f4aa780226c1bbfc9485731ad018e35b0cb5/pip-22.3.1-py3-none-any.whl"

Returns:

HTTP/2 501 
...
content-type: text/html; charset=utf-8
accept-ranges: bytes
...
content-length: 470

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>501 Unsupported client range</title>
  </head>
  <body>
    <h1>Error 501 Unsupported client range</h1>
    <p>Unsupported client range</p>
    <h3>Error 54113</h3>
    <p>Details: (removed for privacy) </p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>

Should return:

one-any.whl"
HTTP/2 206 
...
x-goog-stored-content-encoding: identity
x-goog-stored-content-length: 2051534
content-type: application/octet-stream
accept-ranges: bytes
...
access-control-allow-methods: GET, OPTIONS
access-control-allow-headers: Range
access-control-allow-origin: *
x-pypi-file-python-version: py3
x-pypi-file-version: 22.3.1
x-pypi-file-package-type: bdist_wheel
x-pypi-file-project: pip
content-range: bytes 2051470-2051533/2051534
content-length: 64

Additional context If this is explicitly not implemented for some reason, please consider returning a 405 (Method Not Allowed) response instead, since 501 implies the request is not recognized. 501 is appropriate. Ignore me on that (:

ewdurbin commented 1 year ago

HTTP 501 indicates "Not Implemented" so that seems semantically correct. Range support is implemented by our CDN. I don't think we'd consider this a bug per se.

I'm not sure how much (if any) control we have over changing this response for negative offsets, so I don't think this is likely to change.

EmmaJaneBonestell commented 1 year ago

@ewdurbin I understand. Thanks for the prompt response anyway. (And apologies for filing as a bug; I wasn't going to, but when checking if it was already reported, I saw a similar issue filed as a bug.)

njsmith commented 1 year ago

@ewdurbin So I just ran into this as well, and..... I didn't change my code. So this worked a few weeks ago, but just stopped working?

And I just checked some other package repos that use Fastly as their CDN, and they seem to be supporting it fine. It's only pythonhosted.org that's erroring out:

``` ❯ http https://rubygems.org/downloads/fastly-4.0.0.gem range:bytes=-10 HTTP/1.1 206 Partial Content Accept-Ranges: bytes Age: 3618959 Cache-Control: max-age=31536000 Connection: keep-alive Content-Length: 10 Content-Range: bytes 911862-911871/911872 Content-Type: binary/octet-stream Date: Fri, 20 Jan 2023 11:13:18 GMT ETag: "5dafe4c4090d8356f01c40482d859083" Last-Modified: Fri, 09 Dec 2022 13:56:06 GMT Server: RubyGems.org Via: 1.1 varnish, 1.1 varnish X-Backend: F_S3 3.5.82.178:443, fastlyshield--shield_ssl_cache_bfi_krnt7300089_BFI 167.82.143.89:443 X-Cache: HIT, HIT X-Cache-Hits: 22, 0 X-Served-By: cache-bfi-krnt7300089-BFI, cache-sjc10027-SJC X-Timer: S1674213198.372793,VS0,VE1 x-amz-id-2: adakGQsWWTCKqcLh6RJBZ6R3jsaF9eUvEjP2yTypfC/SkrY29jhRlKxDje9bThF23wvVhjA0DdTubxFlxyUe6Q== x-amz-replication-status: COMPLETED x-amz-request-id: H8SZZ3N4PV0DFHEQ x-amz-version-id: PZEhRw5eG5oNtEeU6EQMfTUWBtcN3ENT +-----------------------------------------+ | NOTE: binary data not shown in terminal | +-----------------------------------------+ ~ ❯ http https://hackage.haskell.org/package/ghc-9.4.4/ghc-9.4.4.tar.gz range:bytes=-10 HTTP/1.1 206 Partial Content Accept-Ranges: bytes Age: 411 Cache-Control: public, no-transform, max-age=2592000 Connection: keep-alive Content-Length: 10 Content-MD5: wcAzB7pxOV0QGMYvRVGapw== Content-Range: bytes 5125255-5125264/5125265 Content-Type: application/x-gzip Date: Fri, 20 Jan 2023 11:13:23 GMT ETag: "c1c03307ba71395d1018c62f45519aa7" Last-modified: Sat, 31 Dec 2022 18:44:10 GMT Server: nginx/1.18.0 (Ubuntu) Via: 1.1 varnish X-Cache: HIT X-Cache-Hits: 0 X-Served-By: cache-sjc10072-SJC X-Timer: S1674213203.447684,VS0,VE0 +-----------------------------------------+ | NOTE: binary data not shown in terminal | +-----------------------------------------+ ~ ❯ http https://files.pythonhosted.org/packages/6e/77/7b69133bf0f3a6b0000cdb6133ff5292734182ca0cd107ad7ff4c46e7bc1/numpy-1.24.1-cp310-cp310-macosx_10_9_x86_64.whl range:bytes=-10 HTTP/1.1 501 Unsupported client range Accept-Ranges: bytes Connection: close Content-Length: 463 Content-Type: text/html; charset=utf-8 Date: Fri, 20 Jan 2023 11:13:41 GMT Retry-After: 0 Server: Varnish Strict-Transport-Security: max-age=31536000; includeSubDomains; preload X-Cache: MISS X-Cache-Hits: 0 X-Content-Type-Options: nosniff X-Frame-Options: deny X-Permitted-Cross-Domain-Policies: none X-Robots-Header: noindex X-Served-By: cache-sjc10067-SJC X-Timer: S1674213222.925774,VS0,VE0 X-XSS-Protection: 1; mode=block 501 Unsupported client range

Error 501 Unsupported client range

Unsupported client range

Error 54113

Details: cache-sjc10067-SJC 1674213222 3968417666


Varnish cache server

```

So this seems to be a recent, PyPI-specific regression?

(Also as a side note, if there's an error here, it really should be 416 Range Not Satisfiable. The problem with 501 is that while it's not exactly wrong, it's a very generic "something is broken" error, so clients will usually give up, while code that's attempting range requests should have smart fallback logic for 416, like switching to requesting the whole file.)

di commented 1 year ago

@njsmith Any idea when it last worked? On Dec 12th we enabled segmented caching at the request of Fastly, and I wonder if this is a side effect.

ewdurbin commented 1 year ago

Hmmm, yep. Confirmed by reverting Segmented Caching rollout on test-files.pythonhosted.org temporarily, which restored negative offset support:

Before Segmented Caching:

$ curl -H"Range: -64" https://test-files.pythonhosted.org/packages/b7/b6/7e000d22d54f3c8939b366a0c85f2bbfcf973ce18b894a829b882a9109e5/hi-ml-0.1.1.post1159.tar.gz
Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.

After:

$ curl -H"Range: -64" https://test-files.pythonhosted.org/packages/b7/b6/7e000d22d54f3c8939b366a0c85f2bbfcf973ce18b894a829b882a9109e5/hi-ml-0.1.1.post1159.tar.gz

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>501 Unsupported client range</title>
  </head>
  <body>
    <h1>Error 501 Unsupported client range</h1>
    <p>Unsupported client range</p>
    <h3>Error 54113</h3>
    <p>Details: cache-bur-kbur8200118-BUR 1674228917 2094377810</p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>
ewdurbin commented 1 year ago

Fastly considered getting us moved over to Segmented Caching for files.pythonhosted.org urgent, so I don't see us reverting that.

dstufft commented 1 year ago

It's possible we can mention it to Fastly and it's something they might be able to implement on their end, but even if they were, no idea what time frame that would be.

njsmith commented 1 year ago

$ curl -H"Range: -64"

The (very confusing) syntax for range headers is Range: bytes=-64, so I think your pasted results are technically showing differences in error handling – the no-segmented-cache system responds to invalid Range: headers by sending the full body (a 200 response), while the segmented-cache system responds them by giving an error. But I think you're right that segmented caching must be the issue.

We probably should send Fastly a bug report about this? I feel like CDN companies generally like to know when their HTTP implementation has a bug, even if it's not a show-stopper :-).

BTW, the reason this matters is because it's useful for wheel METADATA fetches, like pip does here: https://github.com/pypa/pip/blob/main/src/pip/_internal/network/lazy_wheel.py To fetch a specific file out of a zip without downloading the whole thing, you start by reading the zip index that's at the end of the file, so the first thing you need to do is to fetch the end of the file. Pip's implementation does this by first doing a HEAD request to find out how long the file is, and then doing a Range request to get the end, but using two round trips for this is inefficient when you could accomplish both with a single negative range request. So... there is a workaround, and when PEP 658 (#8254) support lands it'll let us skip this whole dance anyway. And I guess it's lucky that pip was inefficient, because turning on Segmented Caching would have been a disaster!

(See also: https://github.com/gtsystem/python-remotezip/issues/15)

ewdurbin commented 1 year ago

Support request has been submitted to Fastly.

ewdurbin commented 1 year ago

Fastly support confirms that Segmented Caching does not support such negative offsets.

In addition they offered responses regarding suggested HTTP status codes for the response:

I got some details on the proposed responses from the linked Github page.

"405 Method Not Allowed" would not be the appropriate status code, its purpose is, quoting from https://www.rfc-editor.org/rfc/rfc7231#section-6.5.5, to indicate: that the method received in the request-line is known by the origin server but not supported by the target resource.

"416 Range Not Satisfiable" on the other hand would be valid, quoting from https://www.rfc-editor.org/rfc/rfc7233#section-4.4, it indicates: that none of the ranges in the request's Range header field (Section 3.1) overlap the current extent of the selected resource or that the set of ranges requested has been rejected due to invalid ranges or an excessive request of small or overlapping ranges.

Another valid handling would be to ignore an invalid Range and to send a status 200 response with the whole object, which is what e.g. apache httpd does in a case like this, but that would be a very wasteful response to what is basically an invalid request.

There is a wrinkle with sending a status 416 response because a valid status 416 response is supposed to have a Content-Range response header field with the size of the complete object filled in. That means that we need to access at least one fragment in order to be able to send a valid status 416 response. That again seems a bit wasteful a response to what is basically an invalid request.

ewdurbin commented 1 year ago

Seems our only recourse would be to write custom VCL to handle the requests, either

1) Disabling segmented caching for such requests, which would be problematic for Fastly (who asked us to enable it in the first place). 2) Disabling caching and passing these requests directly to our backends. Which would lead to much more traffic to arbitrary traffic to our origins, defeating the purpose of our CDN for files.pythonhosted.org. 3) Creating a synthetic response at the edge.

It seems the workaround making two requests (HEAD, then GET with Range:) is valid enough. Should we consider documenting that along side the lack of support for these negative offset requests and the potentially confusing error responses?

di commented 1 year ago

I'm trying to understand the use case here: is it just to retrieve metadata? Would working towards implementing PEP 658 be a better use of our time?

njsmith commented 1 year ago

That again seems a bit wasteful a response to what is basically an invalid request.

My engineer-brain insists that I point out: this is a weird way to describe an RFC-standard request that Fastly themselves support in their normal mode :-) Cf. RFC 7233 page 5:

   A client can request the last N bytes of the selected representation
   using a suffix-byte-range-spec.

     suffix-byte-range-spec = "-" suffix-length
     suffix-length = 1*DIGIT

is it just to retrieve metadata?

yes

Would working towards implementing PEP 658 be a better use of our time?

yes

dstufft commented 1 year ago

I'm going to close this. I don't think there's anything we can do here to move this issue forward, and PEP 658 is rolling out now which is a better way to handle this anyways.

cosmicexplorer commented 1 year ago

To corroborate the above, while pip will already prefer PEP 658 metadata when available, pypa/pip#12208 will fall back to performing a HEAD request to extract the Content-Length if it receives a 501 Not Implemented upon attempting a negative offset range (all according to the HTTP standard): see https://github.com/cosmicexplorer/pip/blob/e6f3f1aeec87fa70d4b8759478462e67a14b0a91/src/pip/_internal/network/lazy_wheel.py#L286-L303. Further range requests which explicitly specify the start and end of the range will then succeed. Non-pip resolvers which also want to use the fast-deps technique when PEP 658 metadata is not available are recommended to do the same.