problem fetching metadata

bewinsnw commented 1 month ago

I'm running in Docker and just proxying to PyPI: docker build -t simple . && docker run -it --rm -p 9191:8000 simple https://pypi.org/simple/

But pip errors out fetching metadata. Doing this by hand with curl you can see the metadata body does not stream:

$ curl -v http://localhost:9191/resources/pip/pip-24.1b1-py3-none-any.whl.metadata
*   Trying [::1]:9191...
* Connected to localhost (::1) port 9191
> GET /resources/pip/pip-24.1b1-py3-none-any.whl.metadata HTTP/1.1
> Host: localhost:9191
> User-Agent: curl/8.4.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< date: Fri, 24 May 2024 18:18:21 GMT
< server: uvicorn
< connection: keep-alive
< content-length: 1298
< server: nginx
< content-type: application/octet-stream
< last-modified: Mon, 06 May 2024 20:49:10 GMT
< etag: "71912d8b4ad8713b7a44242dd311c57a"
< x-amz-request-id: 6d382d8e8035375b
< x-amz-id-2: aN/djTDE5NtxmMzHtMHBkZWZjYwMwTzgz
< x-amz-version-id: 4_z179c51e67f11a0ad8f6c0018_f117117514458507d_d20240506_m204910_c005_v0501020_t0003_u01715028550339
< content-encoding: gzip
< cache-control: max-age=365000000, immutable, public
< accept-ranges: bytes
< date: Fri, 24 May 2024 18:18:22 GMT
< age: 705922
< x-served-by: cache-iad-kcgs7200170-IAD, cache-lcy-eglc8600033-LCY
< x-cache: HIT, HIT
< x-cache-hits: 74, 1
< x-timer: S1716574702.173255,VS0,VE2
< vary: Accept-Encoding
< strict-transport-security: max-age=31536000; includeSubDomains; preload
< x-frame-options: deny
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< x-robots-header: noindex
< access-control-allow-methods: GET, OPTIONS
< access-control-allow-headers: Range
< access-control-allow-origin: *
< x-pypi-file-python-version: py3
< x-pypi-file-version: 24.1b1
< x-pypi-file-package-type: bdist_wheel
< x-pypi-file-project: pip
< 
* transfer closed with 1298 bytes remaining to read
* Closing connection
curl: (18) transfer closed with 1298 bytes remaining to read

Here's the log from the container

INFO:     192.168.65.1:64969 - "GET /resources/pip/pip-24.1b1-py3-none-any.whl.metadata HTTP/1.1" 200 OK
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/simple/venv/lib/python3.11/site-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/home/simple/venv/lib/python3.11/site-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/home/simple/venv/lib/python3.11/site-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "/home/simple/venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
    await self.message_event.wait()
  File "/usr/local/lib/python3.11/asyncio/locks.py", line 213, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope ffff9991b490

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/home/simple/venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/simple/venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
  |     return await self.app(scope, receive, send)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/simple/venv/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
  |     raise exc
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
  |     raise exc
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
  |     await route.handle(scope, receive, send)
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
  |     await self.app(scope, receive, send)
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
  |     raise exc
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/routing.py", line 75, in app
  |     await response(scope, receive, send)
  |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/responses.py", line 258, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/home/simple/venv/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/responses.py", line 261, in wrap
    |     await func()
    |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/responses.py", line 253, in stream_response
    |     await send({"type": "http.response.body", "body": chunk, "more_body": True})
    |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 50, in sender
    |     await send(message)
    |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 50, in sender
    |     await send(message)
    |   File "/home/simple/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 161, in _send
    |     await send(message)
    |   File "/home/simple/venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 541, in send
    |     raise RuntimeError("Response content longer than Content-Length")
    | RuntimeError: Response content longer than Content-Length
    +------------------------------------

Curling the upstream directly (curl -v https://files.pythonhosted.org/packages/1e/65/22725f8ba583376d0c300c3b9b52b9a67cfd93d786a80be73c167e45abc8/pip-24.1b1-py3-none-any.whl.metadata) works just fine

I also compared the size of the content returned from upstream with the value in its content-length header, and they match. I'm not sure what it's complaining about yet; I'll dig in further later.

bewinsnw commented 1 month ago

I see what's going on here now. simple-repository sends an Accept-Encoding header that allows gzip compression, so the content-length header it gets back is the length of the gzipped body. But when it streams the response to the client, it streams the decompressed body, which is longer than the declared content-length, and that trips up uvicorn.
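
The mismatch can be reproduced without any network access. The following is a minimal sketch, assuming a made-up metadata body (the real file's contents and sizes differ); it only illustrates why the declared length is smaller than what gets streamed once the body has been decompressed.

```python
import gzip

# Hypothetical stand-in for the wheel metadata body; the real file differs.
body = b"Metadata-Version: 2.1\nName: pip\nVersion: 24.1b1\n" * 40

# What upstream puts on the wire when the request allowed gzip.
compressed = gzip.compress(body)

# Upstream advertises the length of the gzip-encoded bytes.
upstream_headers = {
    "Content-Length": str(len(compressed)),
    "Content-Encoding": "gzip",
}

# A client that decompresses automatically hands back the decoded body,
# which is what the proxy then tries to stream to its own client.
decoded = gzip.decompress(compressed)

declared = int(upstream_headers["Content-Length"])
mismatch = len(decoded) > declared
print(f"declared={declared} streamed={len(decoded)} mismatch={mismatch}")
```

Forwarding `upstream_headers` unchanged while sending `decoded` is the situation uvicorn rejects with "Response content longer than Content-Length".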

I dumped the headers inside http_response_iterator.py:

<CIMultiDictProxy('Connection': 'keep-alive', 'Content-Length': '1298', 'Server': 'nginx', 'Content-Type': 'application/octet-stream', 'Last-Modified': 'Mon, 06 May 2024 20:49:10 GMT', 'Etag': '"71912d8b4ad8713b7a44242dd311c57a"', 'x-amz-request-id': '6d382d8e8035375b', 'x-amz-id-2': 'aN/djTDE5NtxmMzHtMHBkZWZjYwMwTzgz', 'x-amz-version-id': '4_z179c51e67f11a0ad8f6c0018_f117117514458507d_d20240506_m204910_c005_v0501020_t0003_u01715028550339', 'Content-Encoding': 'gzip', 'Cache-Control': 'max-age=365000000, immutable, public', 'Accept-Ranges': 'bytes', 'Date': 'Sat, 25 May 2024 10:21:17 GMT', 'Age': '763696', 'X-Served-By': 'cache-iad-kcgs7200170-IAD, cache-lcy-eglc8600066-LCY', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '85, 1', 'X-Timer': 'S1716632477.124754,VS0,VE1', 'Vary': 'Accept-Encoding', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'X-Frame-Options': 'deny', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'X-Robots-Header': 'noindex', 'Access-Control-Allow-Methods': 'GET, OPTIONS', 'Access-Control-Allow-Headers': 'Range', 'Access-Control-Allow-Origin': '*', 'x-pypi-file-python-version': 'py3', 'x-pypi-file-version': '24.1b1', 'x-pypi-file-package-type': 'bdist_wheel', 'x-pypi-file-project': 'pip')>

Either we need to stream the raw (still compressed) response back from the server, or the 'Content-Length': '1298' and 'Content-Encoding': 'gzip' headers need to be dropped (along with Accept-Ranges, since simple-repository-server doesn't support range requests).

aiohttp.ClientSession can be called with auto_decompress=False, but this will cause code elsewhere to fail: fetch_simple_page tries to treat a PEP 503 body as text, which it won't be if the body is still compressed. Proxying the compressed response as-is is also wrong when the client didn't request compression.

So it's better not to proxy the upstream headers at all, except for a handful (Content-Type, Last-Modified, Etag, Cache-Control, Date, Age and Vary seem like a reasonable set), and to rely on uvicorn to chunk the response. In HttpResponseIterator I changed:

                iterator.status_code, iterator.headers = resp.status, resp.headers
                # The first time that anext is called, set stauts_code and

to

                iterator.status_code = resp.status
                proxy_headers = ["content-type", "last-modified", "etag", "cache-control", "date", "age", "vary"]
                iterator.headers = {k: v for k, v in resp.headers.items() if k.lower() in proxy_headers}
                # The first time that anext is called, set status_code and

The response headers from uvicorn were now:

< HTTP/1.1 200 OK
< date: Sat, 25 May 2024 10:55:24 GMT
< server: uvicorn
< content-type: application/octet-stream
< last-modified: Mon, 06 May 2024 20:49:10 GMT
< etag: "71912d8b4ad8713b7a44242dd311c57a"
< cache-control: max-age=365000000, immutable, public
< date: Sat, 25 May 2024 10:55:25 GMT
< age: 765745
< vary: Accept-Encoding
< transfer-encoding: chunked

and the metadata downloaded.
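
For anyone who wants to try the allow-list idea outside the server, here is a standalone sketch. The function name `filter_proxy_headers` and the constant `PROXY_HEADERS` are hypothetical and not part of simple-repository; the sample headers are a subset of the upstream response shown earlier.

```python
from typing import Mapping

# Allow-list from the change described above; names are compared lower-cased.
PROXY_HEADERS = frozenset(
    {"content-type", "last-modified", "etag", "cache-control", "date", "age", "vary"}
)


def filter_proxy_headers(upstream: Mapping[str, str]) -> dict[str, str]:
    """Keep only headers that remain true after the body has been decompressed."""
    return {k: v for k, v in upstream.items() if k.lower() in PROXY_HEADERS}


# A subset of the upstream headers from the dump above.
upstream = {
    "Content-Length": "1298",
    "Content-Encoding": "gzip",
    "Accept-Ranges": "bytes",
    "Content-Type": "application/octet-stream",
    "Etag": '"71912d8b4ad8713b7a44242dd311c57a"',
    "Vary": "Accept-Encoding",
}

filtered = filter_proxy_headers(upstream)
print(sorted(filtered))
```

With Content-Length gone, the server is free to fall back to chunked transfer encoding, which matches the response headers shown above.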

pelson commented 1 week ago

Thanks for the clear and reproducible example. This didn't get resolved with the move to httpx, and indeed the auto decompression option appears not to exist with httpx (https://github.com/encode/httpx/discussions/2220#discussion-4063893).

The ideal approach is that we pass the original request headers through to the proxied request, and then don't tamper with the results. In this way, we will also support range requests correctly.
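
A rough sketch of the request-side half of that idea follows. It is an illustration only and not necessarily how the server does it: the helper name `upstream_request_headers` is hypothetical, and the exclusion set (the standard hop-by-hop headers plus Host) is an assumption about what a proxy would normally hold back.

```python
from typing import Mapping

# Hop-by-hop headers describe a single connection and are not forwarded.
# "host" is excluded too (an assumption) so the HTTP client can set it
# for the upstream server.
NOT_FORWARDED = frozenset({
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade", "host",
})


def upstream_request_headers(client_headers: Mapping[str, str]) -> dict[str, str]:
    """Select the incoming request headers to pass through to upstream."""
    return {k: v for k, v in client_headers.items() if k.lower() not in NOT_FORWARDED}


# Headers similar to the curl request above, plus a Range header.
client_headers = {
    "Host": "localhost:9191",
    "User-Agent": "curl/8.4.0",
    "Accept": "*/*",
    "Range": "bytes=0-99",
    "Connection": "keep-alive",
}

forwarded = upstream_request_headers(client_headers)
print(forwarded)
```

Because this client sent no Accept-Encoding, none is forwarded, so upstream has no reason to compress. HTTP client libraries often add their own default Accept-Encoding, so a real proxy would likely need to override that default as well.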

pelson commented 6 days ago

Should now be resolved in v0.6.0. It is resolved by passing the request headers down to the child request. I had to hack httpx to avoid it decoding the response in the stream (now thoroughly tested). Please let me know how it works out for you with this release! Closing for now, but don't hesitate to re-open if not fully resolved.

simple-repository / simple-repository-server

problem fetching metadata #4