paulswartz opened this issue 3 months ago
I'd love a repro case here, not of the application itself, but of the configuration it took within AWS to get here! I'm willing to bet it rhymes with #306 and/or #338; likely a bug in who / how content-length headers get set.
For us, it's:

- HTTP2 target version to talk to the ECS Fargate tasks

Good response:

Bad response:
Server logs:
023cd54e6718 22:47:13.750 [info] state=sent method=GET path=/assets/app-b49524acc1698e2440c53a73d41aec25.js params=[UNFETCHED] status=200 headers={"user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:129.0) Gecko/20100101 Firefox/129.0"} duration=0.248
023cd54e6718 22:47:14.649 [info] state=sent method=GET path=/assets/app-b49524acc1698e2440c53a73d41aec25.js params=[UNFETCHED] status=200 headers={"user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:129.0) Gecko/20100101 Firefox/129.0"} duration=0.234
From cURL:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Host XXX:443 was resolved.
* IPv6: (none)
* IPv4: XXX, XXX, XXX
* Trying XXX:443...
* Connected to XXX (XXX) port 443
* ALPN: curl offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [104 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [4959 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256 / secp256r1 / rsaEncryption
* ALPN: server accepted h2
* Server certificate:
* subject: CN=XXX
* start date: Mar 27 00:00:00 2024 GMT
* expire date: Apr 25 23:59:59 2025 GMT
* subjectAltName: host "XXX" matched cert's "XXX"
* issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M03
* SSL certificate verify ok.
* Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* Certificate level 2: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
{ [5 bytes data]
* using HTTP/2
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* [HTTP/2] [1] OPENED stream for XXX/assets/app-b49524acc1698e2440c53a73d41aec25.js?vsn=d
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: XXX]
* [HTTP/2] [1] [:path: /assets/app-b49524acc1698e2440c53a73d41aec25.js?vsn=d]
* [HTTP/2] [1] [user-agent: curl/8.9.1]
* [HTTP/2] [1] [accept: */*]
} [5 bytes data]
> GET /assets/app-b49524acc1698e2440c53a73d41aec25.js?vsn=d HTTP/2
> Host: XXX
> User-Agent: curl/8.9.1
> Accept: */*
>
* Request completely sent off
{ [5 bytes data]
< HTTP/2 200
< server: awselb/2.0
< date: Tue, 20 Aug 2024 22:59:26 GMT
< content-type: text/javascript
< content-length: 270919
< vary: Accept-Encoding
< cache-control: public, max-age=31536000, immutable
< accept-ranges: bytes
<
{ [8192 bytes data]
* HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
90 264k 90 238k 0 0 870k 0 --:--:-- --:--:-- --:--:-- 870k
* Connection #0 to host XXX left intact
curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
Thanks! That's super helpful. Any notable logs on the bandit side?
Nope, nothing from Bandit other than the two request logs. If it's helpful, they're from Logster.Plug and run as a Plug.Conn.register_before_send/2 callback.
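For context, a minimal sketch of that mechanism (illustrative module and log fields, not Logster's actual plug):

```elixir
defmodule LogBeforeSend do
  # Sketch of before-send request logging, the mechanism the logs above come
  # from. Plug.Conn.register_before_send/2 queues a callback that runs once
  # the response is set but before Bandit writes it, so the status is
  # available for logging.
  import Plug.Conn
  require Logger

  def init(opts), do: opts

  def call(conn, _opts) do
    register_before_send(conn, fn conn ->
      Logger.info("state=sent method=#{conn.method} path=#{conn.request_path} status=#{conn.status}")
      conn
    end)
  end
end
```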
I've been trying most of the afternoon to reproduce this and I can't manage to make it happen. I can get a repro of a somewhat related issue that I'm noting in a separate comment here, but it manifests as 500s downstream and not truncated responses.
It would also be useful to know if:
In trying to repro this issue I managed to find a possibly related issue, wherein an ALB target group configured to use HTTP/2 fails to make multiple requests on a single upstream connection. Some details:
Unsure how to investigate this further. Without visibility into ALB's upstream client code, or into the errors it's seeing, we're kind of flying blind here. If I had to guess based on the symptoms at hand, it may be related to response header compression contexts, since that's one of the only things shared between streams within a connection that would affect subsequent streams and not the original one, and it also doesn't vary with response size as windows do.
Got it. It's a header compression bug of some sort, likely upstream in HPAX (or possibly in AWS' HTTP/2 client implementation). Working to characterize it more precisely but it's 100% reproducible.
Here's a tcpdump capture of a hacked-up version of Bandit that doesn't return any headers (other than the :status: 200 pseudo-header, which is statically defined in HPACK). Note that the client keeps the connection open, as indicated by its continuing to make subsequent requests on streams within the same connection. So obviously, the ALB client has no issues with this.
Now, here's a tcpdump capture of an otherwise identical version of Bandit that adds two copies of the header foo: bar to the same response. You can see that the ALB client resets the connection immediately, suggesting that it sees something wrong here that doesn't come up in the previous case.
(FWIW, I've also run tests with another statically defined header (accept-encoding: gzip, deflate) and it also succeeded. So it's almost certainly something with either the way we encode literal headers, or the way ALB decodes them.)
Of note, Wireshark only shows one copy of the foo: bar header in its decode (and in the case of subsequent stream requests with similar headers, omits them there as well). This suggests that Wireshark may have issues with our encoding too.
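For context on why only subsequent streams would break: HPACK keeps one dynamic table per connection on each side, and entries stored while encoding one stream's headers are what later streams' indexed references point at. A minimal round trip against the HPAX API (a sketch; return shapes per HPAX's documented encode/decode):

```elixir
# Encoder (server side) and decoder (ALB side) each hold a table; they must
# stay in lockstep across every stream on the connection.
encoder = HPAX.new(4096)
decoder = HPAX.new(4096)

# Stream 1: "foo: bar" goes out as a literal and is stored in both tables.
{encoded, encoder} = HPAX.encode([{:store, "foo", "bar"}], encoder)
{:ok, [{"foo", "bar"}], decoder} = HPAX.decode(IO.iodata_to_binary(encoded), decoder)

# Stream 2: the encoder now emits a short index into its dynamic table
# instead of the literal; this only decodes if the tables are still in sync.
{encoded, _encoder} = HPAX.encode([{:store, "foo", "bar"}], encoder)
{:ok, [{"foo", "bar"}], _decoder} = HPAX.decode(IO.iodata_to_binary(encoded), decoder)
```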
Fixed it. We were resizing the HPAX compression table's contents but not its max size when we received client-side SETTINGS frames (this manifests here because the ALB client sets its max header table size to zero as part of its initial settings sync).
A hacky fix is up at https://github.com/mtrudel/bandit/tree/aws_investigation if you're so inclined to test. I'm gonna bring this up with @whatyouhide to see if this makes more sense to have as a function on HPAX itself before landing a proper fix on Bandit.
~To that end: @whatyouhide, I think it makes more sense to have an HPAX.change_max_size/2 function that both evicts from the table and also updates the max table size as indicated. It could also make sense as a flag to HPAX.resize/2. Happy to raise an issue over there / do the work for this.~ Created https://github.com/elixir-mint/hpax/issues/18 to better capture the discussion.
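To make the failure mode concrete, here's a sketch of the desync using HPAX directly (the exact return values are illustrative and may differ by HPAX version): the ALB announces SETTINGS_HEADER_TABLE_SIZE = 0, so its decoder holds nothing, while a server that never adopts that limit keeps indexing into a table the peer doesn't have.

```elixir
encoder = HPAX.new(4096)  # the bug: the encoder never adopts the peer's limit
decoder = HPAX.new(0)     # the ALB announced a zero-size header table

# The first occurrence goes out as a literal, so the zero-size decoder copes
# (the entry is simply too big to store and is dropped, which is legal).
{first, encoder} = HPAX.encode([{:store, "foo", "bar"}], encoder)
{:ok, _headers, decoder} = HPAX.decode(IO.iodata_to_binary(first), decoder)

# The second occurrence is encoded as an index into the encoder's dynamic
# table; the decoder never stored that entry, so decoding fails and the
# client tears down the connection, as seen in the capture above.
{second, _encoder} = HPAX.encode([{:store, "foo", "bar"}], encoder)
{:error, _reason} = HPAX.decode(IO.iodata_to_binary(second), decoder)
```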
I've also failed to switch ALB over to h2 when talking to Bandit in a very similar setup (ALB to ECS containers), and might be able to test this.
Can confirm it works in my setup: AWS ALB with HTTP/2 enabled for the backend, pointing to a target group of ECS containers.
Great to hear a second opinion! I'm going to keep working it over on HPAX
My app is using Phoenix channels / live_view, and the upgrade to wss over h2c returns a 464. I guess Bandit does not support WebSockets over h2c?
Correct. RFC 8441 support is coming 'soon' (in fact, most of the refactoring work I've been doing for the past several months has been with the express purpose of supporting it). No promises, but likely some time this winter.
Sorry, I guess the wording was a bit off; it was not a feature request 🙂
It's a missing piece for supporting Phoenix apps behind a proxy doing TLS termination.
I recently tried running a Bandit-based Phoenix application with H2C behind an AWS Application Load Balancer. In this mode, I would almost never receive a full response, even for static files served by Plug.Static.

Some alternatives which did work, making me think that the issue is with Bandit's H2C implementation:
I don't have much in the way of logs at the moment, but if there's anything that would be helpful, I'm happy to give it a shot!
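For reference, the shape of the setup in question, as a sketch (app and endpoint names are placeholders, and exact options vary by Phoenix/Bandit version): a Phoenix endpoint on Bandit serving plaintext HTTP, with the ALB terminating TLS and speaking h2c on the upstream leg.

```elixir
import Config

# Sketch: Bandit listens for plaintext traffic; the ALB target group's
# protocol version is set to HTTP2, so upstream requests arrive as h2c.
config :my_app, MyAppWeb.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  http: [ip: {0, 0, 0, 0}, port: 4000]
```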