rust-lang / simpleinfra

Rust Infrastructure automation
MIT License

Connection reset by peer on static.rust-lang.org #339

Closed fredr closed 11 months ago

fredr commented 1 year ago

Page(s) Affected

Most likely everything on static.rust-lang.org is affected, but this is the page I've been testing with: https://static.rust-lang.org/dist/2023-04-15/channel-rust-nightly.toml.sha256

What needs to be fixed?

Over the last couple of weeks, our CI pipelines have started failing with "Connection reset by peer" when trying to install the nightly toolchain. Specifically, we've seen it when downloading these SHA-256 hash files.

This only happens when using IPv6.

I have noticed that static.rust-lang.org sometimes resolves to Fastly and sometimes to CloudFront. From my testing, this seems to happen only when it resolves to Fastly.

So my guess is that the combination of Fastly and IPv6 causes these errors.

I've been able to reproduce this issue, both from our build machines and from my own computer, by running a command like this:

while true; do curl -6 https://static.rust-lang.org/dist/2023-04-15/channel-rust-nightly.toml.sha256 --resolve 'static.rust-lang.org:443:2a04:4e42:200::649'; sleep 1; done

The -6 ensures that IPv6 is used, and the --resolve ensures that a specific Fastly IP is used (but the problem has been seen on all the different Fastly IPs).

Depending on which network I'm on, DNS doesn't always resolve static.rust-lang.org to Fastly. I'm not sure whether that depends on location or something else, but the curl command above pins a Fastly IP directly.

❯ while true; do curl -6 https://static.rust-lang.org/dist/2023-04-15/channel-rust-nightly.toml.sha256; sleep 1; done
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
curl: (35) Recv failure: Connection reset by peer
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
curl: (35) Recv failure: Connection reset by peer
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
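To put a number on how often the reset happens, the loop output can be piped through a small counter. This is only a sketch: `tally_resets` is a hypothetical helper name, and it assumes the output looks exactly like the lines above (checksum lines ending in `channel-rust-nightly.toml`, and curl errors containing "Connection reset by peer").

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original report): summarize loop output.
# Reads lines on stdin and counts checksum lines and connection resets.
tally_resets() {
  awk '
    /Connection reset by peer/    { resets++; next }
    /channel-rust-nightly\.toml$/ { ok++ }
    END { printf "ok=%d resets=%d\n", ok, resets }
  '
}

# Example run (needs network and IPv6, so it is left commented out).
# curl prints errors on stderr, hence the 2>&1; -sS hides the progress
# meter but still shows errors.
# for i in $(seq 1 100); do
#   curl -6 -sS https://static.rust-lang.org/dist/2023-04-15/channel-rust-nightly.toml.sha256 2>&1
#   sleep 1
# done | tally_resets
```

Running a fixed number of requests instead of `while true` makes the failure rate comparable between networks.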

The output from a failed curl with -vv added:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 2a04:4e42:600::649:443...
* Connected to static.rust-lang.org (2a04:4e42:600::649) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* OpenSSL SSL_connect: Connection reset by peer in connection to static.rust-lang.org:443
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Closing connection 0
* TLSv1.0 (OUT), TLS header, Unknown (21):
} [5 bytes data]
* TLSv1.3 (OUT), TLS alert, decode error (562):
} [2 bytes data]
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to static.rust-lang.org:443

PS. not sure if this is the correct place to report this?

Manishearth commented 1 year ago

cc @rust-lang/infra

No, this is just the main website. You can ask in t-infra on rust-lang.zulipchat.com

fredr commented 1 year ago

:+1: thanks, I'll open a topic there

jdno commented 1 year ago

Hi @fredr, thanks for reporting this! I moved the issue to the infra-team's repo and added it to our project board. Will try to reproduce this later today or tomorrow.

jdno commented 11 months ago

Hi @fredr,

Sorry for the long delay on this. I just tried reproducing the issue by running the command that you shared, but I don't get the connection resets from my network. 😬 Before diving deeper, do you still experience the issue or has it resolved itself over the past few weeks?

fredr commented 11 months ago

No worries, thanks for looking into it. I just ran the test, and got the reset after a few requests:

❯ while true; do curl -6 https://static.rust-lang.org/dist/2023-04-15/channel-rust-nightly.toml.sha256 --resolve 'static.rust-lang.org:443:2a04:4e42:200::649'; sleep 1; done
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
curl: (35) Recv failure: Connection reset by peer
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml

When you ran it, did you use this exact --resolve?

--resolve 'static.rust-lang.org:443:2a04:4e42:200::649'

It might be that DNS where you are resolves static.rust-lang.org to different addresses? I get different lookups depending on where I am, but from a node in one of our datacenters I get these two:

$ dig aaaa static.rust-lang.org +short
fastly-static.rust-lang.org.
dualstack.k.sni.global.fastly.net.
2a04:4e42::649
2a04:4e42:200::649
2a04:4e42:400::649
2a04:4e42:600::649
$ dig aaaa static.rust-lang.org +short
cloudfront-static.rust-lang.org.
d3ah34wvbudrdd.cloudfront.net.
2600:9000:2334:f400:5:26a9:7440:93a1
2600:9000:2334:9e00:5:26a9:7440:93a1
2600:9000:2334:1c00:5:26a9:7440:93a1
2600:9000:2334:b000:5:26a9:7440:93a1
2600:9000:2334:e00:5:26a9:7440:93a1
2600:9000:2334:1400:5:26a9:7440:93a1
2600:9000:2334:1200:5:26a9:7440:93a1
2600:9000:2334:5000:5:26a9:7440:93a1

And I get the problem with all of those Fastly IP addresses, but I haven't been able to reproduce it with the CloudFront addresses.
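To see at a glance which CDN a given lookup landed on, the addresses can be labeled by prefix. This is a rough sketch: `cdn_for_addr` is a hypothetical name, and the two prefixes are taken only from the dig output above, so treat the mapping as an assumption rather than a complete list of either provider's ranges.

```shell
#!/usr/bin/env bash
# Hypothetical helper: label an IPv6 address by CDN.
# ASSUMPTION: the prefixes come only from the dig output in this thread.
cdn_for_addr() {
  case "$1" in
    2a04:4e42:*) echo fastly ;;
    2600:9000:*) echo cloudfront ;;
    *)           echo unknown ;;   # CNAME lines and anything else
  esac
}

# Example (needs network, so it is left commented out):
# dig aaaa static.rust-lang.org +short | while read -r addr; do
#   printf '%s %s\n' "$(cdn_for_addr "$addr")" "$addr"
# done
```

The CNAME lines in the dig output (the ones ending in a dot) come out as "unknown", which is fine for a quick check.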

jdno commented 11 months ago

I copied and pasted your command to make sure it's the same. And the DNS records resolve to the same addresses for me as well. 😕

Do your build machines run in the same network as your computer? Or do they share the same internet service provider?

fredr commented 11 months ago

TIL that we are our own ISP :open_mouth:, and we use the same one from both the office and the data center.

We did a bit of testing on our end from another ISP, and with that one we don't seem to have the problem.

The difference between the two is that ours routes traffic via Arelion, and the other via NORDUnet. Feels like it could be some kind of routing problem somewhere, which is hard to debug.

Maybe it's worth opening an issue with Fastly, if you are Fastly customers? Otherwise it's not the end of the world: we have added lots of retries, and we'll see if there is anything we can change in how we route traffic to Fastly.
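For anyone else hitting this, a retry wrapper along these lines is one way to paper over the resets. It is a minimal sketch, not the exact setup from our CI: `retry` is a hypothetical function, and the attempt count and delay are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay between attempts.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
    delay=$((delay * 2))
  done
}

# Example (needs network, so it is left commented out):
# retry 5 1 curl -6 -fsS -o channel-rust-nightly.toml.sha256 \
#   https://static.rust-lang.org/dist/2023-04-15/channel-rust-nightly.toml.sha256
```

Since the resets happen during the TLS handshake, each retry opens a fresh connection, which is why a plain re-run of the command tends to succeed.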

jdno commented 11 months ago

Oh that is a very interesting TIL! 😮 Makes me miss my days working in networking...

I'll forward this issue to our contacts at Fastly to see if they have any idea on how to further debug this. Sadly, though, I expect that there's little that we can do from our side to help out here. But let me confirm this first...

Good to hear that you found a workaround for now, though. 👍

supine commented 11 months ago

@fredr TCP resets on connections to anycast addresses are usually a result of unstable ECMP.

Please double check for any ECMP load-balancing in your network and ensure it's configured as "per flow" with only source and destination IPs and ports used in the hash function.

If you continue to experience issues please contact support@fastly.com with the details and a link to this issue.
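As a rough illustration of what "per flow" means here: the router hashes the source and destination IPs and ports and uses the result to pick one of the equal-cost paths, so every packet of a given TCP connection takes the same path. If packets of one connection were spread over several paths, some could reach a different anycast node that knows nothing about the connection, which would typically answer with a reset. The sketch below is a toy model only: `ecmp_next_hop` is a hypothetical name, `cksum` stands in for whatever hash the router hardware actually uses, and the 2001:db8:: source address is a documentation placeholder.

```shell
#!/usr/bin/env bash
# Toy model of per-flow ECMP (illustration only, not a router implementation).
# Usage: ecmp_next_hop SRC_IP SRC_PORT DST_IP DST_PORT N_PATHS
# Prints the index (0 .. N_PATHS-1) of the path chosen for this flow.
ecmp_next_hop() {
  local src_ip="$1" src_port="$2" dst_ip="$3" dst_port="$4" n_paths="$5"
  local hash
  # ASSUMPTION: cksum (a CRC) is a stand-in for the real hardware hash.
  hash=$(printf '%s|%s|%s|%s' "$src_ip" "$src_port" "$dst_ip" "$dst_port" \
    | cksum | cut -d' ' -f1)
  echo $((hash % n_paths))
}

# The same flow always maps to the same path index:
# ecmp_next_hop 2001:db8::1 50000 2a04:4e42:200::649 443 4
# ecmp_next_hop 2001:db8::1 50000 2a04:4e42:200::649 443 4
```

Because only the four flow fields go into the hash, the choice stays stable for the lifetime of the connection; a new connection from a different source port may land on a different path.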

fredr commented 11 months ago

We have double checked our ECMP, and it was configured correctly (we also didn't have problems with other CDNs).

But, for whatever reason, we can no longer reproduce the problem, so hopefully it magically solved itself. I'll send an email to that address if it resurfaces.

Thank you both for looking into this! Much appreciated