docker pull failed with `connection reset by peer` or `i/o timeout`

leomao10 commented 1 year ago

Hi there,

I am Leo Liang from Bitbucket Pipelines team have had several users report failures trying to pull images from mcr.microsoft.com. We done some analysis of our logs and have found a combination of "connection reset by peer" and "i/o timeout" errors talking to both mcr.microsoft.com and eastus.data.mcr.microsoft.com.

The majority of errors are for mcr.microsoft.com, and mainly happen in our node in AWS us-east-1 region, and it constantly happen and we don't see abnormal spike of error rate.

Here is one of the tcp dump we capture from one of the failing build:

17:57:25.224404 IP 0cdf910b-4c56-4622-8622-3fccc7cbf3c5-nx7hm.57338 > 204.79.197.219.443: Flags [S], seq 2044995150, win 64240, options [mss 1460,sackOK,TS val 4164622281 ecr 0,nop,wscale 7], length 0
17:57:25.226511 IP 204.79.197.219.443 > 0cdf910b-4c56-4622-8622-3fccc7cbf3c5-nx7hm.57338: Flags [S.], seq 496447486, ack 2044995151, win 65535, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0
17:57:25.226533 IP 0cdf910b-4c56-4622-8622-3fccc7cbf3c5-nx7hm.57338 > 204.79.197.219.443: Flags [.], ack 1, win 502, length 0
17:57:25.226802 IP 0cdf910b-4c56-4622-8622-3fccc7cbf3c5-nx7hm.57338 > 204.79.197.219.443: Flags [P.], seq 1:250, ack 1, win 502, length 249
17:57:25.228257 IP 204.79.197.219.443 > 0cdf910b-4c56-4622-8622-3fccc7cbf3c5-nx7hm.57338: Flags [.], ack 250, win 16384, length 0
17:57:25.229435 IP 204.79.197.219.443 > 0cdf910b-4c56-4622-8622-3fccc7cbf3c5-nx7hm.57338: Flags [P.], seq 1:5929, ack 250, win 16384, length 5928
17:57:25.229455 IP 0cdf910b-4c56-4622-8622-3fccc7cbf3c5-nx7hm.57338 > 204.79.197.219.443: Flags [.], ack 5929, win 456, length 0
17:57:25.233892 IP 0cdf910b-4c56-4622-8622-3fccc7cbf3c5-nx7hm.57338 > 204.79.197.219.443: Flags [P.], seq 250:408, ack 5929, win 501, length 158
17:57:25.234814 IP 204.79.197.219.443 > 0cdf910b-4c56-4622-8622-3fccc7cbf3c5-nx7hm.57338: Flags [R], seq 496453415, win 0, length 0

Base on our understanding:

It started ok when Pipelines started the handshake and Microsoft acknowledged it as well
On the initial push of data from Pipelines, it was also acknowledged by Microsoft (see sequence 1:250)
However, after the second push of data from Pipelines (see sequence 250:408), it was not acknowledged by Microsoft but immediately terminated instead with the [R] flag which means the connection is reset/terminated abruptly. (see sequence 496453415)

Do you aware any existing networking issue between AWS and mcr.microsoft.com?

leomao10 commented 1 year ago

One thing we notice is all the failed events for last 7 days are related to ip 204.79.197.219:443

And we noticed there are similar networking issue mentioned this ip address, not sure if they are related : https://github.com/microsoft/containerregistry/issues/139 https://github.com/dotnet/core/issues/8268

sjg99 commented 1 year ago

Having the same issue with a kubernetes cluster hosted in AWS(us-east-1 also) at the moment of building the container.

Retrieving image mcr.microsoft.com/dotnet/sdk:6.0 from registry mcr.microsoft.com error building image: Get "https://mcr.microsoft.com/v2/dotnet/sdk/blobs/sha256:7d987f8db5482ed7d3fe8669b1cb791fc613d25e04a6cc31eed37677a6091a29": read tcp 10.0.2.246:55774->204.79.197.219:443: read: connection reset by peer

During last week was more ocasional that a pipeline would fail, but the last two days in my case has become almost permanent

akhtar-h-m commented 1 year ago

We are seeing similar issues on the last few days. Seems to fail maybe 25% of the time, so could be so load balanced node behind the ip that isat issue and if you just happened to hit it, then it doesn't work. Some of builds are trying to pull multiple times, so the overall builds fail a lot as it doesn't consitsently work across the whole build

devopshangikredi commented 1 year ago

Hello folks;

We have same issue as well since a couple of days. Here is the ERROR message from Jenkins pipeline:

   1 | >>> FROM mcr.microsoft.com/dotnet/aspnet:7.0-bullseye-slim AS base
   2 |     WORKDIR /app
   3 |     EXPOSE 80
--------------------
ERROR: failed to solve: mcr.microsoft.com/dotnet/aspnet:7.0-bullseye-slim: pulling from host mcr.microsoft.com failed with status code [manifests 7.0-bullseye-slim]: 503 Service Unavailable

By the way; issue is intermittent; we hit this issue %20 of the time.

For your information.