microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps
MIT License

TCP connection issues between private link endpoint and ACA environment - offer of debugging help #1199

Open MatthewWilkes opened 2 weeks ago

MatthewWilkes commented 2 weeks ago

Issue description

On two of our six ACA environments that are offered through a private link service, we had simultaneous elevated TCP connection failures.

We explored this issue with Azure support. We initially thought it was a Front Door failure, but have since learnt that it was a failure related to the combination of private link and ACA apps.

Azure support was unable to resolve this problem. We have mitigated it by redeploying these environments in their entirety, but we do not know what the original cause was.

Steps to reproduce

We are not aware of any way to reproduce this. However, we have two environments that exhibit this behaviour and have not yet been removed. We will remove them towards the end of next week; if you would like a chance to examine them before then, please contact me.

Expected behavior

Connections work through private link reliably. An example connection session is:

mwilkes@debug-mw-1:~$ sudo tcpdump -nn host 10.0.0.5 &
[1] 3148
mwilkes@debug-mw-1:~$ tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes

mwilkes@debug-mw-1:~$ curl -vvIH "Host: intranet.REDACTED.uksouth.azurecontainerapps.io" https://10.0.0.5/
*   Trying 10.0.0.5:443...
* Connected to 10.0.0.5 (10.0.0.5) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
16:32:04.346708 IP 10.0.0.4.52236 > 10.0.0.5.443: Flags [S], seq 2475614277, win 64240, options [mss 1460,sackOK,TS val 3543838919 ecr 0,nop,wscale 7], length 0
16:32:04.348447 IP 10.0.0.5.443 > 10.0.0.4.52236: Flags [S.], seq 2152401610, ack 2475614278, win 65160, options [mss 1460,sackOK,TS val 3598468179 ecr 3543838919,nop,wscale 7], length 0
16:32:04.348479 IP 10.0.0.4.52236 > 10.0.0.5.443: Flags [.], ack 1, win 502, options [nop,nop,TS val 3543838921 ecr 3598468179], length 0
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: Connection reset by peer in connection to 10.0.0.5:443
* Closing connection 0
* TLSv1.0 (OUT), TLS header, Unknown (21):
* TLSv1.3 (OUT), TLS alert, decode error (562):
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to 10.0.0.5:443
mwilkes@debug-mw-1:~$ 16:32:04.396349 IP 10.0.0.4.52236 > 10.0.0.5.443: Flags [P.], seq 1:518, ack 1, win 502, options [nop,nop,TS val 3543838969 ecr 3598468179], length 517
16:32:04.398348 IP 10.0.0.5.443 > 10.0.0.4.52236: Flags [.], ack 518, win 506, options [nop,nop,TS val 3598468229 ecr 3543838969], length 0
16:32:04.398358 IP 10.0.0.5.443 > 10.0.0.4.52236: Flags [R.], seq 1, ack 518, win 506, options [nop,nop,TS val 3598468229 ecr 3543838969], length 0
16:32:04.485113 IP 10.0.0.5.443 > 10.0.0.4.56684: Flags [R.], seq 120628004, ack 0, win 0, length 0

Actual behavior

SYN packets were not answered with SYN/ACK; instead, RST/ACK packets arrived intermittently.

mwilkes@debug-mw-1:~$ sudo tcpdump -nn host 10.0.0.5 &
[1] 3142
mwilkes@debug-mw-1:~$ tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes

mwilkes@debug-mw-1:~$ curl -vvIH "Host: intranet.REDACTED.uksouth.azurecontainerapps.io" https://10.0.0.5/
*   Trying 10.0.0.5:443...
16:26:19.710195 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543494283 ecr 0,nop,wscale 7], length 0
16:26:20.739946 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543495313 ecr 0,nop,wscale 7], length 0
16:26:21.763941 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543496337 ecr 0,nop,wscale 7], length 0
16:26:22.787980 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543497361 ecr 0,nop,wscale 7], length 0
16:26:23.811983 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543498385 ecr 0,nop,wscale 7], length 0
16:26:24.739964 ARP, Request who-has 10.0.0.5 tell 10.0.0.4, length 28
16:26:24.740581 ARP, Reply 10.0.0.5 is-at 12:34:56:78:9a:bc, length 28
16:26:24.835929 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543499409 ecr 0,nop,wscale 7], length 0
16:26:26.851942 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543501425 ecr 0,nop,wscale 7], length 0
16:26:30.883986 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543505457 ecr 0,nop,wscale 7], length 0
16:26:36.230185 IP 10.0.0.5.443 > 10.0.0.4.51538: Flags [R.], seq 1539792900, ack 0, win 0, length 0
16:26:39.075983 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543513649 ecr 0,nop,wscale 7], length 0
16:26:44.467581 IP 10.0.0.5.443 > 10.0.0.4.51538: Flags [R.], seq 0, ack 1, win 0, length 0
16:26:55.203948 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543529777 ecr 0,nop,wscale 7], length 0
16:27:00.323978 ARP, Request who-has 10.0.0.5 tell 10.0.0.4, length 28
16:27:00.324684 ARP, Reply 10.0.0.5 is-at 12:34:56:78:9a:bc, length 28
16:27:00.682755 IP 10.0.0.5.443 > 10.0.0.4.51538: Flags [R.], seq 0, ack 1, win 0, length 0
16:27:27.716014 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543562289 ecr 0,nop,wscale 7], length 0
16:27:32.835957 ARP, Request who-has 10.0.0.5 tell 10.0.0.4, length 28
16:27:32.836701 ARP, Reply 10.0.0.5 is-at 12:34:56:78:9a:bc, length 28
16:27:33.092506 IP 10.0.0.5.443 > 10.0.0.4.51538: Flags [R.], seq 0, ack 1, win 0, length 0
* connect to 10.0.0.5 port 443 failed: Connection timed out
* Failed to connect to 10.0.0.5 port 443 after 133542 ms: Connection timed out
* Closing connection 0
curl: (28) Failed to connect to 10.0.0.5 port 443 after 133542 ms: Connection timed out
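
For longer captures, the pattern above (SYNs answered only by RSTs, never by a SYN/ACK) can be tallied mechanically. The following is a minimal sketch, assuming `tcpdump -nn` text output in the format shown in this issue; the names `summarize`, `classify` and `SAMPLE` are made up for illustration and are not part of any Azure or tcpdump tooling.

```python
import re
from collections import Counter

# Matches tcpdump -nn TCP lines such as:
#   16:26:19.710195 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq ...
# search() is used later because a shell prompt can precede the timestamp
# when tcpdump runs in the background.
LINE_RE = re.compile(
    r"(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) IP "
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): "
    r"Flags \[(?P<flags>[^\]]+)\]"
)


def classify(flags):
    """Map a tcpdump flag string to a coarse packet type."""
    if flags == "S":
        return "SYN"
    if flags == "S.":
        return "SYN-ACK"
    if "R" in flags:
        return "RST"
    return "other"


def summarize(capture_text):
    """Count packet types over all TCP lines; ARP and curl output are skipped."""
    counts = Counter()
    for line in capture_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            counts[classify(match.group("flags"))] += 1
    return counts


# Short excerpt taken from the failing capture in this issue.
SAMPLE = """\
16:26:19.710195 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543494283 ecr 0,nop,wscale 7], length 0
16:26:20.739946 IP 10.0.0.4.51538 > 10.0.0.5.443: Flags [S], seq 1539792899, win 64240, options [mss 1460,sackOK,TS val 3543495313 ecr 0,nop,wscale 7], length 0
16:26:24.739964 ARP, Request who-has 10.0.0.5 tell 10.0.0.4, length 28
16:26:36.230185 IP 10.0.0.5.443 > 10.0.0.4.51538: Flags [R.], seq 1539792900, ack 0, win 0, length 0
"""

print(summarize(SAMPLE))
```

A result with SYN and RST entries but no SYN-ACK entry corresponds to the failing session; the working session contains one SYN-ACK.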

Additional context

Please see this diagram of the connections that work and those that do not. Green lines indicate good connectivity and red lines poor connectivity. The VM at the bottom of the diagram is within the same vnet as the ACA app, so it accesses the app directly; the one at the top is in a different vnet.

image

Replacing the private link service and the private endpoint was not sufficient to restore service, which implies this is not a private link issue, but an issue with the combination of the ACA environment and the private link service.

Only upon replacing the two ACA envs did the problem end.
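
To put a number on "good" versus "poor" connectivity for each path in the diagram, a repeated connection probe could be run from each VM. This is only a sketch under stated assumptions: `probe` is a hypothetical helper, it tests the TCP handshake only (so it would not catch a reset that arrives after the TLS Client Hello), and the timeout is an arbitrary choice.

```python
import socket
import time
from collections import Counter


def probe(host, port, attempts=20, timeout=5.0, pause=0.0):
    """Open `attempts` TCP connections and count how each one ends."""
    outcomes = Counter()
    for _ in range(attempts):
        try:
            # create_connection performs the full SYN / SYN-ACK / ACK handshake.
            with socket.create_connection((host, port), timeout=timeout):
                outcomes["connected"] += 1
        except socket.timeout:
            # No usable reply within `timeout` seconds.
            outcomes["timeout"] += 1
        except (ConnectionRefusedError, ConnectionResetError):
            # The peer (or something on the path) answered with an RST.
            outcomes["reset"] += 1
        except OSError:
            # Anything else, for example no route to host.
            outcomes["other_error"] += 1
        if pause:
            time.sleep(pause)
    return outcomes
```

For example, `probe("10.0.0.5", 443)` from the VM in the isolated vnet and `probe("10.1.2.152", 443)` from a VM inside the application vnet would give comparable counts for the private endpoint path and the direct load balancer path.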

chinadragon0515 commented 1 week ago

@MatthewWilkes

  1. I want to confirm: this is a consumption-only container app environment, right?
  2. Can you help me understand what the IP 10.0.0.5 is for? Is it the IP of the private endpoint for the PLS? And what is the IP 10.0.0.4 for?
  3. What is the front-end IP of the load balancer kubernetes-internal? When you see the issue via the PE, are you able to access the container app via the load balancer IP directly?
  4. Does your setup have direct access to the container app? Do you see any issue when accessing the container app via the load balancer IP directly?
  5. Can you send the details of your environment and the exact timestamps at which you hit the issue to acasupport at microsoft dot com, so we can check the logs to see whether the issue is on the ACA side?

MatthewWilkes commented 1 week ago

Hello @chinadragon0515, thanks for your message. I've updated the diagram with some extra information:

image

To answer your questions:

  1. Yes, this is a consumption-only container app environment.
  2. Yes, 10.0.0.5 is the IP for the private endpoint of the PLS provisioned for testing in subscription C. They are both in the default subnet of the debug-mw-1-vnet virtual network, which is completely isolated from the application environments. 10.0.0.4 is the IP for the debug-mw-1 virtual machine in that vnet. Those are the items listed in the tcpdump capture.
  3. This is 10.1.2.152, in the snet-infrastructure subnet of the application vnet. We can access this directly from a virtual machine in the snet-management subnet of the application vnet. Confusingly, that VM also happens to have the IP 10.0.0.4, but all the packet dumps I shared above are from debug-mw-1, not that VM.
  4. The application environment is configured with internal=true, so it cannot be accessed from the public internet directly. Accesses from within the vnet do work reliably.
  5. Yes, I will send this.