tailscale / tailscale

The easiest, most secure way to use WireGuard and 2FA.
https://tailscale.com

Egress proxy pod loses connection with controlmap to receive netmap updates #12153

Open gjhenrique opened 3 weeks ago

gjhenrique commented 3 weeks ago

What is the issue?

Hi,

Since we bumped Tailscale from 1.58.2, the egress proxy pods sometimes lose their connection. We receive the dreaded map response long-poll timed out! and the machine shows up as offline.

I could narrow it down to a few observations:

Going into assumption mode: the DNAT rule that containerboot injects somehow interferes with the long-polling connection between the client and the control plane server. I couldn't find any commit between 1.58.2 and 1.60.1 that would pinpoint a breakage, so I assume it's a timing issue where the ongoing connection now happens to coincide with tailscale inserting that rule.
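
To make it concrete, this is roughly the kind of rule I'm talking about, expressed as a small Go sketch shelling out to iptables. This is not containerboot's actual code, just an illustration; the IPs are the ones from my logs below.

```go
// Sketch of the kind of egress DNAT/SNAT rules under discussion (not
// containerboot's real implementation, which lives in cmd/containerboot).
package main

import (
	"log"
	"os/exec"
)

func iptables(args ...string) {
	out, err := exec.Command("iptables", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("iptables %v failed: %v\n%s", args, err, out)
	}
}

func main() {
	const tailnetTarget = "100.95.173.103" // egress destination from my logs below
	// Anything arriving at the proxy pod gets rewritten to the tailnet target...
	iptables("-t", "nat", "-A", "PREROUTING", "-j", "DNAT", "--to-destination", tailnetTarget)
	// ...and the return path is masqueraded so replies flow back through the pod.
	iptables("-t", "nat", "-A", "POSTROUTING", "--destination", tailnetTarget, "-j", "MASQUERADE")
}
```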

An acceptable fix would be for Tailscale to create a new connection when direct.go cancels the context. That is not what happens today, so the process is not "self-healing".
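
By "self-healing" I mean something along these lines (my own sketch, not direct.go): when the long-poll dies with a cancelled context, back off and dial a fresh connection instead of staying offline.

```go
// Sketch of the retry behaviour I'd expect: when the long-poll ends with a
// cancelled or timed-out context, reconnect with backoff.
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// pollNetMap stands in for the real long-poll; here it just blocks until the
// context is cancelled or times out.
func pollNetMap(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	backoff := time.Second
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
		err := pollNetMap(ctx)
		cancel()
		if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			log.Printf("map poll ended (%v); reconnecting in %v", err, backoff)
			time.Sleep(backoff)
			if backoff < 30*time.Second {
				backoff *= 2
			}
			continue // dial a brand new connection instead of staying offline
		}
		log.Printf("map poll returned: %v", err)
		backoff = time.Second
	}
}
```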

Apparently the blocking happens on the client side, so it will be hard for you to investigate from the server alone; I'm open to a call if you want one.

Steps to reproduce

Deploy an egress proxy pod with 1.60.1. The issue is hard to reproduce, so you might need to kill the pod a few times.
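
For what it's worth, this is roughly how I stress it; the namespace and label selector below are placeholders for my setup, not anything Tailscale-defined.

```go
// Rough repro loop (placeholder namespace/label): delete the egress proxy pod,
// wait for it to come back, and grep its logs for the long-poll timeout.
package main

import (
	"bytes"
	"log"
	"os/exec"
	"time"
)

func main() {
	const ns = "tailscale"              // placeholder namespace
	const selector = "app=egress-proxy" // placeholder label on the proxy pod

	for attempt := 1; ; attempt++ {
		// Kill the pod and let it be recreated.
		if err := exec.Command("kubectl", "-n", ns, "delete", "pod", "-l", selector).Run(); err != nil {
			log.Fatalf("kubectl delete: %v", err)
		}
		time.Sleep(3 * time.Minute) // give the new pod time to start and (maybe) wedge

		out, err := exec.Command("kubectl", "-n", ns, "logs", "-l", selector, "--tail=-1").CombinedOutput()
		if err != nil {
			log.Printf("attempt %d: kubectl logs: %v", attempt, err)
			continue
		}
		if bytes.Contains(out, []byte("map response long-poll timed out")) {
			log.Fatalf("attempt %d: reproduced the long-poll timeout", attempt)
		}
		log.Printf("attempt %d: pod looks healthy so far", attempt)
	}
}
```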

Are there any recent changes that introduced the issue?

Not that I'm aware of

OS

Linux

OS version

Ubuntu 22.04

Tailscale version

1.60.2

Other software

No response

Bug report

No response

gjhenrique commented 2 weeks ago

I think it's actually a timing issue that also happened before 1.58.2. It's just harder to reproduce there, but I saw it in the wild today.

irbekrm commented 2 weeks ago

Hi @gjhenrique, thank you for the detailed issue description. So far I haven't been able to reproduce it; as you say, it seems to happen only occasionally. I will keep trying, but if you happen to see this again, would you be able to get logs from the proxy up until the map response long-poll timed out error? Or, alternatively, a bug report?

irbekrm commented 1 week ago

Slightly similar https://github.com/tailscale/tailscale/issues/11519

Another thing, on top of the logs/bug report, that would be good to find out is how much data gets transferred via the egress proxy.

gjhenrique commented 1 week ago

Hi @irbekrm

Thanks for jumping in

I will capture a bug report the next time I see this in the wild.

In the meantime, I have the logs I recorded previously:

2024/05/09 22:02:02 generating new machine key
2024/05/09 22:02:02 machine key written to store
2024/05/09 22:02:02 Backend: logs: be:a10f279a70cb186201e80a565736c9347170cac06c191cf9009b3beff7629e2f fe:
2024/05/09 22:02:02 control: client.Login(false, 0)
2024/05/09 22:02:02 control: doLogin(regen=false, hasUrl=false)
2024/05/09 22:02:02 health("overall"): error: not in map poll
boot: 2024/05/09 22:02:02 tailscaled in state "NoState", waiting
2024/05/09 22:02:02 control: control server key from https://controlplane.tailscale.com: ts2021=[fSeS+], legacy=[nlFWp]
2024/05/09 22:02:02 control: Generating a new nodekey.
2024/05/09 22:02:02 control: RegisterReq: onode= node=[4TQuE] fup=false nks=false
2024/05/09 22:02:03 control: RegisterReq: got response; nodeKeyExpired=false, machineAuthorized=true; authURL=false
2024/05/09 22:02:04 control: netmap: got new dial plan from control
2024/05/09 22:02:04 using tailnet default auto-update setting: false
2024/05/09 22:02:04 EditPrefs: MaskedPrefs{AutoUpdate={Apply=false}}
2024/05/09 22:02:04 active login: guilherme2-coverservice.mau-pauling.ts.net
2024/05/09 22:02:04 Switching ipn state NoState -> Starting (WantRunning=true, nm=true)
2024/05/09 22:02:04 magicsock: SetPrivateKey called (init)
2024/05/09 22:02:04 wgengine: Reconfig: configuring userspace WireGuard config (with 0/24 peers)
2024/05/09 22:02:04 wgengine: Reconfig: starting up network logger (node:4a25afd81bced37ec92881e172853d5916e2d152c1c0508a7260bb091d3afcf4 tailnet:e044205ed4c9c1a1da0cbe5be4bf90a36ccd96208b4b6832921b620ebd301618)
2024/05/09 22:02:04 wgengine: Reconfig: configuring router
boot: 2024/05/09 22:02:04 tailscaled in state "Starting", waiting
2024/05/09 22:02:04 wgengine: Reconfig: configuring DNS
2024/05/09 22:02:04 dns: Set: {DefaultResolvers:[] Routes:{} SearchDomains:[] Hosts:25}
2024/05/09 22:02:04 dns: Resolvercfg: {Routes:{} Hosts:25 LocalDomains:[], cloud="aws"}
2024/05/09 22:02:04 dns: OScfg: {}
2024/05/09 22:02:04 peerapi: serving on http://100.93.57.60:38348
2024/05/09 22:02:04 peerapi: serving on http://[fd7a:115c:a1e0::a101:393c]:49688
2024/05/09 22:02:04 magicsock: home is now derp-4 (fra)
2024/05/09 22:02:04 magicsock: adding connection to derp-4 for home-keep-alive
2024/05/09 22:02:04 control: NetInfo: NetInfo{varies=true hairpin=false ipv6=false ipv6os=true udp=true icmpv4=false derp=#4 portmap= link="" firewallmode="ipt-forced"}
2024/05/09 22:02:04 magicsock: 1 active derp conns: derp-4=cr0s,wr0s
2024/05/09 22:02:04 magicsock: endpoints changed: 18.153.213.153:33504 (stun), 10.122.3.206:50071 (local)
2024/05/09 22:02:04 Switching ipn state Starting -> Running (WantRunning=true, nm=true)
2024/05/09 22:02:04 derphttp.Client.Connect: connecting to derp-4 (fra)
boot: 2024/05/09 22:02:04 [v1] kernel supports IPv6 policy routing (found 6 rules)
boot: 2024/05/09 22:02:04 v6 = true, v6filter = true, v6nat = true
boot: 2024/05/09 22:02:04 Installing forwarding rules for destination 100.95.173.103
2024/05/09 22:02:04 wgengine: Reconfig: configuring userspace WireGuard config (with 1/24 peers)
2024/05/09 22:02:04 Accept: UDP{100.93.57.60:3478 > 100.95.173.103:50071} 72 ok out
2024/05/09 22:02:04 magicsock: adding connection to derp-11 for [d6vHL]
2024/05/09 22:02:04 magicsock: 2 active derp conns: derp-4=cr19ms,wr19ms derp-11=cr0s,wr0s
2024/05/09 22:02:04 derphttp.Client.Send: connecting to derp-11 (sao)
2024/05/09 22:02:04 Accept: UDP{100.93.57.60:14187 > 100.95.173.103:50071} 72 ok out
boot: 2024/05/09 22:02:04 Not installing egress forwarding rules for IPv6 as this is currently not supported
2024/05/09 22:02:04 magicsock: derp-4 connected; connGen=1
2024/05/09 22:02:04 health("overall"): ok
2024/05/09 22:02:04 Accept: UDP{100.93.57.60:19137 > 100.95.173.103:50071} 72 ok out
boot: 2024/05/09 22:02:04 Startup complete, waiting for shutdown signal
2024/05/09 22:02:05 magicsock: derp-11 connected; connGen=1
2024/05/09 22:02:05 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:02:09 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:02:14 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:02:17 Accept: TCP{100.93.57.60:80 > 100.95.173.103:56820} 143 ok out
2024/05/09 22:02:20 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:02:25 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:02:30 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:02:30 magicsock: endpoints changed: 18.153.213.153:45584 (stun), 18.153.213.153:33504 (stun), 10.122.3.206:50071 (local)

2024/05/09 22:04:03 control: map response long-poll timed out!
2024/05/09 22:04:03 control: lite map update error after 2m0.001s: Post "https://controlplane.tailscale.com/machine/map": context canceled
2024/05/09 22:04:04 control: map response long-poll timed out!
2024/05/09 22:04:04 Received error: PollNetMap: context canceled
2024/05/09 22:04:05 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:04:07 Accept: TCP{100.93.57.60:443 > 100.95.173.103:36588} 40 ok out
2024/05/09 22:04:11 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:04:11 health("overall"): error: no map response in 2m7s
gjhenrique commented 1 week ago

I was busy with some other topics and couldn't reply earlier with an interesting piece of information. Since you jumped in, I can share it now, and maybe you'll have some better ideas on how to follow up.

While the context deadline exceeded error was happening, I saw the packets from the control plane to the host hitting the FORWARD chain :confused:

Running conntrack -E on a pod and starting the connection from scratch, I could see these entries.

The 3.78.132.46 IP is from controlplane.tailscale.com.

WORKING
[NEW] tcp      6 300 ESTABLISHED src=10.122.1.152 dst=3.78.132.46 sport=37230 dport=80 [UNREPLIED] src=3.78.132.46 dst=10.122.1.152 sport=80 dport=37230

BROKEN
[NEW] tcp      6 300 ESTABLISHED src=3.78.132.46 dst=10.122.8.137 sport=80 dport=59974 [UNREPLIED] src=100.85.51.113 dst=100.104.164.14 sport=59974 dport=80

In the broken case, the packet coming from the control plane gets registered as part of the NATed egress connection; look at the reply tuple after UNREPLIED.

I have a strong suspicion that Tailscale only activates the nf_conntrack module after the initial TCP handshake, so conntrack loses track of who started what.

Maybe the problem doesn't reproduce in other cases because of the order of packets (when the first packet goes from local to the control plane) or because nf_conntrack is activated earlier. I need to validate these assumptions.

Next steps would be:

  1. I actually lied in the issue description: the hosts that run TS are on Ubuntu 20.04, not 22.04. Maybe upgrading them would bring in a fresh, non-faulty netfilter module.
  2. I was going to experiment with adding a sleep to the containerboot firewall logic; maybe nf_conntrack gets activated sooner by the other Tailscale rules (see the sketch after this list).
  3. Going to 1.66: the new stateful filtering activates nf_conntrack earlier, so we wouldn't need to do anything. The issue is that 1.66 breaks our existing clients with authentication issues, probably because of a new environment variable, so I would need to delete all of them and let the operator create new ones from scratch.
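
For item 2, the experiment I have in mind looks roughly like this. It's my own sketch, not containerboot code, and it assumes the kernel exposes /proc/net/nf_conntrack: hold off on installing the forwarding rules until connection tracking is actually up.

```go
// Sketch for experiment 2: delay the egress rule installation until
// nf_conntrack appears to be loaded, or a timeout hits. Assumes the kernel
// exposes /proc/net/nf_conntrack; this is not what containerboot does today.
package main

import (
	"log"
	"os"
	"time"
)

// conntrackReady reports whether the nf_conntrack module appears to be loaded.
func conntrackReady() bool {
	_, err := os.Stat("/proc/net/nf_conntrack")
	return err == nil
}

func waitForConntrack(timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if conntrackReady() {
			return true
		}
		time.Sleep(500 * time.Millisecond)
	}
	return false
}

func main() {
	if !waitForConntrack(30 * time.Second) {
		log.Println("nf_conntrack still not loaded, installing rules anyway")
	}
	log.Println("installing egress forwarding rules now")
	// ...the DNAT/SNAT rules would go here, as in the earlier sketch.
}
```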