gjhenrique opened 3 weeks ago
I think it's actually a timing issue that happened before 1.58.2. It's just harder to reproduce, but I saw it in the wild today.
Hi @gjhenrique, thank you for the detailed issue description. So far I haven't been able to reproduce it; as you say, it seems like this happens only occasionally. I will keep trying, but if you happen to see this again, would you be able to get logs from the proxy up until the `map response long-poll timed out` error? Or, alternatively, a bug report.
Slightly similar: https://github.com/tailscale/tailscale/issues/11519
Another thing, on top of the logs/bug report, that would be good to find out: how much data gets transferred via the egress proxy?
Hi @irbekrm
Thanks for jumping in
I will take a bug report the next time I see this in the wild. In the meantime, I have the logs previously recorded here:
2024/05/09 22:02:02 generating new machine key
2024/05/09 22:02:02 machine key written to store
2024/05/09 22:02:02 Backend: logs: be:a10f279a70cb186201e80a565736c9347170cac06c191cf9009b3beff7629e2f fe:
2024/05/09 22:02:02 control: client.Login(false, 0)
2024/05/09 22:02:02 control: doLogin(regen=false, hasUrl=false)
2024/05/09 22:02:02 health("overall"): error: not in map poll
boot: 2024/05/09 22:02:02 tailscaled in state "NoState", waiting
2024/05/09 22:02:02 control: control server key from https://controlplane.tailscale.com: ts2021=[fSeS+], legacy=[nlFWp]
2024/05/09 22:02:02 control: Generating a new nodekey.
2024/05/09 22:02:02 control: RegisterReq: onode= node=[4TQuE] fup=false nks=false
2024/05/09 22:02:03 control: RegisterReq: got response; nodeKeyExpired=false, machineAuthorized=true; authURL=false
2024/05/09 22:02:04 control: netmap: got new dial plan from control
2024/05/09 22:02:04 using tailnet default auto-update setting: false
2024/05/09 22:02:04 EditPrefs: MaskedPrefs{AutoUpdate={Apply=false}}
2024/05/09 22:02:04 active login: guilherme2-coverservice.mau-pauling.ts.net
2024/05/09 22:02:04 Switching ipn state NoState -> Starting (WantRunning=true, nm=true)
2024/05/09 22:02:04 magicsock: SetPrivateKey called (init)
2024/05/09 22:02:04 wgengine: Reconfig: configuring userspace WireGuard config (with 0/24 peers)
2024/05/09 22:02:04 wgengine: Reconfig: starting up network logger (node:4a25afd81bced37ec92881e172853d5916e2d152c1c0508a7260bb091d3afcf4 tailnet:e044205ed4c9c1a1da0cbe5be4bf90a36ccd96208b4b6832921b620ebd301618)
2024/05/09 22:02:04 wgengine: Reconfig: configuring router
boot: 2024/05/09 22:02:04 tailscaled in state "Starting", waiting
2024/05/09 22:02:04 wgengine: Reconfig: configuring DNS
2024/05/09 22:02:04 dns: Set: {DefaultResolvers:[] Routes:{} SearchDomains:[] Hosts:25}
2024/05/09 22:02:04 dns: Resolvercfg: {Routes:{} Hosts:25 LocalDomains:[], cloud="aws"}
2024/05/09 22:02:04 dns: OScfg: {}
2024/05/09 22:02:04 peerapi: serving on http://100.93.57.60:38348
2024/05/09 22:02:04 peerapi: serving on http://[fd7a:115c:a1e0::a101:393c]:49688
2024/05/09 22:02:04 magicsock: home is now derp-4 (fra)
2024/05/09 22:02:04 magicsock: adding connection to derp-4 for home-keep-alive
2024/05/09 22:02:04 control: NetInfo: NetInfo{varies=true hairpin=false ipv6=false ipv6os=true udp=true icmpv4=false derp=#4 portmap= link="" firewallmode="ipt-forced"}
2024/05/09 22:02:04 magicsock: 1 active derp conns: derp-4=cr0s,wr0s
2024/05/09 22:02:04 magicsock: endpoints changed: 18.153.213.153:33504 (stun), 10.122.3.206:50071 (local)
2024/05/09 22:02:04 Switching ipn state Starting -> Running (WantRunning=true, nm=true)
2024/05/09 22:02:04 derphttp.Client.Connect: connecting to derp-4 (fra)
boot: 2024/05/09 22:02:04 [v1] kernel supports IPv6 policy routing (found 6 rules)
boot: 2024/05/09 22:02:04 v6 = true, v6filter = true, v6nat = true
boot: 2024/05/09 22:02:04 Installing forwarding rules for destination 100.95.173.103
2024/05/09 22:02:04 wgengine: Reconfig: configuring userspace WireGuard config (with 1/24 peers)
2024/05/09 22:02:04 Accept: UDP{100.93.57.60:3478 > 100.95.173.103:50071} 72 ok out
2024/05/09 22:02:04 magicsock: adding connection to derp-11 for [d6vHL]
2024/05/09 22:02:04 magicsock: 2 active derp conns: derp-4=cr19ms,wr19ms derp-11=cr0s,wr0s
2024/05/09 22:02:04 derphttp.Client.Send: connecting to derp-11 (sao)
2024/05/09 22:02:04 Accept: UDP{100.93.57.60:14187 > 100.95.173.103:50071} 72 ok out
boot: 2024/05/09 22:02:04 Not installing egress forwarding rules for IPv6 as this is currently not supported
2024/05/09 22:02:04 magicsock: derp-4 connected; connGen=1
2024/05/09 22:02:04 health("overall"): ok
2024/05/09 22:02:04 Accept: UDP{100.93.57.60:19137 > 100.95.173.103:50071} 72 ok out
boot: 2024/05/09 22:02:04 Startup complete, waiting for shutdown signal
2024/05/09 22:02:05 magicsock: derp-11 connected; connGen=1
2024/05/09 22:02:05 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:02:09 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:02:14 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:02:17 Accept: TCP{100.93.57.60:80 > 100.95.173.103:56820} 143 ok out
2024/05/09 22:02:20 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:02:25 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:02:30 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:02:30 magicsock: endpoints changed: 18.153.213.153:45584 (stun), 18.153.213.153:33504 (stun), 10.122.3.206:50071 (local)
2024/05/09 22:04:03 control: map response long-poll timed out!
2024/05/09 22:04:03 control: lite map update error after 2m0.001s: Post "https://controlplane.tailscale.com/machine/map": context canceled
2024/05/09 22:04:04 control: map response long-poll timed out!
2024/05/09 22:04:04 Received error: PollNetMap: context canceled
2024/05/09 22:04:05 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:04:07 Accept: TCP{100.93.57.60:443 > 100.95.173.103:36588} 40 ok out
2024/05/09 22:04:11 [unexpected] magicsock: derp-11 does not know about peer [d6vHL], removing route
2024/05/09 22:04:11 health("overall"): error: no map response in 2m7s
I was busy with some other topics and couldn't reply earlier with an interesting piece of information. Since you jumped in, I can share it now, and maybe you'll have some better ideas on how to follow up.
When seeing the context deadline exceeded, I saw the packets from the control plane to the host going to the `FORWARD` chain :confused:
Doing a `conntrack -E` on a pod and starting the connection from scratch, I could see these entries. The `3.78.132.46` IP is from controlplane.tailscale.com.
WORKING
[NEW] tcp 6 300 ESTABLISHED src=10.122.1.152 dst=3.78.132.46 sport=37230 dport=80 [UNREPLIED] src=3.78.132.46 dst=10.122.1.152 sport=80 dport=37230
BROKEN
[NEW] tcp 6 300 ESTABLISHED src=3.78.132.46 dst=10.122.8.137 sport=80 dport=59974 [UNREPLIED] src=100.85.51.113 dst=100.104.164.14 sport=59974 dport=80
In the broken case, conntrack registers the packet coming from the control plane as part of a NAT connection. Check the part after `[UNREPLIED]`.
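The asymmetry between the two entries can be checked mechanically. Here's a small sketch (the helper names are mine, not part of Tailscale or conntrack-tools) that parses the src/dst/port fields of a `conntrack -E` line and reports whether the reply tuple is just the original direction mirrored, as it should be for an un-NATed flow:

```python
import re

def parse_tuples(line):
    """Split a conntrack event line into (original, reply) tuples.

    Assumes the format printed by `conntrack -E`: the first
    src/dst/sport/dport group is the original direction, the second
    (after [UNREPLIED]) is the expected reply direction.
    """
    pairs = re.findall(r'(src|dst|sport|dport)=(\S+)', line)
    orig = dict(pairs[:4])
    reply = dict(pairs[4:8])
    return orig, reply

def reply_mirrors_original(orig, reply):
    """In an un-NATed flow, the reply tuple mirrors the original tuple."""
    return (reply['src'] == orig['dst'] and reply['dst'] == orig['src']
            and reply['sport'] == orig['dport'] and reply['dport'] == orig['sport'])

working = ("[NEW] tcp 6 300 ESTABLISHED src=10.122.1.152 dst=3.78.132.46 "
           "sport=37230 dport=80 [UNREPLIED] src=3.78.132.46 dst=10.122.1.152 "
           "sport=80 dport=37230")
broken = ("[NEW] tcp 6 300 ESTABLISHED src=3.78.132.46 dst=10.122.8.137 "
          "sport=80 dport=59974 [UNREPLIED] src=100.85.51.113 dst=100.104.164.14 "
          "sport=59974 dport=80")

print(reply_mirrors_original(*parse_tuples(working)))  # True
print(reply_mirrors_original(*parse_tuples(broken)))   # False
```

The broken entry fails the check because its reply tuple points at the Tailscale 100.x addresses rather than mirroring the pod/control-plane pair, which is exactly the mis-NATed state described above.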
I have a strong suspicion that Tailscale only activates the `nf_conntrack` module later, after the initial TCP handshake, and it loses track of who started what.
Maybe the problem doesn't reproduce in other cases because of the order of packets (when the first packet goes from local to the control plane), or because `nf_conntrack` is activated earlier. I need to validate these assumptions.
Next steps would be:

- Check whether `nf_conntrack` is activated sooner by the other Tailscale rules
- Check whether activating `nf_conntrack` earlier means we don't need to do anything else

The issue is that 1.66 is breaking existing clients because of authentication issues, probably because of a new environment variable. I would need to delete all of them and let the operator create new ones from scratch.
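To validate the activation-timing assumption, one quick check is whether `nf_conntrack` is already loaded right after pod start versus after tailscaled has installed its rules. A minimal sketch (a hypothetical helper, not part of the operator or containerboot) that parses `/proc/modules` content:

```python
def conntrack_loaded(proc_modules_text):
    """Return True if the nf_conntrack module appears in /proc/modules.

    Matching on the first whitespace-separated field avoids false
    positives from related modules such as nf_conntrack_netlink.
    """
    return any(line.split()[0] == "nf_conntrack"
               for line in proc_modules_text.splitlines() if line.strip())

# On a pod you would run something like:
# with open("/proc/modules") as f:
#     print(conntrack_loaded(f.read()))
```

Running this once at pod start and again after the proxy rules are installed would show whether the module only comes up mid-connection, which is what the conntrack entries above suggest.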
What is the issue?
Hi,
Since we bumped Tailscale from 1.58.2, the egress proxy pods sometimes lose their connection. We receive the dreaded `map response long-poll timed out!` error and the machine shows offline status. I could pinpoint a few facts:

- I ran `tcpdump` to check the second connection, and it looks like it reuses the first one. Only the client sends PUSH flags, with no response from controlplane. Probably something weird with the FW blocking those packets:

  10.122.1.124.35913 > 3.73.239.57.80: Flags [P.], seq 1:1360, ack 1, win 443, options [nop,nop,TS val 974238051 ecr 2051401776], length 1359: HTTP
  10.122.1.124.35913 > 3.73.239.57.80: Flags [P.], seq 1:1360, ack 1, win 443, options [nop,nop,TS val 974238915 ecr 2051401776], length 1359: HTTP

- Turning on assumption mode: the `DNAT` rule that `containerboot` is injecting somehow impacts the long-polling connection between the client and the controlplane server. I couldn't find any commits between 1.58.2 and 1.60.1 that could pinpoint something breaking, so I assume it's some timing issue where Tailscale inserts that rule while the connection is ongoing.
- An acceptable fix would be if TS created a new connection when `direct.go` cancels the context. This is not the case today, and the process is not "self-healing".

Apparently the blocking happens on the client, so it's going to be a bit hard for you to investigate only on the server side. I'm open for a call if you want.
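The self-healing behaviour suggested above could look roughly like the following sketch. The `dial`/`poll` callables are hypothetical stand-ins, not Tailscale's actual `direct.go` API: on a long-poll timeout, the loop closes the connection and dials a brand-new one instead of retrying on the old, possibly mis-tracked connection:

```python
import time

def long_poll_with_redial(dial, poll, attempts=3, backoff=1.0):
    """Long-poll loop that re-dials on timeout instead of reusing a connection.

    dial: () -> connection object with a close() method
    poll: (conn) -> response; raises TimeoutError on long-poll timeout
    """
    last_err = None
    for i in range(attempts):
        conn = dial()  # always a fresh TCP connection
        try:
            return poll(conn)
        except TimeoutError as e:
            last_err = e
            conn.close()  # drop the (possibly conntrack-poisoned) connection
            time.sleep(backoff * (2 ** i))  # simple exponential backoff
    raise last_err
```

The point of the sketch is only the recovery shape: a timeout on the map long-poll should tear the connection down rather than keep pushing retransmits into a flow whose conntrack entry is already wrong.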
Steps to reproduce
Deploy an egress proxy pod with `1.60.1`. It's hard to reproduce this issue, so you might need to kill the pod a few times.

Are there any recent changes that introduced the issue?
Not that I'm aware of
OS
Linux
OS version
Ubuntu 22.04
Tailscale version
1.60.2
Other software
No response
Bug report
No response