telepresenceio / telepresence

Local development against a remote Kubernetes or OpenShift cluster
https://www.telepresence.io

When VPN is connected, telepresence can't connect to the cluster #3704

Closed kaloyanDEV closed 1 month ago

kaloyanDEV commented 1 month ago

Describe the bug

I can't curl -L services using service-name:port, service-name.namespace.svc.cluster.local:port, or service-ip:port.

To Reproduce

telepresence connect
Launching Telepresence User Daemon
Launching Telepresence Root Daemon
telepresence connect: error: connector.Connect: failed to connect to root daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded

2024-10-10 11:53:28.7677 info ---
2024-10-10 11:53:28.7677 info Telepresence daemon v2.20.0 (api v3) starting...
2024-10-10 11:53:28.7677 info PID is 35664
2024-10-10 11:53:28.7677 info
2024-10-10 11:53:28.8832 info daemon/server-grpc : gRPC server started
2024-10-10 11:53:31.8384 info daemon/session : -- Starting new session
2024-10-10 11:53:32.4750 info daemon/session : Connected to OSS Traffic Manager v2.20.0
2024-10-10 11:53:32.4754 info daemon/session : Connected to Manager 2.20.0
2024-10-10 11:53:32.5205 info daemon/session : also-proxy subnets []
2024-10-10 11:53:32.5205 info daemon/session : never-proxy subnets [10.140.180.96/32]
2024-10-10 11:53:32.5205 info daemon/session : allow-conflicting subnets []
2024-10-10 11:53:32.5210 info daemon/session : Configuration reloaded
2024-10-10 11:53:32.5647 info daemon/session/network : also-proxy subnets []
2024-10-10 11:53:32.5647 info daemon/session/network : never-proxy subnets [10.140.180.96/32]
2024-10-10 11:53:32.5647 info daemon/session/network : allow-conflicting subnets []
2024-10-10 11:53:32.9859 info daemon/session/agentPods : Connected to OSS Traffic Agent v2.20.0
2024-10-10 11:53:33.0648 warning daemon/session/network : Manager IP 10.130.6.108 is connectable but not a traffic-manager instance (rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded). Will proxy pods, but this may interfere with your VPN routes.
2024-10-10 11:53:33.5654 info daemon/session/network : Adding service subnet 172.30.0.0/16
2024-10-10 11:53:33.5654 info daemon/session/network : Adding pod subnet 10.128.0.0/20
2024-10-10 11:53:33.5654 info daemon/session/network : Adding pod subnet 10.129.0.0/21
2024-10-10 11:53:33.5654 info daemon/session/network : Adding pod subnet 10.130.4.0/22
2024-10-10 11:53:33.5654 info daemon/session/network : Adding pod subnet 10.131.0.0/21
2024-10-10 11:53:33.5731 info daemon/session/network : Creating interface tel0
2024-10-10 11:53:33.6442 info stdlog : Using existing driver 0.14
2024-10-10 11:53:33.6488 info stdlog : Creating adapter
2024-10-10 11:53:34.0791 info daemon/session/network : Setting cluster DNS to 10.130.6.108
2024-10-10 11:53:34.0791 info daemon/session/network : Setting cluster domain to "cluster.local."
2024-10-10 11:53:34.0791 info daemon/session/network : Dropping never-proxy "10.140.180.96/32" because it is not routed
2024-10-10 11:53:34.0819 info daemon/session/network : Starting Endpoint
2024-10-10 11:53:34.1774 info daemon/metriton : scout report "update_routes" failed: Post "https://metriton.datawire.io/scout": dial tcp: lookup metriton.datawire.io: no such host
2024-10-10 11:53:36.5188 error daemon/session/network : failed to retrieve route for subnet 172.30.0.0/16:
2024-10-10 11:53:38.7612 info daemon/session/dns : Using fallback DNS server: 10.58.194.16
2024-10-10 11:53:58.8399 error daemon/session/dns/SearchPaths : DNS doesn't seem to work properly

telepresence quit -s
telepresence connect --proxy-via service=core
Launching Telepresence User Daemon
Launching Telepresence Root Daemon
telepresence connect: error: connector.Connect: failed to connect to root daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded

2024-10-10 12:01:37.9361 info Telepresence daemon v2.20.0 (api v3) starting...
2024-10-10 12:01:37.9361 info PID is 33036
2024-10-10 12:01:37.9361 info
2024-10-10 12:01:38.0415 info daemon/server-grpc : gRPC server started
2024-10-10 12:01:41.3524 info daemon/session : -- Starting new session
2024-10-10 12:01:42.0104 info daemon/session : Connected to OSS Traffic Manager v2.20.0
2024-10-10 12:01:42.0104 info daemon/session : Connected to Manager 2.20.0
2024-10-10 12:01:42.0587 info daemon/session : also-proxy subnets []
2024-10-10 12:01:42.0587 info daemon/session : never-proxy subnets [10.140.180.96/32]
2024-10-10 12:01:42.0587 info daemon/session : allow-conflicting subnets []
2024-10-10 12:01:42.0587 info daemon/session : Configuration reloaded
2024-10-10 12:01:42.1563 info daemon/session/network : also-proxy subnets []
2024-10-10 12:01:42.1563 info daemon/session/network : never-proxy subnets [10.140.180.96/32]
2024-10-10 12:01:42.1563 info daemon/session/network : allow-conflicting subnets []
2024-10-10 12:01:42.5549 info daemon/session/agentPods : Connected to OSS Traffic Agent v2.20.0
2024-10-10 12:01:42.6566 warning daemon/session/network : Manager IP 10.130.6.108 is connectable but not a traffic-manager instance (rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded). Will proxy pods, but this may interfere with your VPN routes.
2024-10-10 12:01:43.1574 info daemon/session/network : Will not proxy service subnet 172.30.0.0/16, because it is covered by --proxy-via service=core
2024-10-10 12:01:43.1575 info daemon/session/network : Adding pod subnet 10.128.0.0/20
2024-10-10 12:01:43.1575 info daemon/session/network : Adding pod subnet 10.129.0.0/21
2024-10-10 12:01:43.1575 info daemon/session/network : Adding pod subnet 10.130.4.0/22
2024-10-10 12:01:43.1575 info daemon/session/network : Adding pod subnet 10.131.0.0/21
2024-10-10 12:01:43.1666 info daemon/session/network : Creating interface tel0
2024-10-10 12:01:43.2470 info stdlog : Using existing driver 0.14
2024-10-10 12:01:43.2538 info stdlog : Creating adapter
2024-10-10 12:01:43.8814 info daemon/session/network : Setting cluster DNS to 10.130.6.108
2024-10-10 12:01:43.8814 info daemon/session/network : Setting cluster domain to "cluster.local."
2024-10-10 12:01:43.8814 info daemon/session/network : Dropping never-proxy "10.140.180.96/32" because it is not routed
2024-10-10 12:01:43.8846 info daemon/session/network : Starting Endpoint
2024-10-10 12:01:44.0589 info daemon/metriton : scout report "update_routes" failed: Post "https://metriton.datawire.io/scout": dial tcp: lookup metriton.datawire.io: no such host
2024-10-10 12:01:46.6750 error daemon/session/network : failed to retrieve route for subnet 10.128.0.0/20:
2024-10-10 12:01:49.1789 info daemon/session/dns : Using fallback DNS server: 10.58.194.16
2024-10-10 12:02:09.2079 error daemon/session/dns/SearchPaths : DNS doesn't seem to work properly
2024-10-10 12:02:19.1656 info daemon/session : -- Session ended
2024-10-10 12:02:19.1656 info daemon/session:shutdown_logger : shutting down (gracefully)...
2024-10-10 12:02:19.1656 info daemon/session/dns/Server:shutdown_logger : shutting down (gracefully)...
2024-10-10 12:02:19.1656 info daemon/session/dns:shutdown_logger : shutting down (gracefully)...
2024-10-10 12:02:19.6747 info daemon/metriton : scout report "incluster_dns_queries" failed: Post "https://metriton.datawire.io/scout": dial tcp: lookup metriton.datawire.io: no such host
2024-10-10 12:02:20.2108 error daemon/session/agentPods : goroutine "/daemon/session/agentPods" exited with error: rpc error: code = Canceled desc = context canceled
2024-10-10 12:02:20.2127 error daemon/session : proxy-via agent in core failed: context deadline exceeded
2024-10-10 12:02:20.2127 info daemon/session : Configuration reloaded

telepresence status
OSS User Daemon: Running
  Version           : 2.20.0
  Executable        : C:\telepresence\telepresence.exe
  Install ID        : 19a84e89-5960-47bc-be80-e6c3e31bd939
  Status            : Not connected
  Kubernetes server :
  Kubernetes context:
  Namespace         :
  Manager namespace :
  Intercepts        : 0 total
OSS Root Daemon: Running
  Version: v2.20.0
Traffic Manager: Not connected

Expected behavior

To be able to connect.

Versions (please complete the following information):

VPN-related bugs:

Additional context

I am running Ubuntu as a VM on the same Windows host machine (again connected to the VPN). There I can connect with telepresence. The only issue there is that I need to use the FQDN of a service when running curl.

thallgren commented 1 month ago

@kaloyanDEV can you please try and enable debug logging, and then after you've done that, try the above connects again?

Daemon log levels are configured in a file named config.yml in the directory %APPDATA%\telepresence. Create the file and add the following content:

logLevels:
  userDaemon: debug
  rootDaemon: debug

This will make the logging more verbose and hopefully give more hints about where the source of the problem is.
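As a minimal sketch of the step described above, the following Python snippet writes that same logLevels content to a config.yml file. The function name `write_debug_config` and the choice to refuse overwriting an existing file are assumptions made for illustration; they are not part of Telepresence.

```python
from pathlib import Path

# The exact YAML content quoted in the comment above.
DEBUG_LOG_LEVELS = "logLevels:\n  userDaemon: debug\n  rootDaemon: debug\n"


def write_debug_config(config_dir: Path) -> Path:
    """Create config.yml with debug log levels inside config_dir.

    Hypothetical helper: it refuses to overwrite an existing config.yml,
    because that file may already hold other settings.
    """
    config_dir.mkdir(parents=True, exist_ok=True)
    config_file = config_dir / "config.yml"
    if config_file.exists():
        raise FileExistsError(f"{config_file} already exists; edit it by hand")
    config_file.write_text(DEBUG_LOG_LEVELS, encoding="utf-8")
    return config_file
```

On Windows, the directory named in the comment would be passed as `Path(os.environ["APPDATA"]) / "telepresence"` (after `import os`). The daemons presumably need to be restarted, for example with `telepresence quit -s` as used earlier in this thread, before the new log level takes effect.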

kaloyanDEV commented 1 month ago

telepresence connect
Launching Telepresence User Daemon
Launching Telepresence Root Daemon
telepresence connect: error: connector.Connect: failed to connect to root daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2024-10-11 09:48:50.9809 info    Telepresence daemon v2.20.0 (api v3) starting...
2024-10-11 09:48:50.9809 info    PID is 35908
2024-10-11 09:48:50.9809 info    
2024-10-11 09:48:50.9863 debug   Listener opened
2024-10-11 09:48:51.4980 info    daemon/server-grpc : gRPC server started
2024-10-11 09:48:55.9704 debug   daemon/server-grpc/conn=4 : Received gRPC Connect
2024-10-11 09:48:55.9704 info    daemon/session : -- Starting new session
2024-10-11 09:48:57.2177 debug   daemon/session : k8sPortForwardDialer.dial(ctx, Pod./traffic-manager-6bd7787469-6ftl8.ambassador, 8081)
2024-10-11 09:48:57.2177 debug   daemon/session : k8sPortForwardDialer.spdyDial(ctx, Pod./traffic-manager-6bd7787469-6ftl8.ambassador)
2024-10-11 09:48:57.6469 info    daemon/session : Connected to OSS Traffic Manager v2.20.0
2024-10-11 09:48:57.6469 info    daemon/session : Connected to Manager 2.20.0
2024-10-11 09:48:57.7069 debug   daemon/session : Creating session with id session_id:"3874e1a9-bcdc-430b-a15e-0e15c3e52aad" cluster_id:"424f58c5-5f9a-45d3-8e9f-2aa019f7f447" install_id:"19a84e89-5960-47bc-be80-e6c3e31bd939"
2024-10-11 09:48:57.7155 info    daemon/session : also-proxy subnets []
2024-10-11 09:48:57.7155 info    daemon/session : never-proxy subnets [10.140.180.96/32]
2024-10-11 09:48:57.7155 info    daemon/session : allow-conflicting subnets []
2024-10-11 09:48:57.7923 info    daemon/session : Configuration reloaded
2024-10-11 09:48:57.8337 debug   daemon/session : Returning session from new session session_id:"3874e1a9-bcdc-430b-a15e-0e15c3e52aad" cluster_id:"424f58c5-5f9a-45d3-8e9f-2aa019f7f447" install_id:"19a84e89-5960-47bc-be80-e6c3e31bd939"
2024-10-11 09:48:57.9035 info    daemon/session/network : also-proxy subnets []
2024-10-11 09:48:57.9035 info    daemon/session/network : never-proxy subnets [10.140.180.96/32]
2024-10-11 09:48:57.9035 info    daemon/session/network : allow-conflicting subnets []
2024-10-11 09:48:57.9035 debug   daemon/session/network : Performing pod connectivity check on IP 10.130.6.108 with timeout 500ms
2024-10-11 09:48:57.9352 debug   daemon/session/agentPods : WatchAgentPods starting
2024-10-11 09:48:58.1138 debug   daemon/session : k8sPortForwardDialer.dial(ctx, Pod./pds-849bcb89b6-hfq8n.dev1-multitenant, 32905)
2024-10-11 09:48:58.1138 debug   daemon/session : k8sPortForwardDialer.spdyDial(ctx, Pod./pds-849bcb89b6-hfq8n.dev1-multitenant)
2024-10-11 09:48:58.4126 warning daemon/session/network : Manager IP 10.130.6.108 is connectable but not a traffic-manager instance (rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded). Will proxy pods, but this may interfere with your VPN routes.
2024-10-11 09:48:58.4126 debug   daemon/session/network : Performing service connectivity check on https://172.30.102.115:443/healthz with Host agent-injector.ambassador and timeout 500ms
2024-10-11 09:48:58.5459 info    daemon/session/agentPods : Connected to OSS Traffic Agent v2.20.0
2024-10-11 09:48:58.9129 debug   daemon/session/network : Will proxy services (Get "https://172.30.102.115:443/healthz": context deadline exceeded)
2024-10-11 09:48:58.9129 debug   daemon/session/network : WatchClusterInfo update
2024-10-11 09:48:58.9129 info    daemon/session/network : Adding service subnet 172.30.0.0/16
2024-10-11 09:48:58.9129 info    daemon/session/network : Adding pod subnet 10.128.0.0/20
2024-10-11 09:48:58.9129 info    daemon/session/network : Adding pod subnet 10.129.0.0/21
2024-10-11 09:48:58.9129 info    daemon/session/network : Adding pod subnet 10.130.4.0/22
2024-10-11 09:48:58.9129 info    daemon/session/network : Adding pod subnet 10.131.0.0/21
2024-10-11 09:48:58.9281 info    daemon/session/network : Creating interface tel0
2024-10-11 09:48:59.0534 info    stdlog : Using existing driver 0.14
2024-10-11 09:48:59.0597 info    stdlog : Creating adapter
2024-10-11 09:48:59.8880 info    daemon/session/network : Starting Endpoint
2024-10-11 09:48:59.8891 info    daemon/session/network : Setting cluster DNS to 10.130.6.108
2024-10-11 09:48:59.8891 info    daemon/session/network : Setting cluster domain to "cluster.local."
2024-10-11 09:48:59.8891 info    daemon/session/network : Dropping never-proxy "10.140.180.96/32" because it is not routed
2024-10-11 09:48:59.9327 info    daemon/metriton : scout report "update_routes" failed: Post "https://metriton.datawire.io/scout": dial tcp: lookup metriton.datawire.io: no such host
2024-10-11 09:49:02.6228 error   daemon/session/network : failed to retrieve route for subnet 172.30.0.0/16: <nil>
2024-10-11 09:49:05.2542 info    daemon/session/dns : Using fallback DNS server: 10.58.194.16
2024-10-11 09:49:05.2542 debug   daemon/session/dns/Server : SetDNS server: 10.130.6.108, searchList: [tel2-search], domain: "cluster.local."
2024-10-11 09:49:05.2639 debug   daemon/session/dns/Server : SetDNS done
2024-10-11 09:49:05.2639 debug   daemon/session/dns/SearchPaths : Performing initial recursion check with tel2-recursion-check.tel2-search

thallgren commented 1 month ago

Everything seems normal in that log. Is that the last thing that gets printed?

kaloyanDEV commented 1 month ago

2024-10-11 09:49:25.3789 error   daemon/session/dns/SearchPaths : DNS doesn't seem to work properly
2024-10-11 09:49:25.3789 debug   daemon/session/dns/SearchPaths : Recursion check finished
2024-10-11 09:49:25.4256 debug   daemon/session/dns/SearchPaths : SetDNS server: 10.130.6.108, searchList: [tel2-search dev1-multitenant], domain: "cluster.local."
2024-10-11 09:49:25.4457 debug   daemon/session/dns/SearchPaths : SetDNS done

thallgren commented 1 month ago

I have a hard time understanding why the recursion check takes 20 seconds to complete. It ought to take between 5 and 6 seconds (or much less if it succeeds). Does your machine have very limited resources somehow? What type of hardware is used here, and what version of Windows 11?
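The 20-second figure can be read off the two debug log lines quoted earlier in the thread: the start of the recursion check and the DNS error that follows it. A small Python sketch of that arithmetic, assuming the timestamp is always the first two whitespace-separated fields of a log line (the helper names are made up for illustration):

```python
from datetime import datetime

# Timestamp layout seen in the daemon log lines in this thread.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_log_time(line: str) -> datetime:
    """Parse the leading 'date time' fields of a daemon log line."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", LOG_TIME_FORMAT)


def seconds_between(start_line: str, end_line: str) -> float:
    """Return the elapsed seconds between two log lines."""
    return (parse_log_time(end_line) - parse_log_time(start_line)).total_seconds()


# The two lines are copied from the debug log posted above.
start = (
    "2024-10-11 09:49:05.2639 debug   daemon/session/dns/SearchPaths : "
    "Performing initial recursion check with tel2-recursion-check.tel2-search"
)
end = (
    "2024-10-11 09:49:25.3789 error   daemon/session/dns/SearchPaths : "
    "DNS doesn't seem to work properly"
)

elapsed = seconds_between(start, end)
print(f"{elapsed:.3f}")  # prints 20.115
```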

kaloyanDEV commented 1 month ago

Processor   Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz   2.71 GHz
Installed RAM   32.0 GB (31.6 GB usable)
System type 64-bit operating system, x64-based processor

I tried connecting from a network that can reach the cluster without the VPN, and it worked. It could be some sort of subnet overlap. I read https://www.getambassador.io/docs/telepresence/latest/reference/vpn and tried to understand it, but I'm still clueless.

WITHOUT VPN

Wireless LAN adapter Wi-Fi:

   Connection-specific DNS Suffix  . : 
   Link-local IPv6 Address . . . . . : 
   IPv4 Address. . . . . . . . . . . : 10.10.206.238
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.10.206.1

Unknown adapter tel0:

   Connection-specific DNS Suffix  . : cluster.local
   IPv4 Address. . . . . . . . . . . : 10.128.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   IPv4 Address. . . . . . . . . . . : 10.129.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.248.0
   IPv4 Address. . . . . . . . . . . : 10.130.4.0
   Subnet Mask . . . . . . . . . . . : 255.255.252.0
   IPv4 Address. . . . . . . . . . . : 10.131.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.248.0
   IPv4 Address. . . . . . . . . . . : 172.30.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.0.0
   Default Gateway . . . . . . . . . :

WITH VPN

(Cisco VIF)
Ethernet adapter Ethernet 2:

   Connection-specific DNS Suffix  . : 
   Link-local IPv6 Address . . . . . : fe80::205:9aff:fe3c:7a00%3
   Link-local IPv6 Address . . . . . : fe80::3ca6:5ed5:4b85:8cbd%3
   IPv4 Address. . . . . . . . . . . : 10.142.167.29
   Subnet Mask . . . . . . . . . . . : 255.255.192.0
   Default Gateway . . . . . . . . . : ::
                                       10.142.128.1

Unknown adapter tel0:

   Connection-specific DNS Suffix  . : cluster.local
   IPv4 Address. . . . . . . . . . . : 10.128.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   IPv4 Address. . . . . . . . . . . : 10.129.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.248.0
   IPv4 Address. . . . . . . . . . . : 10.130.4.0
   Subnet Mask . . . . . . . . . . . : 255.255.252.0
   IPv4 Address. . . . . . . . . . . : 10.131.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.248.0
   IPv4 Address. . . . . . . . . . . : 172.30.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.0.0
   Default Gateway . . . . . . . . . :

thallgren commented 1 month ago

I don't see any subnet overlap between the Cisco VPN and Telepresence, but it's likely that Cisco installs a DNS in a way that Telepresence is unable to override. Do you have any information about how they configure VPN? Any commands that can tell you what's going on?
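The "no overlap" observation can be reproduced from the ipconfig output above with Python's standard `ipaddress` module. This is only a sketch: the addresses and masks are copied from the thread, while the helper name `overlapping_pairs` is hypothetical.

```python
import ipaddress


def overlapping_pairs(local_networks, proxied_networks):
    """Return every (local, proxied) pair of networks that overlap."""
    return [
        (local, proxied)
        for local in local_networks
        for proxied in proxied_networks
        if local.overlaps(proxied)
    ]


# Cisco VPN adapter from the "WITH VPN" ipconfig output (address/netmask).
vpn_interface = ipaddress.ip_interface("10.142.167.29/255.255.192.0")

# Subnets assigned to the tel0 adapter (pod subnets plus the service subnet).
tel0_networks = [
    ipaddress.ip_network("10.128.0.0/255.255.240.0"),
    ipaddress.ip_network("10.129.0.0/255.255.248.0"),
    ipaddress.ip_network("10.130.4.0/255.255.252.0"),
    ipaddress.ip_network("10.131.0.0/255.255.248.0"),
    ipaddress.ip_network("172.30.0.0/255.255.0.0"),
]

conflicts = overlapping_pairs([vpn_interface.network], tel0_networks)
print(vpn_interface.network)  # prints 10.142.128.0/18
print(conflicts)              # prints []
```

This compares only the adapter's own subnet. Routes that the VPN client pushes for other networks would not appear in ipconfig, so an empty result here does not rule out a routing or DNS conflict.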

thallgren commented 1 month ago

Closing due to unresponsive reporter.

gabrielpulga commented 23 hours ago

This is exactly what is happening on my local machine as well. Can we reopen this thread? @thallgren

thallgren commented 22 hours ago

@gabrielpulga do you have a VPN issue related to Cisco VPN?

gabrielpulga commented 22 hours ago

@thallgren it is the same issue, but with NordLayer as the VPN.

thallgren commented 22 hours ago

When you say "the same issue", do you mean that you don't see any overlapping subnets, but still have a malfunctioning DNS? Also, are you using the same version, and on a Windows platform?

gabrielpulga commented 22 hours ago

• Telepresence works correctly without the VPN connected.
• Telepresence fails to connect when the VPN is active.

Error message when running telepresence connect with VPN:

telepresence connect: error: connector.Connect: failed to connect to root daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded

I’ve tried upgrading our Helm config to set allowConflictingSubnets, but the attempt results in an error:

telepresence helm upgrade: error: stuck in pending state

thallgren commented 22 hours ago

@gabrielpulga there can be multiple reasons for this error, and they differ depending on what version you are using and what platform you're on. I'm reluctant to reopen this issue until I know that this is indeed the same problem. I suggest you create a new ticket and provide logs etc. there. Chances are that what you're experiencing is a different issue.