telepresenceio / telepresence

Local development against a remote Kubernetes or OpenShift cluster
https://www.telepresence.io

When VPN is connected, telepresence can't connect to the cluster #3704

Open kaloyanDEV opened 4 days ago

kaloyanDEV commented 4 days ago

Describe the bug I can't curl -L services using service-name:port, service-name.namespace.svc.cluster.local:port, or the service IP:port.

To Reproduce

telepresence connect
Launching Telepresence User Daemon
Launching Telepresence Root Daemon
telepresence connect: error: connector.Connect: failed to connect to root daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded

2024-10-10 11:53:28.7677 info ---
2024-10-10 11:53:28.7677 info Telepresence daemon v2.20.0 (api v3) starting...
2024-10-10 11:53:28.7677 info PID is 35664
2024-10-10 11:53:28.7677 info
2024-10-10 11:53:28.8832 info daemon/server-grpc : gRPC server started
2024-10-10 11:53:31.8384 info daemon/session : -- Starting new session
2024-10-10 11:53:32.4750 info daemon/session : Connected to OSS Traffic Manager v2.20.0
2024-10-10 11:53:32.4754 info daemon/session : Connected to Manager 2.20.0
2024-10-10 11:53:32.5205 info daemon/session : also-proxy subnets []
2024-10-10 11:53:32.5205 info daemon/session : never-proxy subnets [10.140.180.96/32]
2024-10-10 11:53:32.5205 info daemon/session : allow-conflicting subnets []
2024-10-10 11:53:32.5210 info daemon/session : Configuration reloaded
2024-10-10 11:53:32.5647 info daemon/session/network : also-proxy subnets []
2024-10-10 11:53:32.5647 info daemon/session/network : never-proxy subnets [10.140.180.96/32]
2024-10-10 11:53:32.5647 info daemon/session/network : allow-conflicting subnets []
2024-10-10 11:53:32.9859 info daemon/session/agentPods : Connected to OSS Traffic Agent v2.20.0
2024-10-10 11:53:33.0648 warning daemon/session/network : Manager IP 10.130.6.108 is connectable but not a traffic-manager instance (rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded). Will proxy pods, but this may interfere with your VPN routes.
2024-10-10 11:53:33.5654 info daemon/session/network : Adding service subnet 172.30.0.0/16
2024-10-10 11:53:33.5654 info daemon/session/network : Adding pod subnet 10.128.0.0/20
2024-10-10 11:53:33.5654 info daemon/session/network : Adding pod subnet 10.129.0.0/21
2024-10-10 11:53:33.5654 info daemon/session/network : Adding pod subnet 10.130.4.0/22
2024-10-10 11:53:33.5654 info daemon/session/network : Adding pod subnet 10.131.0.0/21
2024-10-10 11:53:33.5731 info daemon/session/network : Creating interface tel0
2024-10-10 11:53:33.6442 info stdlog : Using existing driver 0.14
2024-10-10 11:53:33.6488 info stdlog : Creating adapter
2024-10-10 11:53:34.0791 info daemon/session/network : Setting cluster DNS to 10.130.6.108
2024-10-10 11:53:34.0791 info daemon/session/network : Setting cluster domain to "cluster.local."
2024-10-10 11:53:34.0791 info daemon/session/network : Dropping never-proxy "10.140.180.96/32" because it is not routed
2024-10-10 11:53:34.0819 info daemon/session/network : Starting Endpoint
2024-10-10 11:53:34.1774 info daemon/metriton : scout report "update_routes" failed: Post "https://metriton.datawire.io/scout": dial tcp: lookup metriton.datawire.io: no such host
2024-10-10 11:53:36.5188 error daemon/session/network : failed to retrieve route for subnet 172.30.0.0/16:
2024-10-10 11:53:38.7612 info daemon/session/dns : Using fallback DNS server: 10.58.194.16
2024-10-10 11:53:58.8399 error daemon/session/dns/SearchPaths : DNS doesn't seem to work properly

telepresence quit -s
telepresence connect --proxy-via service=core
Launching Telepresence User Daemon
Launching Telepresence Root Daemon
telepresence connect: error: connector.Connect: failed to connect to root daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded

2024-10-10 12:01:37.9361 info Telepresence daemon v2.20.0 (api v3) starting...
2024-10-10 12:01:37.9361 info PID is 33036
2024-10-10 12:01:37.9361 info
2024-10-10 12:01:38.0415 info daemon/server-grpc : gRPC server started
2024-10-10 12:01:41.3524 info daemon/session : -- Starting new session
2024-10-10 12:01:42.0104 info daemon/session : Connected to OSS Traffic Manager v2.20.0
2024-10-10 12:01:42.0104 info daemon/session : Connected to Manager 2.20.0
2024-10-10 12:01:42.0587 info daemon/session : also-proxy subnets []
2024-10-10 12:01:42.0587 info daemon/session : never-proxy subnets [10.140.180.96/32]
2024-10-10 12:01:42.0587 info daemon/session : allow-conflicting subnets []
2024-10-10 12:01:42.0587 info daemon/session : Configuration reloaded
2024-10-10 12:01:42.1563 info daemon/session/network : also-proxy subnets []
2024-10-10 12:01:42.1563 info daemon/session/network : never-proxy subnets [10.140.180.96/32]
2024-10-10 12:01:42.1563 info daemon/session/network : allow-conflicting subnets []
2024-10-10 12:01:42.5549 info daemon/session/agentPods : Connected to OSS Traffic Agent v2.20.0
2024-10-10 12:01:42.6566 warning daemon/session/network : Manager IP 10.130.6.108 is connectable but not a traffic-manager instance (rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded). Will proxy pods, but this may interfere with your VPN routes.
2024-10-10 12:01:43.1574 info daemon/session/network : Will not proxy service subnet 172.30.0.0/16, because it is covered by --proxy-via service=core
2024-10-10 12:01:43.1575 info daemon/session/network : Adding pod subnet 10.128.0.0/20
2024-10-10 12:01:43.1575 info daemon/session/network : Adding pod subnet 10.129.0.0/21
2024-10-10 12:01:43.1575 info daemon/session/network : Adding pod subnet 10.130.4.0/22
2024-10-10 12:01:43.1575 info daemon/session/network : Adding pod subnet 10.131.0.0/21
2024-10-10 12:01:43.1666 info daemon/session/network : Creating interface tel0
2024-10-10 12:01:43.2470 info stdlog : Using existing driver 0.14
2024-10-10 12:01:43.2538 info stdlog : Creating adapter
2024-10-10 12:01:43.8814 info daemon/session/network : Setting cluster DNS to 10.130.6.108
2024-10-10 12:01:43.8814 info daemon/session/network : Setting cluster domain to "cluster.local."
2024-10-10 12:01:43.8814 info daemon/session/network : Dropping never-proxy "10.140.180.96/32" because it is not routed
2024-10-10 12:01:43.8846 info daemon/session/network : Starting Endpoint
2024-10-10 12:01:44.0589 info daemon/metriton : scout report "update_routes" failed: Post "https://metriton.datawire.io/scout": dial tcp: lookup metriton.datawire.io: no such host
2024-10-10 12:01:46.6750 error daemon/session/network : failed to retrieve route for subnet 10.128.0.0/20:
2024-10-10 12:01:49.1789 info daemon/session/dns : Using fallback DNS server: 10.58.194.16
2024-10-10 12:02:09.2079 error daemon/session/dns/SearchPaths : DNS doesn't seem to work properly
2024-10-10 12:02:19.1656 info daemon/session : -- Session ended
2024-10-10 12:02:19.1656 info daemon/session:shutdown_logger : shutting down (gracefully)...
2024-10-10 12:02:19.1656 info daemon/session/dns/Server:shutdown_logger : shutting down (gracefully)...
2024-10-10 12:02:19.1656 info daemon/session/dns:shutdown_logger : shutting down (gracefully)...
2024-10-10 12:02:19.6747 info daemon/metriton : scout report "incluster_dns_queries" failed: Post "https://metriton.datawire.io/scout": dial tcp: lookup metriton.datawire.io: no such host
2024-10-10 12:02:20.2108 error daemon/session/agentPods : goroutine "/daemon/session/agentPods" exited with error: rpc error: code = Canceled desc = context canceled
2024-10-10 12:02:20.2127 error daemon/session : proxy-via agent in core failed: context deadline exceeded
2024-10-10 12:02:20.2127 info daemon/session : Configuration reloaded
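
When scanning daemon logs like the two above, it can help to pull out only the warning and error entries. The following is a minimal Python sketch, not a Telepresence tool: the helper name `problem_entries` is hypothetical, and the timestamp and level layout is assumed from the log excerpts in this report. It splits on timestamps, so it works whether entries arrive one per line or run together.

```python
import re

# Zero-width match at the start of every entry, e.g. "2024-10-10 11:53:36.5188 ".
# The four-digit fraction is an assumption taken from the logs in this issue.
ENTRY_START = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{4} )")
ENTRY = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{4}) (?P<level>\w+)\s*(?P<msg>.*)",
    re.DOTALL,
)


def problem_entries(log_text, levels=("warning", "error")):
    """Return (timestamp, level, message) tuples for entries at the given levels."""
    found = []
    for chunk in ENTRY_START.split(log_text):
        match = ENTRY.match(chunk.strip())
        if match and match.group("level") in levels:
            found.append(
                (match.group("ts"), match.group("level"), match.group("msg").strip())
            )
    return found


# Sample entries copied from the first log in this report.
sample = (
    "2024-10-10 11:53:34.0819 info daemon/session/network : Starting Endpoint "
    "2024-10-10 11:53:36.5188 error daemon/session/network : "
    "failed to retrieve route for subnet 172.30.0.0/16: "
    "2024-10-10 11:53:38.7612 info daemon/session/dns : Using fallback DNS server: 10.58.194.16\n"
    "2024-10-10 11:53:58.8399 error daemon/session/dns/SearchPaths : "
    "DNS doesn't seem to work properly"
)

for ts, level, msg in problem_entries(sample):
    print(ts, level, msg)
```

For the sample, this leaves the route-retrieval failure and the DNS failure, which are the two errors that recur in every connection attempt reported here.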

telepresence status
OSS User Daemon: Running
  Version           : 2.20.0
  Executable        : C:\telepresence\telepresence.exe
  Install ID        : 19a84e89-5960-47bc-be80-e6c3e31bd939
  Status            : Not connected
  Kubernetes server :
  Kubernetes context:
  Namespace         :
  Manager namespace :
  Intercepts        : 0 total
OSS Root Daemon: Running
  Version: v2.20.0
Traffic Manager: Not connected

Expected behavior To be able to connect

Versions (please complete the following information):

VPN-related bugs:

Additional context I am running Ubuntu as a VM on the same Windows host machine (again connected to the VPN). There I can connect with Telepresence. The issue there is that I need to use the FQDN of a service to curl it.

thallgren commented 4 days ago

@kaloyanDEV can you please try and enable debug logging, and then after you've done that, try the above connects again?

Daemon loglevels are configured in a file named config.yml in directory %APPDATA%\telepresence. Create the file and add the following content:

logLevels:
  userDaemon: debug
  rootDaemon: debug

This will make the logging more verbose and hopefully give more hints about where the source of the problem is.
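
If creating the file by hand is inconvenient, the step above could be scripted. This is only a sketch: the helper name `write_debug_config` is hypothetical, and the path layout and file content are taken from the instructions in this comment. It refuses to overwrite an existing config.yml, since that file may already hold other settings.

```python
from pathlib import Path

# Content copied from the instructions above.
DEBUG_CONFIG = "logLevels:\n  userDaemon: debug\n  rootDaemon: debug\n"


def write_debug_config(appdata_dir):
    """Create <appdata_dir>/telepresence/config.yml with debug log levels.

    Raises FileExistsError rather than overwriting an existing config,
    because an existing file may contain settings worth keeping.
    """
    config_path = Path(appdata_dir) / "telepresence" / "config.yml"
    if config_path.exists():
        raise FileExistsError(f"{config_path} already exists; edit it by hand")
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(DEBUG_CONFIG, encoding="utf-8")
    return config_path
```

On Windows it would be called as `write_debug_config(os.environ["APPDATA"])` after `import os`; restarting the daemons afterwards (for example with `telepresence quit -s`, as used earlier in this report) is presumably needed for the new levels to take effect.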

kaloyanDEV commented 4 days ago
telepresence connect
Launching Telepresence User Daemon
Launching Telepresence Root Daemon
telepresence connect: error: connector.Connect: failed to connect to root daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2024-10-11 09:48:50.9809 info    Telepresence daemon v2.20.0 (api v3) starting...
2024-10-11 09:48:50.9809 info    PID is 35908
2024-10-11 09:48:50.9809 info    
2024-10-11 09:48:50.9863 debug   Listener opened
2024-10-11 09:48:51.4980 info    daemon/server-grpc : gRPC server started
2024-10-11 09:48:55.9704 debug   daemon/server-grpc/conn=4 : Received gRPC Connect
2024-10-11 09:48:55.9704 info    daemon/session : -- Starting new session
2024-10-11 09:48:57.2177 debug   daemon/session : k8sPortForwardDialer.dial(ctx, Pod./traffic-manager-6bd7787469-6ftl8.ambassador, 8081)
2024-10-11 09:48:57.2177 debug   daemon/session : k8sPortForwardDialer.spdyDial(ctx, Pod./traffic-manager-6bd7787469-6ftl8.ambassador)
2024-10-11 09:48:57.6469 info    daemon/session : Connected to OSS Traffic Manager v2.20.0
2024-10-11 09:48:57.6469 info    daemon/session : Connected to Manager 2.20.0
2024-10-11 09:48:57.7069 debug   daemon/session : Creating session with id session_id:"3874e1a9-bcdc-430b-a15e-0e15c3e52aad" cluster_id:"424f58c5-5f9a-45d3-8e9f-2aa019f7f447" install_id:"19a84e89-5960-47bc-be80-e6c3e31bd939"
2024-10-11 09:48:57.7155 info    daemon/session : also-proxy subnets []
2024-10-11 09:48:57.7155 info    daemon/session : never-proxy subnets [10.140.180.96/32]
2024-10-11 09:48:57.7155 info    daemon/session : allow-conflicting subnets []
2024-10-11 09:48:57.7923 info    daemon/session : Configuration reloaded
2024-10-11 09:48:57.8337 debug   daemon/session : Returning session from new session session_id:"3874e1a9-bcdc-430b-a15e-0e15c3e52aad" cluster_id:"424f58c5-5f9a-45d3-8e9f-2aa019f7f447" install_id:"19a84e89-5960-47bc-be80-e6c3e31bd939"
2024-10-11 09:48:57.9035 info    daemon/session/network : also-proxy subnets []
2024-10-11 09:48:57.9035 info    daemon/session/network : never-proxy subnets [10.140.180.96/32]
2024-10-11 09:48:57.9035 info    daemon/session/network : allow-conflicting subnets []
2024-10-11 09:48:57.9035 debug   daemon/session/network : Performing pod connectivity check on IP 10.130.6.108 with timeout 500ms
2024-10-11 09:48:57.9352 debug   daemon/session/agentPods : WatchAgentPods starting
2024-10-11 09:48:58.1138 debug   daemon/session : k8sPortForwardDialer.dial(ctx, Pod./pds-849bcb89b6-hfq8n.dev1-multitenant, 32905)
2024-10-11 09:48:58.1138 debug   daemon/session : k8sPortForwardDialer.spdyDial(ctx, Pod./pds-849bcb89b6-hfq8n.dev1-multitenant)
2024-10-11 09:48:58.4126 warning daemon/session/network : Manager IP 10.130.6.108 is connectable but not a traffic-manager instance (rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded). Will proxy pods, but this may interfere with your VPN routes.
2024-10-11 09:48:58.4126 debug   daemon/session/network : Performing service connectivity check on https://172.30.102.115:443/healthz with Host agent-injector.ambassador and timeout 500ms
2024-10-11 09:48:58.5459 info    daemon/session/agentPods : Connected to OSS Traffic Agent v2.20.0
2024-10-11 09:48:58.9129 debug   daemon/session/network : Will proxy services (Get "https://172.30.102.115:443/healthz": context deadline exceeded)
2024-10-11 09:48:58.9129 debug   daemon/session/network : WatchClusterInfo update
2024-10-11 09:48:58.9129 info    daemon/session/network : Adding service subnet 172.30.0.0/16
2024-10-11 09:48:58.9129 info    daemon/session/network : Adding pod subnet 10.128.0.0/20
2024-10-11 09:48:58.9129 info    daemon/session/network : Adding pod subnet 10.129.0.0/21
2024-10-11 09:48:58.9129 info    daemon/session/network : Adding pod subnet 10.130.4.0/22
2024-10-11 09:48:58.9129 info    daemon/session/network : Adding pod subnet 10.131.0.0/21
2024-10-11 09:48:58.9281 info    daemon/session/network : Creating interface tel0
2024-10-11 09:48:59.0534 info    stdlog : Using existing driver 0.14
2024-10-11 09:48:59.0597 info    stdlog : Creating adapter
2024-10-11 09:48:59.8880 info    daemon/session/network : Starting Endpoint
2024-10-11 09:48:59.8891 info    daemon/session/network : Setting cluster DNS to 10.130.6.108
2024-10-11 09:48:59.8891 info    daemon/session/network : Setting cluster domain to "cluster.local."
2024-10-11 09:48:59.8891 info    daemon/session/network : Dropping never-proxy "10.140.180.96/32" because it is not routed
2024-10-11 09:48:59.9327 info    daemon/metriton : scout report "update_routes" failed: Post "https://metriton.datawire.io/scout": dial tcp: lookup metriton.datawire.io: no such host
2024-10-11 09:49:02.6228 error   daemon/session/network : failed to retrieve route for subnet 172.30.0.0/16: <nil>
2024-10-11 09:49:05.2542 info    daemon/session/dns : Using fallback DNS server: 10.58.194.16
2024-10-11 09:49:05.2542 debug   daemon/session/dns/Server : SetDNS server: 10.130.6.108, searchList: [tel2-search], domain: "cluster.local."
2024-10-11 09:49:05.2639 debug   daemon/session/dns/Server : SetDNS done
2024-10-11 09:49:05.2639 debug   daemon/session/dns/SearchPaths : Performing initial recursion check with tel2-recursion-check.tel2-search
thallgren commented 4 days ago

Everything seems normal in that log. Is that the last thing that gets printed?

kaloyanDEV commented 3 days ago
2024-10-11 09:49:25.3789 error   daemon/session/dns/SearchPaths : DNS doesn't seem to work properly
2024-10-11 09:49:25.3789 debug   daemon/session/dns/SearchPaths : Recursion check finished
2024-10-11 09:49:25.4256 debug   daemon/session/dns/SearchPaths : SetDNS server: 10.130.6.108, searchList: [tel2-search dev1-multitenant], domain: "cluster.local."
2024-10-11 09:49:25.4457 debug   daemon/session/dns/SearchPaths : SetDNS done
thallgren commented 3 days ago

I have a hard time understanding why the recursion check takes 20 seconds to complete. It ought to be between 5 and 6 seconds (or much quicker if it succeeds). Does your machine have very limited resources somehow? What type of hardware is used here and what version of Windows 11?
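
For reference, the 20-second figure can be read off the debug log: "Performing initial recursion check" is stamped 09:49:05.2639 and "Recursion check finished" is stamped 09:49:25.3789. A small Python sketch of that arithmetic, where the helper name `elapsed_seconds` is hypothetical and the timestamp format is assumed from the log lines:

```python
from datetime import datetime

# Timestamp layout assumed from the daemon log lines in this thread.
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two log timestamps given as strings."""
    delta = datetime.strptime(end, TS_FORMAT) - datetime.strptime(start, TS_FORMAT)
    return delta.total_seconds()


started = "2024-10-11 09:49:05.2639"   # Performing initial recursion check
finished = "2024-10-11 09:49:25.3789"  # Recursion check finished

print(f"{elapsed_seconds(started, finished):.3f} s")  # prints "20.115 s"
```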

kaloyanDEV commented 3 days ago
Processor   Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz   2.71 GHz
Installed RAM   32.0 GB (31.6 GB usable)
System type 64-bit operating system, x64-based processor

I tried connecting from a network that can reach the cluster without the VPN, and it worked. It could be some sort of overlap. I tried to read and understand https://www.getambassador.io/docs/telepresence/latest/reference/vpn, but I am still clueless.

WITHOUT VPN

Wireless LAN adapter Wi-Fi:

   Connection-specific DNS Suffix  . : 
   Link-local IPv6 Address . . . . . : 
   IPv4 Address. . . . . . . . . . . : 10.10.206.238
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.10.206.1

Unknown adapter tel0:

   Connection-specific DNS Suffix  . : cluster.local
   IPv4 Address. . . . . . . . . . . : 10.128.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   IPv4 Address. . . . . . . . . . . : 10.129.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.248.0
   IPv4 Address. . . . . . . . . . . : 10.130.4.0
   Subnet Mask . . . . . . . . . . . : 255.255.252.0
   IPv4 Address. . . . . . . . . . . : 10.131.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.248.0
   IPv4 Address. . . . . . . . . . . : 172.30.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.0.0
   Default Gateway . . . . . . . . . :

WITH VPN

(Cisco VIF)
Ethernet adapter Ethernet 2:

   Connection-specific DNS Suffix  . : 
   Link-local IPv6 Address . . . . . : fe80::205:9aff:fe3c:7a00%3
   Link-local IPv6 Address . . . . . : fe80::3ca6:5ed5:4b85:8cbd%3
   IPv4 Address. . . . . . . . . . . : 10.142.167.29
   Subnet Mask . . . . . . . . . . . : 255.255.192.0
   Default Gateway . . . . . . . . . : ::
                                       10.142.128.1

Unknown adapter tel0:

   Connection-specific DNS Suffix  . : cluster.local
   IPv4 Address. . . . . . . . . . . : 10.128.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   IPv4 Address. . . . . . . . . . . : 10.129.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.248.0
   IPv4 Address. . . . . . . . . . . : 10.130.4.0
   Subnet Mask . . . . . . . . . . . : 255.255.252.0
   IPv4 Address. . . . . . . . . . . : 10.131.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.248.0
   IPv4 Address. . . . . . . . . . . : 172.30.0.0
   Subnet Mask . . . . . . . . . . . : 255.255.0.0
   Default Gateway . . . . . . . . . :                                       
thallgren commented 2 days ago

I don't see any subnet overlap between the Cisco VPN and Telepresence, but it's likely that Cisco installs a DNS in a way that Telepresence is unable to override. Do you have any information about how they configure VPN? Any commands that can tell you what's going on?
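
The overlap check can be reproduced from the ipconfig output above with Python's standard `ipaddress` module. This is a sketch under one stated assumption: it compares only the subnet of the Cisco adapter itself with the subnets assigned to tel0, and does not account for any additional routes the VPN client may install.

```python
import ipaddress

# Cisco adapter address and mask, copied from the "WITH VPN" ipconfig output.
vpn_network = ipaddress.ip_interface("10.142.167.29/255.255.192.0").network

# Subnets assigned to the tel0 adapter, copied from the same output.
tel0_networks = [
    ipaddress.ip_network("10.128.0.0/255.255.240.0"),
    ipaddress.ip_network("10.129.0.0/255.255.248.0"),
    ipaddress.ip_network("10.130.4.0/255.255.252.0"),
    ipaddress.ip_network("10.131.0.0/255.255.248.0"),
    ipaddress.ip_network("172.30.0.0/255.255.0.0"),
]

# Keep every Telepresence subnet that shares addresses with the VPN subnet.
overlapping = [str(net) for net in tel0_networks if net.overlaps(vpn_network)]

print("VPN subnet:", vpn_network)
print("Overlapping Telepresence subnets:", overlapping)
```

The VPN adapter resolves to 10.142.128.0/18, and the overlap list comes back empty, which matches the observation that the subnets themselves do not collide.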