telepresenceio / telepresence

Local development against a remote Kubernetes or OpenShift cluster
https://www.telepresence.io

Telepresence in codespace: `telepresence connect: error: connector.Connect: rot daemon is not running` #3722

Closed billytrend-cohere closed 4 days ago

billytrend-cohere commented 1 week ago

Describe the bug

I'm trying to run Telepresence in a codespace, but I get an error. The connector.log contains no clues:

2024-11-12 07:12:05.1197 info    Starting socket listener for /tmp/telepresence-connector.socket
2024-11-12 07:12:05.1199 info    ---
2024-11-12 07:12:05.1199 info    Telepresence Connector v2.19.1 (api v3) starting...
2024-11-12 07:12:05.1199 info    PID is 5590
2024-11-12 07:12:05.1199 info    
2024-11-12 07:12:05.1241 info    connector/server-grpc : gRPC server started
2024-11-12 07:12:05.1524 info    connector/session : -- Starting new session
2024-11-12 07:12:05.1524 info    connector/session : Connecting to k8s cluster...
2024-11-12 07:12:05.9750 info    connector/session : Server version v1.30.5-gke.1014003
2024-11-12 07:12:05.9751 info    connector/session : Context: gke_cohere-staging_us-central1_staging
2024-11-12 07:12:05.9751 info    connector/session : Server: https://34.134.248.136
2024-11-12 07:12:08.5994 info    connector/session : Will look for traffic manager in namespace ambassador
2024-11-12 07:12:08.5994 info    connector/session : Connected to context gke_cohere-staging_us-central1_staging, namespace default (https://34.134.248.136)
2024-11-12 07:12:08.7261 info    connector/session : Connecting to traffic manager...
2024-11-12 07:12:09.8815 info    connector/session : Connected to Traffic Manager v2.19.6
2024-11-12 07:12:10.0890 info    connector/session : Configuration reloaded
2024-11-12 07:12:11.1780 info    connector/session:shutdown_logger : shutting down (gracefully)...
2024-11-12 07:17:01.8274 info    connector/session : -- Starting new session
2024-11-12 07:17:01.8281 info    connector/session : Connecting to k8s cluster...
2024-11-12 07:17:02.1596 info    connector/session : Server version v1.30.5-gke.1014003
2024-11-12 07:17:02.1596 info    connector/session : Context: gke_cohere-staging_us-central1_staging
2024-11-12 07:17:02.1596 info    connector/session : Server: https://34.134.248.136
2024-11-12 07:17:03.4734 info    connector/session : Will look for traffic manager in namespace ambassador
2024-11-12 07:17:03.4735 info    connector/session : Connected to context gke_cohere-staging_us-central1_staging, namespace default (https://34.134.248.136)
2024-11-12 07:17:03.5985 info    connector/session : Connecting to traffic manager...
2024-11-12 07:17:04.9262 info    connector/session : Connected to Traffic Manager v2.19.6
2024-11-12 07:17:05.1346 info    connector/session : Configuration reloaded
2024-11-12 07:17:05.1348 info    connector/session:shutdown_logger : shutting down (gracefully)...

To Reproduce

When I run

telepresence connect --request-timeout=30s --kubeconfig ~/.kube/config --context xxx

I get telepresence connect: error: connector.Connect: rot daemon is not running (rot should be root)

Expected behavior

When I run this locally, it connects to the cluster for the given context.

Versions (please complete the following information):

thallgren commented 1 week ago

Telepresence requires NET_ADMIN capability and access to the /dev/net/tun device of the host that runs the codespace container, or it will fail to configure its virtual network device. Have you been able to configure that?
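
For anyone wanting to check both prerequisites from inside the container, here is a minimal Go sketch. It is not part of Telepresence, and the helper names `parseCapEff` and `hasCap` are made up for illustration. It assumes a Linux container, reads the effective capability mask from /proc/self/status, tests bit 12 (CAP_NET_ADMIN), and then tries to open /dev/net/tun.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capNetAdmin is the bit index of CAP_NET_ADMIN in the Linux capability masks.
const capNetAdmin = 12

// parseCapEff extracts the effective capability mask from the text of
// /proc/<pid>/status, where the "CapEff:" line holds a hexadecimal bitmask.
// (Hypothetical helper, for illustration only.)
func parseCapEff(status string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "CapEff:") {
			hex := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
			return strconv.ParseUint(hex, 16, 64)
		}
	}
	return 0, fmt.Errorf("no CapEff line found")
}

// hasCap reports whether the given capability bit is set in mask.
func hasCap(mask uint64, bit uint) bool {
	return mask&(uint64(1)<<bit) != 0
}

func main() {
	status, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	mask, err := parseCapEff(string(status))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("effective CAP_NET_ADMIN:", hasCap(mask, capNetAdmin))

	// Opening the device read-write is the first step for any TUN-based tool.
	f, err := os.OpenFile("/dev/net/tun", os.O_RDWR, 0)
	if err != nil {
		fmt.Println("/dev/net/tun not usable:", err)
		return
	}
	f.Close()
	fmt.Println("/dev/net/tun can be opened")
}
```

Note that the check looks at the effective set, not the bounding set. The bounding set only limits what a process may acquire, so a non-root user can see cap_net_admin listed there and still not hold it.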

billytrend-cohere commented 1 week ago

Hi @thallgren, thanks for your reply! I have taken a look and am still hitting the same issue. Are there some logs maybe that would reveal the underlying issue?

I have modified the codespaces with the following:

    "build": { "dockerfile": "../Dockerfile" },
    "runArgs": [
        "--privileged",
        "--cap-add=NET_ADMIN",
        "--device=/dev/net/tun"
    ],

I also tried sudo setcap cap_net_admin+ep /usr/local/bin/telepresence

capsh --print appears to show cap_net_admin:

WARNING: libcap needs an update (cap=40 should have a name).
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,38,39,40
Ambient set =
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=1000(codespace) euid=1000(codespace)
gid=1000(codespace)
groups=106(ssh),107(docker),989(pipx),990(python),991(oryx),992(golang),993(sdkman),994(rvm),995(php),996(conda),997(nvs),998(nvm),999(hugo),1000(codespace)
Guessed mode: UNCERTAIN (0)

thallgren commented 1 week ago

You'll find the telepresence logs under ~/.cache/telepresence/logs. The ones of interest here are connector.log and daemon.log. They will become even more interesting if you turn on debugging by adding the following to ~/.config/telepresence/config.yml, then do telepresence quit -s and then retry the telepresence connect.

logLevels:
  userDaemon: debug
  rootDaemon: debug

thallgren commented 1 week ago

Also, if possible, please try version 2.20.2. It contains several bugfixes that might affect the connect behavior.

billytrend-cohere commented 1 week ago

With those settings, I do not see a daemon.log. cli.log is empty and connector.log shows the following:

2024-11-13 22:44:13.6774 debug   connector/server-grpc/conn=2/Quit-2 : called
2024-11-13 22:44:13.6775 debug   connector/session : goroutine "/connector/session" exited
2024-11-13 22:44:13.6775 debug   connector/server-grpc/conn=2/Quit-2 : returned
2024-11-13 22:44:13.6775 info    connector:shutdown_logger : shutting down (gracefully)...
2024-11-13 22:44:13.6775 debug   connector/background-metriton : goroutine "/connector/background-metriton" exited
2024-11-13 22:44:13.6775 debug   connector/service : goroutine "/connector/service" exited
2024-11-13 22:44:13.6779 debug   connector/server-grpc : gRPC server ended
2024-11-13 22:44:13.6780 debug   connector/server-grpc : goroutine "/connector/server-grpc" exited
2024-11-13 22:44:13.6945 debug   connector/config-reload : goroutine "/connector/config-reload" exited
2024-11-13 22:44:19.1720 info    Starting socket listener for /tmp/telepresence-connector.socket
2024-11-13 22:44:19.1721 debug   Listener opened on /tmp/telepresence-connector.socket
2024-11-13 22:44:19.1722 info    ---
2024-11-13 22:44:19.1722 info    Telepresence Connector v2.19.1 (api v3) starting...
2024-11-13 22:44:19.1722 info    PID is 8233
2024-11-13 22:44:19.1722 info    
2024-11-13 22:44:19.1728 info    connector/server-grpc : gRPC server started
2024-11-13 22:44:19.2051 debug   connector/server-grpc/conn=1/Connect-1 : called
2024-11-13 22:44:19.2065 debug   connector/session : using namespace "default"
2024-11-13 22:44:19.2066 info    connector/session : -- Starting new session
2024-11-13 22:44:19.2067 info    connector/session : Connecting to k8s cluster...
2024-11-13 22:44:19.4055 info    connector/session : Server version v1.30.5-gke.1014003
2024-11-13 22:44:19.4055 info    connector/session : Context: gke_cohere-staging_us-central1_staging
2024-11-13 22:44:19.4055 info    connector/session : Server: https://34.134.248.136
2024-11-13 22:44:21.1411 info    connector/session : Will look for traffic manager in namespace ambassador
2024-11-13 22:44:21.1411 info    connector/session : Connected to context gke_cohere-staging_us-central1_staging, namespace default (https://34.134.248.136)
2024-11-13 22:44:21.2272 info    connector/session : Connecting to traffic manager...
2024-11-13 22:44:21.2273 debug   connector/session : checking that traffic-manager exists
2024-11-13 22:44:21.2854 debug   connector/session : creating port-forward
2024-11-13 22:44:21.4406 debug   connector/session : k8sPortForwardDialer.dial(ctx, Pod./traffic-manager-68548bff4b-4976v.ambassador, 8081)
2024-11-13 22:44:21.4407 debug   connector/session : k8sPortForwardDialer.spdyDial(ctx, Pod./traffic-manager-68548bff4b-4976v.ambassador)
2024-11-13 22:44:21.9049 info    connector/session : Connected to Traffic Manager v2.19.6
2024-11-13 22:44:21.9601 debug   connector/session : traffic-manager port-forward established, client was already known to the traffic-manager as "codespace@codespaces-a36db9"
2024-11-13 22:44:22.0170 debug   connector/session : Applying client configuration from cluster
2024-11-13 22:44:22.0170 debug   connector/session : cluster:
2024-11-13 22:44:22.0170 debug   connector/session :     mappedNamespaces:
2024-11-13 22:44:22.0170 debug   connector/session :         - ambassador
2024-11-13 22:44:22.0170 debug   connector/session :         - blobheart
2024-11-13 22:44:22.0171 debug   connector/session :         - bh-finetuning
2024-11-13 22:44:22.0171 debug   connector/session :         - bh-private-models
2024-11-13 22:44:22.0171 debug   connector/session :         - bh-private-models-evaluation
2024-11-13 22:44:22.0172 info    connector/session : Configuration reloaded
2024-11-13 22:44:22.8895 debug   connector/server-grpc/conn=1/Connect-1 : returned
2024-11-13 22:44:22.8895 info    connector/session:shutdown_logger : shutting down (gracefully)...
2024-11-13 22:44:22.8896 debug   connector/session/info-kicker-gke_cohere-staging_us-central1_staging-default-cn : Deleting daemon info gke_cohere-staging_us-central1_staging-default-cn.json because context was cancelled
2024-11-13 22:44:22.8896 debug   connector/session/info-watcher-gke_cohere-staging_us-central1_staging-default-cn : goroutine "/connector/session/info-watcher-gke_cohere-staging_us-central1_staging-default-cn" exited
2024-11-13 22:44:22.8899 debug   connector/session/info-kicker-gke_cohere-staging_us-central1_staging-default-cn : goroutine "/connector/session/info-kicker-gke_cohere-staging_us-central1_staging-default-cn" exited

billytrend-cohere commented 1 week ago

Is it possible that the root daemon is never run? I guess it's odd that we don't see a log for it. Also, I see that it maybe does not get run in "docker mode": https://github.com/telepresenceio/telepresence/blob/fce335576845968028964a807c559f1682279f49/pkg/client/cli/connect/daemon.go#L60

Also, telepresence connect only logs "Launching Telepresence User Daemon". Should it also log that it is launching the root daemon?

Can I explicitly run the root daemon?

tysm for your support with this

billytrend-cohere commented 1 week ago

Actually, I'm possibly misunderstanding docker mode. That seems to be activated by a CLI flag.

It is still weird though that we never see this log during connect:

https://github.com/telepresenceio/telepresence/blob/fce335576845968028964a807c559f1682279f49/pkg/client/cli/connect/daemon.go#L25

thallgren commented 1 week ago

Sorry. I should have mentioned that in this configuration, the root daemon will run embedded in the connector process, so there will be no daemon.log.

billytrend-cohere commented 1 week ago

I just tried running 2.20.2 and I still see rot daemon is not running unfortunately.

billytrend-cohere commented 1 week ago

I built a version with this removed:

[image: screenshot of the removed code]

And I now see the "Launching Telepresence Root Daemon" message:

Telepresence Daemons quitting...done
Launching Telepresence User Daemon
Launching Telepresence Root Daemon
telepresence connect: error: connector.Connect: subnet 10.0.0.0/17 overlaps with existing route "10.0.0.0/16 via 10.0.0.70 dev eth0". Please see https://www.getambassador.io/docs/telepresence/latest/reference/vpn for more information

This feels like progress? Or have I just broken it?

billytrend-cohere commented 1 week ago

Hmm, I added --allow-conflicting-subnets 10.0.0.0/16 to the command and it all seems to be up and healthy, but I'm still doing some testing.
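
To see why the two ranges in the error collide, here is a small Go sketch using the standard library's net/netip package. It is only an illustration of CIDR overlap, not how Telepresence performs its check, and the `overlaps` helper is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/netip"
)

// overlaps reports whether two CIDR strings share at least one address.
// (Hypothetical helper, for illustration only.)
func overlaps(a, b string) (bool, error) {
	pa, err := netip.ParsePrefix(a)
	if err != nil {
		return false, err
	}
	pb, err := netip.ParsePrefix(b)
	if err != nil {
		return false, err
	}
	return pa.Overlaps(pb), nil
}

func main() {
	// The cluster subnet and the existing route from the error message.
	ok, err := overlaps("10.0.0.0/17", "10.0.0.0/16")
	fmt.Println(ok, err) // true <nil>
}
```

Here 10.0.0.0/17 is the lower half of 10.0.0.0/16, so every address in the cluster subnet is already covered by the existing route on eth0.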

thallgren commented 1 week ago

This is interesting. Looks like you found the culprit!

As I wrote earlier, the root daemon is not supposed to run in this configuration (i.e. when the user daemon is running in a container); instead, it is embedded in the user daemon. However, that only happens when the user daemon is running as root, which doesn't seem to be the case here, so this code doesn't execute. The user daemon, not receiving the --embed-network flag, then assumes that the root daemon is already running.

So, instead of just removing the !daemon.GetUserClient(rc).Containerized(), you should amend it with a && os.Getuid() == 0, e.g.

    if err == nil && required && !(os.Getuid() == 0 && daemon.GetUserClient(rc).Containerized())
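
The effect of the amended condition can be sketched as a pure function. This is an illustration only, not the code in daemon.go: `shouldLaunchRootDaemon` and its parameters are hypothetical stand-ins for `required`, `os.Getuid()` and `daemon.GetUserClient(rc).Containerized()`, and the `err == nil` part is left out.

```go
package main

import "fmt"

// shouldLaunchRootDaemon mirrors the amended condition: a separate root
// daemon is launched when one is required, unless the user daemon runs as
// root inside a container (where the root daemon is embedded instead).
// (Hypothetical helper, for illustration only.)
func shouldLaunchRootDaemon(required bool, uid int, containerized bool) bool {
	embedded := uid == 0 && containerized
	return required && !embedded
}

func main() {
	fmt.Println(shouldLaunchRootDaemon(true, 1000, true))  // non-root user in a container: true
	fmt.Println(shouldLaunchRootDaemon(true, 0, true))     // root in a container: false
	fmt.Println(shouldLaunchRootDaemon(true, 1000, false)) // non-root user on a host: true
}
```

The first case is the codespace situation from this issue (uid 1000 in a container). With the original condition, which only looked at Containerized(), that case would have evaluated to false and no root daemon would have been started.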
billytrend-cohere commented 1 week ago

Ty for confirming, I have opened a PR!

thallgren commented 1 week ago

@billytrend-cohere I've released a 2.20.3-rc.1 version of Telepresence with the codespaces fixes included. Can you please give it a try and report back?

thallgren commented 4 days ago

Released in version 2.20.3

billytrend-cohere commented 3 days ago

@thallgren Sorry to reopen. I'm seeing behaviour where my URLs are available in the codespace terminal but not in the k3d cluster that I'm running. I do see this warning:

You are using the OSS client v2.20.3-51-gfce335576-YvDDgqkcZvPrPIYWs18S2A to connect to an enterprise traffic manager v2.19.6. Please consider installing an enterprise client from getambassador.io, or use "telepresence helm install" to install an OSS traffic-manager

I wonder if this is causing the problem. Am I able to use the non-OSS CLI with the fix we have merged?

thallgren commented 3 days ago

That fix will probably be included when Ambassador Labs makes their next release, but you should be able to use the OSS client until then, if you can live with the warning.

Not sure what you mean by "urls are available in the codespace terminal", but unless it's the exact same problem that you reported earlier, I'd suggest you create a new ticket.