RobertFischer closed this issue 8 months ago.
Weirdly, the cloud says that I am connected to the [default]
namespace. Or is that the context?
From the looks of it, your k3s cluster is fully exposed to your workstation, i.e. you can already access all services and pods directly. Telepresence therefore tries to create a very small subnet that will be responsible for DNS only, and for some reason it's not able to and gets:
failed to add subnet 192.168.0.1/30: netlink.RouteAdd: invalid argument
which means that the DNS to the cluster won't work as expected. It would be interesting to try and reproduce. Not sure why that is happening. I assume that you're on a Linux box? What type of Linux is it? And how do you install k3s? Any specific configs involved?
Ubuntu Linux. I posted the uname -a in the original post. I did the install through the K3s install script, so nothing fancy there. There's no fancy DNS set-up on the host and I am not sure how the services would already be exposed...we certainly aren't doing that intentionally.
Is there something I can check to help debug this issue?
I posted the uname -a in the original post.
You sure did. Sorry.
I found that the `failed to add subnet` message was a red herring; it's due to a bug causing the route to be added twice, and it is harmless.
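If anyone wants to confirm the duplicate-route theory on their own machine, standard iproute2 tooling shows it (the tel0 device name is the usual Telepresence TUN interface on Linux; adjust if yours differs):
$ ip route show | grep '192.168.0'   # the /30 subnet from the error message
$ ip route show dev tel0             # routes attached to the Telepresence TUN device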
I installed k3s on my workstation (Fedora-38). It turns the workstation itself into a node (a `kubectl get node -o wide` reveals that the node has the same IP as the workstation on my intranet). It also installs rules into my iptables (for flannel) that give me direct access to service IPs. If I create a service in the cluster, I can then curl its IP (from `kubectl get svc -o wide`) and get a response, without having telepresence connected. This is why Telepresence considers the cluster to be connected. But as I mentioned, this is not the problem.
I'm able to curl the service using its name through the DNS server that Telepresence installs when it is connected. I would assume that you are too? But I can also intercept and successfully curl it again (this time reaching my intercept handler) which means that this is very likely a problem with the workload that you're intercepting.
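Spelled out as commands (the service name and namespace here are placeholders, not from this thread):
$ kubectl get svc <name> -n <namespace> -o wide   # note the CLUSTER-IP
$ curl -I http://<cluster-ip>                     # reachable even without telepresence, via the flannel rules
$ telepresence connect
$ curl -I http://<name>.<namespace>               # the name now resolves through Telepresence's DNS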
Things to try out:
1. Use a symbolic name for the `targetPort` in the service instead of using a number. This will remove the need for Telepresence to inject an init-container (a sketch of what this looks like follows this list).
2. Check the logs of the intercepted pod's containers, the `traffic-agent` container especially.
3. In the `kubectl get events` list during intercept, do you see anything else besides that the container is terminated (which is normal, since the pod restarts when the agent is injected)?
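For illustration, a symbolic `targetPort` wires the service to a named container port; a minimal sketch (the names `api` and `http` are illustrative, not taken from this thread):
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: http   # symbolic name instead of a number
---
# ...and the workload's pod template names the matching container port:
#   containers:
#     - name: api
#       ports:
#         - name: http          # the name that targetPort refers to
#           containerPort: 8080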
Yes, I can confirm that the workstation is a node.
$ kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
dev-robert Ready control-plane,master 2d18h v1.26.1+k3s1 10.128.0.30 <none> Ubuntu 22.04.4 LTS 6.5.0-1014-gcp containerd://1.6.15-k3s1
I can also confirm that I can reach the services directly via IP without Telepresence running:
robert@dev-robert:~$ curl -I http://10.43.42.116
HTTP/1.1 403 Forbidden
X-Powered-By: Express
Access-Control-Allow-Origin: https://dev-robert.randomclient.com
Access-Control-Allow-Methods: GET,PUT,POST,DELETE,OPTIONS
Access-Control-Allow-Headers: Content-Type, Authorization, Content-Length, X-Requested-With
Access-Control-Allow-Credentials: true
set-cookie: connect.sid=s%3A7h1dKiO3yAZdLKX4eb5YLHB_0iMPbBCy.%2Bwp8kP1ekhUKmWNHoDoaqfSGcViR2DJilGztMlxbxiA; Path=/; HttpOnly
Date: Mon, 26 Feb 2024 15:12:57 GMT
Connection: keep-alive
Keep-Alive: timeout=5
robert@dev-robert:~$ ps aux | grep telepresence
robert 69889 0.0 0.0 7008 2304 pts/0 S+ 15:13 0:00 grep --color=auto telepresence
Although I can't do that through DNS before connecting:
robert@dev-robert:~$ curl -I http://api.dev-robert-explore
curl: (6) Could not resolve host: api.dev-robert-explore
robert@dev-robert:~$ curl -I http://api.dev-robert-explore.cluster.local
curl: (6) Could not resolve host: api.dev-robert-explore.cluster.local
Once I do a `telepresence connect`, I can reach the services:
robert@dev-robert:~$ curl -I http://api.dev-robert-explore
HTTP/1.1 403 Forbidden
X-Powered-By: Express
Access-Control-Allow-Origin: https://dev-robert.randomclient.com
Access-Control-Allow-Methods: GET,PUT,POST,DELETE,OPTIONS
Access-Control-Allow-Headers: Content-Type, Authorization, Content-Length, X-Requested-With
Access-Control-Allow-Credentials: true
set-cookie: connect.sid=s%3AfjaHzC9XiZ1tGeZgOjqez8iYvr-s77Ur.M5TB0HgKTHnTZYOmDuA6NEk555j%2FF793SDHcyUl0GqY; Path=/; HttpOnly
Date: Mon, 26 Feb 2024 15:15:32 GMT
Connection: keep-alive
Keep-Alive: timeout=5
Interesting thing happening with the events: I'm getting failed health checks.
31s Normal Created pod/api-58dc9559fc-r9wqw Created container api
31s Normal Started pod/api-58dc9559fc-r9wqw Started container api
31s Normal Pulled pod/api-58dc9559fc-r9wqw Container image "docker.io/datawire/ambassador-telepresence-agent:1.14.4" already present on machine
31s Normal Created pod/api-58dc9559fc-r9wqw Created container traffic-agent
31s Normal Started pod/api-58dc9559fc-r9wqw Started container traffic-agent
26s Warning Unhealthy pod/api-58dc9559fc-r9wqw Startup probe failed: Get "http://10.42.0.113:8080/public/health": read tcp 10.42.0.1:58628->10.42.0.113:8080: read: connection reset by peer
21s Warning Unhealthy pod/api-58dc9559fc-r9wqw Startup probe failed: Get "http://10.42.0.113:8080/public/health": read tcp 10.42.0.1:40374->10.42.0.113:8080: read: connection reset by peer
16s Warning Unhealthy pod/api-58dc9559fc-r9wqw Startup probe failed: HTTP probe failed with statuscode: 404
Now, I think that error is coming after the timeout fail hits.
I'm disabling the health check and seeing if that solves the problem.
Disabling the health check didn't do it. I'm now spinning up a new service which does nothing but `sleep infinity` on an `ubuntu:latest` image, and then trying to intercept that.
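A minimal sketch of such a throwaway deployment (assuming nothing beyond stock Kubernetes; the names mirror the `dev` workload described below but are illustrative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev
  labels:
    app: dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dev
  template:
    metadata:
      labels:
        app: dev
    spec:
      containers:
        - name: dev
          image: ubuntu:latest
          # keeps the pod alive doing nothing, purely for intercept testing
          command: ["sleep", "infinity"]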
No dice.
robert@dev-robert:~$ kubectl describe deployment/dev -n dev-robert-explore
Name: dev
Namespace: dev-robert-explore
CreationTimestamp: Mon, 26 Feb 2024 16:43:18 +0000
Labels: app=dev
pienso_version=dev-latest
Annotations: deployment.kubernetes.io/revision: 1
reloader.stakater.com/auto: true
Selector: app=api
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: Recreate
MinReadySeconds: 90
Pod Template:
Labels: app=api
pienso_version=dev-latest
Containers:
dev:
Image: ubuntu:latest
Port: <none>
Host Port: <none>
Command:
sleep
infinity
Environment Variables from:
api-env ConfigMap Optional: false
api-secret-env Secret Optional: false
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: dev-7f86655994 (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 7m48s deployment-controller Scaled up replica set dev-7f86655994 to 1
robert@dev-robert:~$ telepresence quit -s
Telepresence Daemons quitting...done
robert@dev-robert:~$ telepresence connect --namespace dev-robert-explore
Launching Telepresence User Daemon
Launching Telepresence Root Daemon
Connected to context default, namespace dev-robert-explore (https://127.0.0.1:6443)
robert@dev-robert:~$ telepresence intercept dev -- env
telepresence intercept: error: request timed out while waiting for agent dev.dev-robert-explore to arrive
Nothing interesting in the events.
42s Normal ScalingReplicaSet deployment/dev Scaled down replica set dev-574f44fb69 to 0 from 1
42s Normal SuccessfulDelete replicaset/dev-574f44fb69 Deleted pod: dev-574f44fb69-zpcd2
42s Normal Killing pod/dev-574f44fb69-zpcd2 Stopping container dev
11s Normal ScalingReplicaSet deployment/dev Scaled up replica set dev-8bc6866bd to 1
11s Normal SuccessfulCreate replicaset/dev-8bc6866bd Created pod: dev-8bc6866bd-4wclz
10s Normal Scheduled pod/dev-8bc6866bd-4wclz Successfully assigned dev-robert-explore/dev-8bc6866bd-4wclz to dev-robert
10s Normal Pulled pod/dev-8bc6866bd-4wclz Container image "ubuntu:latest" already present on machine
10s Normal Created pod/dev-8bc6866bd-4wclz Created container dev
10s Normal Started pod/dev-8bc6866bd-4wclz Started container dev
I tried to capture the container logs from the `traffic-agent`, but that's just not working. This isn't giving me anything even as I start and stop it repeatedly during the intercept.
robert@dev-robert:~$ kubectl logs -f --all-containers -n dev-robert-explore deployment/dev
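One thing worth noting: if the agent container ever does get injected, asking for it by name is more targeted than --all-containers (the -c and --previous flags are standard kubectl; the names follow the thread):
$ kubectl logs -n dev-robert-explore deploy/dev -c traffic-agent
$ # if the pod restarted, logs from the previous instance may still be readable:
$ kubectl logs -n dev-robert-explore deploy/dev -c traffic-agent --previous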
Anything else to try? Is there someplace else that I can ask for help? I'm getting to the point where I'm giving up on this PoC of Telepresence, but I feel like I'm so close and I want it to work so badly...
Here's the logs again from these most recent attempts. telepresence_logs.zip traces.gz
I'm going to try spinning up an entirely new dev environment and see if there's just something borked with my current dev environment.
Yeah, I tried `watch 'kubectl logs --all-containers -n dev-robert-explore deployment/dev | tee -a ./intercept.log'` and got nothing.
telepresence_logs.zip traces.gz
Spinning up a brand new dev environment didn't change anything that I can see. I've attached the logs.
robert@dev-robert2:~$ telepresence connect --namespace=dev-robert2-explore
Launching Telepresence User Daemon
Launching Telepresence Root Daemon
Connected to context default, namespace dev-robert2-explore (https://127.0.0.1:6443)
robert@dev-robert2:~$ telepresence intercept dev -- /bin/bash
telepresence intercept: error: request timed out while waiting for agent dev.dev-robert2-explore to arrive
Another weird issue. I tried bumping up the timeout via the config.yml in order to maybe capture more logs or be able to poke around a bit more, but the increased timeout doesn't seem to be taking.
+ exec telepresence intercept dev -- /bin/bash
telepresence intercept: error: request timed out while waiting for agent dev.dev-robert-explore to arrive
real 0m39.974s
user 0m4.259s
sys 0m0.217s
$ cat ~/.config/telepresence/config.yml
---
client:
  timeouts:
    intercept: 5m
    agentInstall: 5m
    trafficManagerConnect: 5m
timeouts:
  helm: 5m
  intercept: 5m
  agentInstall: 5m
  trafficManagerConnect: 5m
Using a symbolic port name didn't help.
robert@dev-robert$ kubectl describe pod dev -n dev-robert-explore
Name: dev-66cfb695f6-c6w7c
Namespace: dev-robert-explore
Priority: 1000
Priority Class Name: default-priority
Service Account: default
Node: dev-robert/10.128.0.30
Start Time: Mon, 26 Feb 2024 20:11:37 +0000
Labels: app=dev
pienso_version=dev-latest
pod-template-hash=66cfb695f6
Annotations: telepresence.getambassador.io/restartedAt: 2024-02-26T20:11:05Z
Status: Running
IP: 10.42.0.157
IPs:
IP: 10.42.0.157
Controlled By: ReplicaSet/dev-66cfb695f6
Containers:
dev:
Container ID: containerd://675eabd347de5f1883b37e97bcab052776b7806faf29baefcb0fef079fed4eba
Image: ubuntu:latest
Image ID: docker.io/library/ubuntu@sha256:f9d633ff6640178c2d0525017174a688e2c1aef28f0a0130b26bd5554491f0da
Port: 80/TCP
Host Port: 0/TCP
Command:
sleep
infinity
State: Running
Started: Mon, 26 Feb 2024 20:11:38 +0000
Ready: True
Restart Count: 0
Environment Variables from:
api-env ConfigMap Optional: false
api-secret-env Secret Optional: false
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5bt49 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-5bt49:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m16s default-scheduler Successfully assigned dev-robert-explore/dev-66cfb695f6-c6w7c to dev-robert
Normal Pulled 3m17s kubelet Container image "ubuntu:latest" already present on machine
Normal Created 3m16s kubelet Created container dev
Normal Started 3m16s kubelet Started container dev
robert@dev-robert$ kubectl describe service dev -n dev-robert-explore
Name: dev
Namespace: dev-robert-explore
Labels: <none>
Annotations: <none>
Selector: app=dev
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.43.54.120
IPs: 10.43.54.120
Port: <unset> 80/TCP
TargetPort: http/TCP
Endpoints: 10.42.0.157:80
Session Affinity: ClientIP
Events: <none>
Same timeout when I try to intercept.
Is there a port that needs to be open on the host for telepresence to work?
You might try this annotation. Essentially this makes one step (where usually Telepresence starts the intercept) into two steps, where instead you run the intercept after. We've found this sometimes helps where the `telepresence intercept` causes a "waiting for agent to arrive" type error.
To your question above, I don't think so. On that same page you'll see "Telepresence will automatically find all services and all ports that will connect to a workload and make them available for an intercept, but you can explicitly define that only one service and/or port can be intercepted." If you search on 'port' on that page you'll see some other notes there, including numeric vs. symbolic ports (a symbolic port is generally preferred, as Thomas noted).
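Presumably that's the agent-injection annotation from the Telepresence docs, set on the workload's pod template so the agent is injected at rollout time rather than at intercept time; a sketch:
spec:
  template:
    metadata:
      annotations:
        telepresence.getambassador.io/inject-traffic-agent: enabled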
I'm confused: if I use that annotation to prevent injecting the traffic agent, how will I get the traffic? I guess I'm unclear about what you mean by "you run the intercept after". How do I do that?
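Reading the suggestion as a two-step flow, something like this (the patch payload is illustrative; the annotation is the one sketched above):
$ # step 1: annotate the pod template so the agent is injected on rollout
$ kubectl patch deployment dev -n dev-robert-explore --type merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{"telepresence.getambassador.io/inject-traffic-agent":"enabled"}}}}}'
$ kubectl rollout status deployment/dev -n dev-robert-explore
$ # step 2: with the agent already present, the intercept no longer has to wait for it
$ telepresence intercept dev -- /bin/bash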
Didn't seem to work.
robert@dev-robert$ telepresence quit
Disconnected
robert@dev-robert$ telepresence connect -n dev-robert-explore
Connected to context default, namespace dev-robert-explore (https://127.0.0.1:6443)
robert@dev-robert$ telepresence intercept dev -- /bin/bash
telepresence intercept: error: request timed out while waiting for agent dev.dev-robert-explore to arrive
robert@dev-robert$ kubectl describe deployment -n dev-robert-explore dev
Name: dev
Namespace: dev-robert-explore
CreationTimestamp: Mon, 26 Feb 2024 20:08:28 +0000
Labels: app=dev
pienso_version=dev-latest
Annotations: deployment.kubernetes.io/revision: 4
reloader.stakater.com/auto: true
telepresence.getambassador.io/inject-traffic-agent: disabled
Selector: app=dev
Replicas: 1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType: Recreate
MinReadySeconds: 90
Pod Template:
Labels: app=dev
pienso_version=dev-latest
Annotations: telepresence.getambassador.io/restartedAt: 2024-02-27T14:29:29Z
Containers:
dev:
Image: ubuntu:latest
Port: 80/TCP
Host Port: 0/TCP
Command:
sleep
infinity
Environment Variables from:
api-env ConfigMap Optional: false
api-secret-env Secret Optional: false
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available False MinimumReplicasUnavailable
Progressing True ReplicaSetUpdated
OldReplicaSets: <none>
NewReplicaSet: dev-567dbd4dfc (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 2m2s deployment-controller Scaled down replica set dev-66cfb695f6 to 0 from 1
Normal ScalingReplicaSet 91s deployment-controller Scaled up replica set dev-846f55797 to 1
Normal ScalingReplicaSet 55s deployment-controller Scaled down replica set dev-846f55797 to 0 from 1
Normal ScalingReplicaSet 24s deployment-controller Scaled up replica set dev-567dbd4dfc to 1
robert@dev-robert$ kubectl describe pod -n dev-robert-explore -l app=dev
Name: dev-567dbd4dfc-j5sqn
Namespace: dev-robert-explore
Priority: 1000
Priority Class Name: default-priority
Service Account: default
Node: dev-robert/10.128.0.30
Start Time: Tue, 27 Feb 2024 14:30:00 +0000
Labels: app=dev
pienso_version=dev-latest
pod-template-hash=567dbd4dfc
Annotations: telepresence.getambassador.io/restartedAt: 2024-02-27T14:29:29Z
Status: Running
IP: 10.42.0.223
IPs:
IP: 10.42.0.223
Controlled By: ReplicaSet/dev-567dbd4dfc
Containers:
dev:
Container ID: containerd://ec51292000a9c7ed14c530dd407f7002ae4a35ca053229b8fb3120560040b4c7
Image: ubuntu:latest
Image ID: docker.io/library/ubuntu@sha256:f9d633ff6640178c2d0525017174a688e2c1aef28f0a0130b26bd5554491f0da
Port: 80/TCP
Host Port: 0/TCP
Command:
sleep
infinity
State: Running
Started: Tue, 27 Feb 2024 14:30:01 +0000
Ready: True
Restart Count: 0
Environment Variables from:
api-env ConfigMap Optional: false
api-secret-env Secret Optional: false
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-n6z7v (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-n6z7v:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 46s default-scheduler Successfully assigned dev-robert-explore/dev-567dbd4dfc-j5sqn to dev-robert
Normal Pulled 46s kubelet Container image "ubuntu:latest" already present on machine
Normal Created 46s kubelet Created container dev
Normal Started 46s kubelet Started container dev
Is there a step that I'm missing?
Is there any other information that I can give you to help debug what's going on? Any idea why the timeouts aren't working? (Maybe that has something to do with it...?) https://github.com/telepresenceio/telepresence/issues/3531#issuecomment-1964993327
I'm seeing these errors in the logs. Do they mean anything?
2024-02-26 15:15:25.7892 error failed to review capability: rpc error: code = InvalidArgument desc = capabilities should not be empty
2024-02-26 15:54:19.9945 error Unable to read cloudd-address.json: failed to parse JSON from file /home/robert/.cache/telepresence/cloudd-address.json: json: cannot unmarshal string into Go value of type cloudd.Address
Also, a lot of this stuff....
connector-20240227T141100.log:2024-02-26 15:39:03.1604 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 15:39:03.3861 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 16:51:30.8955 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 16:51:32.1114 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 16:57:52.1297 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 16:57:53.4980 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 17:06:54.5015 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 17:06:56.3886 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 17:44:26.4920 error connector/session/cloud-daemon-watcher/client : goroutine "/connector/session/cloud-daemon-watcher/client" exited with error: rpc error: code = Unavailable desc = error reading from server: EOF
connector-20240227T141100.log:2024-02-26 17:44:26.4922 info connector/session/cloud-daemon-watcher:shutdown_status : /connector/session/cloud-daemon-watcher/client: exited with error
connector-20240227T141100.log:2024-02-26 18:51:36.5127 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 18:51:38.1858 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 18:51:46.6389 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 18:51:47.9200 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 18:55:23.1854 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: unable to get cloud daemon config from traffic-manager: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 18:55:24.3231 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 19:00:58.1099 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 19:00:59.3149 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 19:02:28.8410 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 19:02:30.5878 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 19:03:58.6638 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 19:04:01.0881 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 19:06:40.6446 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 19:06:41.8224 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 19:19:36.3613 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 19:19:37.5870 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:11:00.4146 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:11:01.5802 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:24:19.9414 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:24:21.1844 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:24:26.1306 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:24:27.2314 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:25:27.9587 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:25:29.4896 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:27:23.8335 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:27:24.9331 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:28:35.0380 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:28:36.1327 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:30:22.9083 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:30:23.9482 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:46:43.9808 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:46:45.4811 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:47:01.9032 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:47:03.2011 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:47:54.3185 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:47:55.5828 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:48:45.5813 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:48:46.6411 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:49:20.3477 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:49:21.6120 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:52:57.5604 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:52:59.1184 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 20:53:50.6075 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 20:53:51.9987 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector-20240227T141100.log:2024-02-26 21:01:08.9204 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 21:01:08.9204 error connector/session/cloud-daemon-watcher/client : goroutine "/connector/session/cloud-daemon-watcher/client" exited with error: unable to proxy: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 21:01:08.9206 info connector/session/cloud-daemon-watcher:shutdown_status : /connector/session/cloud-daemon-watcher/client: exited with error
connector-20240227T141100.log:2024-02-26 21:01:08.9206 error connector/session/cloud-daemon-watcher : unable to proxy: rpc error: code = Canceled desc = context canceled
connector-20240227T141100.log:2024-02-26 21:01:10.0055 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector.log:2024-02-27 14:16:16.4685 debug connector/session/cloud-daemon-watcher : Recv failed: rpc error: code = Canceled desc = context canceled
connector.log:2024-02-27 14:16:16.4686 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector.log:2024-02-27 14:16:17.5464 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector.log:2024-02-27 14:16:32.7749 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector.log:2024-02-27 14:16:34.4843 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector.log:2024-02-27 14:17:49.1179 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector.log:2024-02-27 14:17:50.3135 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector.log:2024-02-27 14:18:01.8491 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector.log:2024-02-27 14:18:03.7291 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector.log:2024-02-27 14:29:09.5893 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector.log:2024-02-27 14:29:11.0015 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
connector.log:2024-02-27 14:38:49.9878 debug connector/session/cloud-daemon-watcher : Recv failed: rpc error: code = Canceled desc = context canceled
connector.log:2024-02-27 14:38:49.9878 error connector/session/cloud-daemon-watcher : goroutine "/connector/session/cloud-daemon-watcher" exited with error: rpc error: code = Canceled desc = context canceled
connector.log:2024-02-27 14:38:51.1792 info connector/session:shutdown_status : /connector/session/cloud-daemon-watcher : exited with error
Looking at your logs (from the traffic-manager), I can verify that an attempt to install the traffic-agent happens here:
2024-02-27 14:28:22.0679 debug httpd/conn=127.0.0.1:8081 : PrepareIntercept called : session_id="e4dd080c-a51c-44c0-a848-3d5af5085bdf"
2024-02-27 14:28:22.0848 debug agent-configs : MODIFIED telepresence-agents.dev-robert-explore
2024-02-27 14:28:22.0849 debug agent-configs : add dev.dev-robert-explore
2024-02-27 14:28:22.0938 debug agent-configs : Rollout of dev.dev-robert-explore is necessary. An agent is desired but the pod dev-66cfb695f6-c6w7c doesn't have one
2024-02-27 14:28:22.0996 info agent-configs : Successfully rolled out dev.dev-robert-explore
but from the looks of it, the agent-injector doesn't kick in here. That's very odd. It should definitely not say "skipping" here.
2024-02-27 14:28:53.4644 debug agent-injector : Handling admission request CREATE dev-846f55797-.dev-robert-explore
2024-02-27 14:28:53.4722 debug agent-injector : The dev-846f55797-.dev-robert-explore pod has not enabled traffic-agent container injection through "telepresence-agents" configmap or through "telepresence.getambassador.io/inject-traffic-agent" annotation; skipping
2024-02-27 14:28:53.4724 debug agent-injector : Webhook request handled successfully
What does the `telepresence-agents` configmap contain at this point? Was this log printed at a time when you had a `telepresence.getambassador.io/inject-traffic-agent: disabled` annotation in the deployment?
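For the record, the first question can be answered with (the configmap name is the one the agent-injector logs mention):
$ kubectl get configmap telepresence-agents -n dev-robert-explore -o yaml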
The errors regarding capability and cloudd-address.json aren't relevant to this problem.
Oh, wait. There's a timeout here. The agent-injector CREATE happens after another MODIFIED event, following a 30-second timeout. That'd explain it:
2024-02-27 14:28:22.1200 debug agent-injector : Handling admission request DELETE dev-66cfb695f6-c6w7c.dev-robert-explore
2024-02-27 14:28:22.1256 debug agent-injector : Webhook request handled successfully
2024-02-27 14:28:52.0763 debug agent-configs : MODIFIED telepresence-agents.dev-robert-explore
2024-02-27 14:28:52.0764 debug agent-configs : del dev.dev-robert-explore
2024-02-27 14:28:52.0859 debug agent-configs : Rollout of dev.dev-robert-explore is not necessary. All pods have the desired agent state
So this is probably caused by the fact that your `agentInstall` timeout doesn't take.
@thallgren -- Oh, that sounds like great news. Any idea why the timeout doesn't take? Can you give me a sample `config.yaml` which should set it?
The reason that timeout doesn't take is that it's a client-only setting. This is a traffic-manager timeout. Try this:
$ telepresence helm upgrade --set timeouts.agentArrival=120s
We should really remove that timeout from the config.yml. It was added in a very early version of Telepresence where the client did the agent install. It's meaningless now.
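If you'd rather keep that setting somewhere durable, the same key can presumably go in a values file (assuming `telepresence helm upgrade` accepts a helm-style --values flag, as `telepresence helm install` does):
$ cat values.yaml
timeouts:
  agentArrival: 120s
$ telepresence helm upgrade --values values.yaml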
I'm leaving the issue open b/c I don't know the protocol for closing it properly, but this is resolved as far as I'm concerned. Thank you very much for all your help.
Thanks for confirming that it worked.
I don't think we have a protocol for closing tickets, but if the problem is solved, then the ticket should be closed.
Describe the bug
I can connect but can't intercept anything, and I always get `request timed out while waiting for agent ... to arrive` errors. I can see in the `kubectl get events` output that the `api` service I am attempting to intercept has a container terminated after running the `telepresence intercept` command, and it's spun back up after the intercept times out.
telepresence_logs.zip traces.gz
Repeatedly running `telepresence helm uninstall` and `telepresence helm install` does not help. Rebooting does not help. I installed `sshfs` after seeing it mentioned in some of the error messages; that hasn't helped. I tried using `--replace` and specifying `--namespace`. Neither of those helped.
To Reproduce
Expected behavior
Intercepting should work.
Versions (please complete the following information):
- Output of `telepresence version`
- Operating system of workstation running `telepresence` commands
- Kubernetes environment and Version [e.g. Minikube, bare metal, Google Kubernetes Engine]