telepresenceio / telepresence

Local development against a remote Kubernetes or OpenShift cluster
https://www.telepresence.io

Kubernetes DNS Resolution Failing Spuriously on Mac #560

Closed esorey closed 4 years ago

esorey commented 6 years ago

I'm opening a telepresence session to our K8s cluster using method vpn-tcp without swapping out any deployments just to access K8s resources. Roughly half of the time I do this, it works perfectly. The other times, I get errors complaining that DNS resolution of the K8s addresses failed. However, when I run dig <K8s-IP>, it reports status NOERROR, so I know that there is no real issue with the addresses themselves. The only workaround I've found thus far is to restart my machine entirely. I've also tried the same setup on Linux and have not seen the issue after many runs. Specifically, this is on macOS Sierra 10.12.6. This seems like a particularly hairy issue that may be solved by the plans to run DNS locally, but I still wanted to document it.

Thank you!
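
One possible reason dig can succeed while other tools fail: dig sends its query directly to a nameserver, whereas curl, browsers, and most other programs go through the system resolver (getaddrinfo). The sketch below is a diagnostic aid, not part of Telepresence. It asks the system resolver for a name so the answer can be compared with what dig reports. The function name system_resolve and its injectable resolver parameter are illustrative assumptions.

```python
import socket


def system_resolve(name, resolver=socket.getaddrinfo):
    """Return the sorted IPv4 addresses the system resolver gives for name.

    An empty list means the lookup failed, which is the situation curl
    reports as "Could not resolve host". The resolver argument exists only
    so the function can be exercised without network access (an assumption
    made for this sketch).
    """
    try:
        infos = resolver(name, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); for IPv4
    # the sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})
```

Calling system_resolve("kubernetes") inside a Telepresence session and getting an empty list while dig kubernetes returns an address would suggest the problem sits in the system resolver path rather than in the cluster's DNS.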

ark3 commented 6 years ago

Your Mac ends up in some sort of broken state such that DNS resolution of cluster resources with vpn-tcp fails, and you have to restart the machine to fix things. Is that right? The next time this happens, could you please link to a gist of telepresence.log? Please try curl, dig, etc., so the DNS lookups are visible in the log.

If you have to redact the logfile contents, it may be easier to extract the sshuttle and pod logs. See the trace below for an example.

Thank you for your help!

$ telepresence
Starting proxy with method 'vpn-tcp', which has the following limitations: All processes are affected, only one telepresence can run per machine, and you can't use other VPNs. You may need to add cloud hosts with --also-proxy. For a full list of method limitations see https://telepresence.io/reference/methods.html
Volumes are rooted at $TELEPRESENCE_ROOT. See https://telepresence.io/howto/volumes.html for details.

No traffic is being forwarded from the remote Deployment to your local machine. You can use the --expose option to specify which ports you want to forward.

Guessing that Services IP range is 10.3.240.0/20. Services started after this point will be inaccessible if are outside this range; restart telepresence if you can't access a new Service.

@gke_datawireio_us-central1-a_telepresence-testing|bash-4.4$ curl -sk https://kubernetes/api/
{
  "kind": "APIVersions",
  "versions": [
    "v1"
  ],
  "serverAddressByClientCIDRs": [
    {
      "clientCIDR": "0.0.0.0/0",
      "serverAddress": "35.184.xx.xx"
    }
  ]
}
@gke_datawireio_us-central1-a_telepresence-testing|bash-4.4$ 
@gke_datawireio_us-central1-a_telepresence-testing|bash-4.4$ dig kubernetes

; <<>> DiG 9.9.7-P3 <<>> kubernetes
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1260
;; flags: qr ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;kubernetes.                    IN      A

;; ANSWER SECTION:
kubernetes.             0       IN      A       10.3.240.1

;; Query time: 109 msec
;; SERVER: 2001:558:feed::1#53(2001:558:feed::1)
;; WHEN: Fri Apr 06 09:14:20 EDT 2018
;; MSG SIZE  rcvd: 44

@gke_datawireio_us-central1-a_telepresence-testing|bash-4.4$ host kubernetes
kubernetes has address 10.3.240.1
kubernetes has address 10.3.240.1
Host kubernetes not found: 3(NXDOMAIN)
@gke_datawireio_us-central1-a_telepresence-testing|bash-4.4$ 
@gke_datawireio_us-central1-a_telepresence-testing|bash-4.4$ exit
exit

$ fgrep "Launching: sshuttle" telepresence.log 
  20.5 TEL | [35] Launching: sshuttle-telepresence -v --dns --method auto -e 'ssh -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -F /dev/null' --to-ns 127.0.0.1:9053 -r telepresence@localhost:58044 10.0.125.0/24 10.0.126.0/24 10.3.240.0/20 10.0.127.0/24 10.0.19.0/24

$ fgrep "35 |" telepresence.log                      
  20.8  35 | Starting sshuttle proxy.
  21.0  35 | firewall manager: Starting firewall with Python version 3.6.5                                
[...]

$ fgrep logs telepresence.log 
  12.7 TEL | [25] Launching: kubectl --context gke_datawireio_us-central1-a_telepresence-testing --namespace default logs -f telepresence-1523020367-938933-52094-1060178697-lcx4v --container telepresence-1523020367-938933-52094

$ fgrep "25 |" telepresence.log                                              
  19.4  25 | Listening...
  19.4  25 | 2018-04-06T13:13:09+0000 [-] Loading ./forwarder.py...                                                     
  19.4  25 | 2018-04-06T13:13:10+0000 [-] /etc/resolv.conf changed, reparsing                                           
  19.4  25 | 2018-04-06T13:13:10+0000 [-] Resolver added ('10.3.240.10', 53) to server list          
[...]
  40.6  25 | 2018-04-06T13:13:31+0000 [stdout#info] A query: b'kubernetes'
  40.6  25 | 2018-04-06T13:13:31+0000 [stdout#info] AAAA query, sending back A instead: b'kubernetes'
  40.6  25 | 2018-04-06T13:13:31+0000 [stdout#info] A query: b'kubernetes'
  40.6  25 | 2018-04-06T13:13:31+0000 [stdout#info] Result for b'kubernetes' is ['10.3.240.1']
  40.6  25 | 2018-04-06T13:13:31+0000 [stdout#info] Result for b'kubernetes' is ['10.3.240.1']
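
The fgrep steps in the trace above can also be scripted. The following is a minimal sketch, assuming the log layout shown in this thread: a "[N] Launching: ..." line announces a subprocess, and that subprocess's output lines carry N in the column before the "|". The name process_lines is hypothetical.

```python
import re

# "[35] Launching: sshuttle-telepresence ..." -> ("35", "sshuttle-telepresence ...")
LAUNCH = re.compile(r"\|\s*\[(\d+)\] Launching: (.*)")
# "  20.8  35 | Starting sshuttle proxy." -> tag "35"
TAGGED = re.compile(r"^\s*[\d.]+\s+(\S+)\s+\|")


def process_lines(log_text, needle):
    """Return the log lines written by the first launched process whose
    command line contains needle (for example "sshuttle" or " logs ")."""
    lines = log_text.splitlines()
    tag = None
    for line in lines:
        launch = LAUNCH.search(line)
        if launch and needle in launch.group(2):
            tag = launch.group(1)
            break
    if tag is None:
        return []
    selected = []
    for line in lines:
        tagged = TAGGED.match(line)
        if tagged and tagged.group(1) == tag:
            selected.append(line)
    return selected
```

Passing the contents of telepresence.log with "sshuttle" should give the sshuttle output, and " logs " should give the pod's log lines, which are usually easier to redact than the whole file.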
esorey commented 6 years ago

"Your Mac ends up in some sort of broken state such that DNS resolution of cluster resources with vpn-tcp fails, and you have to restart the machine to fix things. Is that right?"

That's correct.

Here are my results from playing around with dig/curl/host and digging through logs. Let me know if any more info would be helpful, and thank you for looking into this!

$ ./dev/telepresence-global.sh
Starting proxy with method 'vpn-tcp', which has the following limitations: All processes are affected, only one telepresence can run per machine, and you can't use other VPNs. You may need to add cloud hosts with --also-proxy. For a full list of method limitations see https://telepresence.io/reference/methods.html
Volumes are rooted at $TELEPRESENCE_ROOT. See https://telepresence.io/howto/volumes.html for details.

No traffic is being forwarded from the remote Deployment to your local machine. You can use the --expose option to specify which ports you want to forward.

Password:
Guessing that Services IP range is 100.64.0.0/13. Services started after this point will be inaccessible if are outside this range; restart telepresence if you can't access a new Service.

##### Dig Results

$ dig kafka-kube-staging-1.us-east-1.iris.internal

; <<>> DiG 9.8.3-P1 <<>> kafka-kube-staging-1.us-east-1.iris.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50978
;; flags: qr ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;kafka-kube-staging-1.us-east-1.iris.internal. IN A

;; ANSWER SECTION:
kafka-kube-staging-1.us-east-1.iris.internal. 0 IN A 172.30.148.226

;; Query time: 155 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Fri Apr  6 12:55:25 2018
;; MSG SIZE  rcvd: 78

####### Host Results

$ host kafka-kube-staging-1.us-east-1.iris.internal
kafka-kube-staging-1.us-east-1.iris.internal has address 172.30.148.226
kafka-kube-staging-1.us-east-1.iris.internal has address 172.30.148.226

####### curl Results
$ curl -v kafka-kube-staging-1.us-east-1.iris.internal
* Rebuilt URL to: kafka-kube-staging-1.us-east-1.iris.internal/
* Could not resolve host: kafka-kube-staging-1.us-east-1.iris.internal
* Closing connection 0
curl: (6) Could not resolve host: kafka-kube-staging-1.us-east-1.iris.internal

####### Log digging
$ fgrep "sshuttle" telepresence.log
10.5 TL | BEGIN SPAN vpn.py:200(connect_sshuttle)
19.1 TL | [37] Launching: ['sshuttle-telepresence', '-v', '--dns', '--method', 'auto', '-e', 'ssh -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -F /dev/null', '--to-ns', '127.0.0.1:9053', '-r', 'telepresence@localhost:52760', '100.96.8.0/24', '172.30.150.42', '172.30.132.164', '100.96.2.0/24', '172.30.131.115', '100.96.3.0/24', '100.96.5.0/24', '172.30.32.205', '172.30.132.100', '172.30.148.64', '100.96.6.0/24', '54.211.179.120', '10.51.206.104', '172.30.132.185', '172.30.131.235', '172.30.130.60', '172.30.149.60', '100.96.0.0/24', '100.96.10.0/24', '100.96.9.0/24', '100.64.0.0/13', '172.30.150.180', '172.30.148.226', '52.7.76.172', '172.30.131.64', '172.30.132.168', '172.30.131.98', '54.158.43.211', '172.30.149.71', '100.96.4.0/24', '172.30.32.180', '172.30.130.168']...
19.1 TL | BEGIN SPAN vpn.py:242(connect_sshuttle,sshuttle-wait)
19.2 37 | Starting sshuttle proxy.
21.2 37 | >> pfctl -a sshuttle6-12300 -f /dev/stdin
21.2 37 | >> pfctl -a sshuttle-12300 -f /dev/stdin
22.5 TL | END SPAN vpn.py:242(connect_sshuttle,sshuttle-wait)    3.4s
22.5 TL | END SPAN vpn.py:200(connect_sshuttle)   12.0s
750.0 37 | >> pfctl -a sshuttle6-12300 -F all
750.0 37 | >> pfctl -a sshuttle-12300 -F all
754.8 TL |   12.0s   vpn.py:200(connect_sshuttle)
754.8 TL |    3.4s     vpn.py:242(connect_sshuttle,sshuttle-wait)

$ fgrep "37 |" telepresence.log
19.2 37 | Starting sshuttle proxy.
19.5 37 | firewall manager: Starting firewall with Python version 3.6.4
19.5 37 | firewall manager: ready method name pf.
19.5 37 | IPv6 enabled: True
19.5 37 | UDP enabled: False
19.5 37 | DNS enabled: True
19.5 37 | TCP redirector listening on ('::1', 12300, 0, 0).
19.5 37 | TCP redirector listening on ('127.0.0.1', 12300).
19.5 37 | DNS listening on ('::1', 12300, 0, 0).
19.5 37 | DNS listening on ('127.0.0.1', 12300).
19.5 37 | Starting client with Python version 3.6.4
19.5 37 | c : connecting to server...
20.1 37 | Warning: Permanently added '[localhost]:52760' (ECDSA) to the list of known hosts.
21.2 37 | Starting server with Python version 3.6.1
21.2 37 |  s: latency control setting = True
21.2 37 |  s: available routes:
21.2 37 | c : Connected.
21.2 37 | firewall manager: setting up.
21.2 37 | >> pfctl -s Interfaces -i lo -v
21.2 37 | >> pfctl -s all
21.2 37 | >> pfctl -a sshuttle6-12300 -f /dev/stdin
21.2 37 | >> pfctl -E
21.2 37 | >> pfctl -s Interfaces -i lo -v
21.2 37 | >> pfctl -s all
21.2 37 | >> pfctl -a sshuttle-12300 -f /dev/stdin
21.2 37 | >> pfctl -E
21.3 37 | c : DNS request from ('192.168.1.33', 49634) to None: 37 bytes
22.4 37 | c : DNS request from ('192.168.1.33', 59263) to None: 37 bytes
70.4 37 | c : DNS request from ('192.168.1.33', 29883) to None: 38 bytes
70.4 37 | c : DNS request from ('192.168.1.33', 11938) to None: 33 bytes
70.4 37 | c : DNS request from ('192.168.1.33', 19856) to None: 37 bytes
70.4 37 | c : DNS request from ('192.168.1.33', 38563) to None: 35 bytes
70.4 37 | c : DNS request from ('192.168.1.33', 57887) to None: 43 bytes
70.4 37 | c : DNS request from ('192.168.1.33', 40002) to None: 42 bytes
70.6 37 | c : DNS request from ('192.168.1.33', 23155) to None: 32 bytes
70.6 37 | c : DNS request from ('192.168.1.33', 18296) to None: 32 bytes
[...]
749.9 37 | Connection to localhost closed by remote host.
750.0 37 | >> pfctl -a sshuttle6-12300 -F all
750.0 37 | >> pfctl -X 15307307430862952187
750.0 37 | >> pfctl -a sshuttle-12300 -F all
750.0 37 | >> pfctl -X 15307307430862955067

$ fgrep logs telepresence.log
3.9 TL | [11] Launching: ['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'dev', 'logs', '-f', 'telepresence-1523044023-793812-2163-846fb6bc95-tkcvx', '--container', 'telepresence-1523044023-793812-2163']...

$ fgrep "11 |" telepresence.log
8.7 11 | Listening...
8.7 11 | 2018-04-06T19:47:12+0000 [-] Loading ./forwarder.py...
8.7 11 | 2018-04-06T19:47:13+0000 [-] /etc/resolv.conf changed, reparsing
8.7 11 | 2018-04-06T19:47:13+0000 [-] Resolver added ('100.64.0.10', 53) to server list
[...]
70.5 11 | 2018-04-06T19:48:15+0000 [stdout#info] A query: b'adservice.google.com'
70.6 11 | 2018-04-06T19:48:15+0000 [stdout#info] A query: b'apis.google.com'
70.6 11 | 2018-04-06T19:48:15+0000 [stdout#info] A query: b'clients5.google.com'
70.6 11 | 2018-04-06T19:48:15+0000 [stdout#info] A query: b'fonts.gstatic.com'
70.6 11 | 2018-04-06T19:48:15+0000 [stdout#info] A query: b'lh3.googleusercontent.com'
70.6 11 | 2018-04-06T19:48:15+0000 [stdout#info] A query: b'notifications.google.com'
70.6 11 | 2018-04-06T19:48:15+0000 [stdout#info] Result for b'apis.google.com' is ['172.217.7.206']
70.6 11 | 2018-04-06T19:48:15+0000 [stdout#info] Result for b'adservice.google.com' is ['172.217.12.226']
70.6 11 | 2018-04-06T19:48:15+0000 [stdout#info] Result for b'fonts.gstatic.com' is ['172.217.15.99']
70.6 11 | 2018-04-06T19:48:15+0000 [stdout#info] Result for b'notifications.google.com' is ['172.217.7.206']
70.7 11 | 2018-04-06T19:48:15+0000 [stdout#info] Result for b'clients5.google.com' is ['172.217.3.46']
70.7 11 | 2018-04-06T19:48:15+0000 [stdout#info] A query: b'www.google.com'
70.7 11 | 2018-04-06T19:48:15+0000 [stdout#info] A query: b'ogs.google.com'
70.7 11 | 2018-04-06T19:48:15+0000 [stdout#info] A query: b'ssl.gstatic.com'
70.7 11 | 2018-04-06T19:48:15+0000 [stdout#info] Result for b'lh3.googleusercontent.com' is ['172.217.13.225']
70.7 11 | 2018-04-06T19:48:15+0000 [stdout#info] Result for b'www.google.com' is ['172.217.9.196']
70.7 11 | 2018-04-06T19:48:15+0000 [stdout#info] Result for b'ssl.gstatic.com' is ['216.58.217.131']
70.8 11 | 2018-04-06T19:48:15+0000 [stdout#info] Result for b'ogs.google.com' is ['172.217.7.206']
70.8 11 | 2018-04-06T19:48:15+0000 [stdout#info] A query: b'www.gstatic.com'
70.8 11 | 2018-04-06T19:48:15+0000 [stdout#info] Result for b'www.gstatic.com' is ['172.217.15.67']
73.4 11 | 2018-04-06T19:48:18+0000 [stdout#info] A query: b'kubernetes'
73.4 11 | 2018-04-06T19:48:18+0000 [stdout#info] getaddrinfo error: [Errno -2] Name does not resolve
78.2 11 | 2018-04-06T19:48:22+0000 [stdout#info] A query: b'cuscochromeextension-pa.googleapis.com'
78.2 11 | 2018-04-06T19:48:22+0000 [stdout#info] Result for b'cuscochromeextension-pa.googleapis.com' is ['172.217.15.106', '172.217.15.74', '172.217.13.234', '172.217.13.74', '172.217.12.234', '172.217.9.202', '172.217.8.10', '172.217.7.138', '172.217.5.234']
78.3 11 | 2018-04-06T19:48:23+0000 [stdout#info] A query: b'www.googleapis.com'
78.3 11 | 2018-04-06T19:48:23+0000 [stdout#info] Result for b'www.googleapis.com' is ['172.217.15.74', '172.217.13.234', '172.217.13.74', '172.217.12.234', '172.217.9.202', '172.217.8.10', '172.217.7.138', '172.217.5.234', '172.217.15.106']
82.9 11 | 2018-04-06T19:48:27+0000 [stdout#info] A query: b'github.com'
82.9 11 | 2018-04-06T19:48:27+0000 [stdout#info] Result for b'github.com' is ['192.30.253.112', '192.30.253.113']
83.3 11 | 2018-04-06T19:48:28+0000 [stdout#info] A query: b'avatars2.githubusercontent.com'
83.3 11 | 2018-04-06T19:48:28+0000 [stdout#info] Result for b'avatars2.githubusercontent.com' is ['151.101.32.133']
131.7 11 | 2018-04-06T19:49:16+0000 [stdout#info] A query: b'kafka-kube-staging-1.us-east-1.iris.internal'
131.7 11 | 2018-04-06T19:49:16+0000 [stdout#info] Result for b'kafka-kube-staging-1.us-east-1.iris.internal' is ['172.30.148.226']
156.0 11 | 2018-04-06T19:49:40+0000 [stdout#info] A query: b'kafka-kube-staging-1.us-east-1.iris.internal'
156.0 11 | 2018-04-06T19:49:40+0000 [stdout#info] Result for b'kafka-kube-staging-1.us-east-1.iris.internal' is ['172.30.148.226']
156.1 11 | 2018-04-06T19:49:40+0000 [stdout#info] AAAA query, sending back A instead: b'kafka-kube-staging-1.us-east-1.iris.internal'
156.1 11 | 2018-04-06T19:49:40+0000 [stdout#info] A query: b'kafka-kube-staging-1.us-east-1.iris.internal'
156.1 11 | 2018-04-06T19:49:40+0000 [stdout#info] Result for b'kafka-kube-staging-1.us-east-1.iris.internal' is ['172.30.148.226']
156.2 11 | 2018-04-06T19:49:40+0000 [stdout#info] 15 query: b'kafka-kube-staging-1.us-east-1.iris.internal'
156.2 11 | 2018-04-06T19:49:40+0000 [DNSDatagramProtocol (UDP)] DNSDatagramProtocol starting on 46492
156.2 11 | 2018-04-06T19:49:40+0000 [DNSDatagramProtocol (UDP)] Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7fedb0770b38>
156.2 11 | 2018-04-06T19:49:40+0000 [-] (UDP Port 46492 Closed)
156.2 11 | 2018-04-06T19:49:40+0000 [-] Stopping protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7fedb0770b38>
187.9 11 | 2018-04-06T19:50:12+0000 [stdout#info] A query: b'tasks.google.com'
187.9 11 | 2018-04-06T19:50:12+0000 [stdout#info] Result for b'tasks.google.com' is ['172.217.7.174']
205.2 11 | 2018-04-06T19:50:29+0000 [stdout#info] A query: b'adservice.google.com'
205.3 11 | 2018-04-06T19:50:29+0000 [stdout#info] A query: b'apis.google.com'
205.3 11 | 2018-04-06T19:50:30+0000 [stdout#info] A query: b'clients5.google.com'
205.3 11 | 2018-04-06T19:50:30+0000 [stdout#info] A query: b'fonts.gstatic.com'
205.3 11 | 2018-04-06T19:50:30+0000 [stdout#info] A query: b'lh3.googleusercontent.com'
205.3 11 | 2018-04-06T19:50:30+0000 [stdout#info] A query: b'notifications.google.com'
205.3 11 | 2018-04-06T19:50:30+0000 [stdout#info] Result for b'fonts.gstatic.com' is ['172.217.9.195']
205.3 11 | 2018-04-06T19:50:30+0000 [stdout#info] Result for b'notifications.google.com' is ['172.217.12.238']
205.3 11 | 2018-04-06T19:50:30+0000 [stdout#info] Result for b'apis.google.com' is ['172.217.12.238']
205.3 11 | 2018-04-06T19:50:30+0000 [stdout#info] Result for b'clients5.google.com' is ['172.217.15.78']
205.3 11 | 2018-04-06T19:50:30+0000 [stdout#info] Result for b'lh3.googleusercontent.com' is ['172.217.7.193']
205.4 11 | 2018-04-06T19:50:30+0000 [stdout#info] Result for b'adservice.google.com' is ['172.217.15.98']
205.4 11 | 2018-04-06T19:50:30+0000 [stdout#info] A query: b'www.google.com'
205.4 11 | 2018-04-06T19:50:30+0000 [stdout#info] A query: b'ogs.google.com'
205.4 11 | 2018-04-06T19:50:30+0000 [stdout#info] A query: b'ssl.gstatic.com'
205.4 11 | 2018-04-06T19:50:30+0000 [stdout#info] A query: b'www.gstatic.com'
205.4 11 | 2018-04-06T19:50:30+0000 [stdout#info] Result for b'www.google.com' is ['172.217.7.196']
205.4 11 | 2018-04-06T19:50:30+0000 [stdout#info] Result for b'ssl.gstatic.com' is ['172.217.9.195']
205.5 11 | 2018-04-06T19:50:30+0000 [stdout#info] Result for b'ogs.google.com' is ['172.217.7.174']
205.5 11 | 2018-04-06T19:50:30+0000 [stdout#info] Result for b'www.gstatic.com' is ['172.217.15.99']
211.1 11 | 2018-04-06T19:50:35+0000 [stdout#info] A query: b'stackoverflow.com'
211.1 11 | 2018-04-06T19:50:35+0000 [stdout#info] A query: b'www.googleapis.com'
211.1 11 | 2018-04-06T19:50:35+0000 [stdout#info] Result for b'stackoverflow.com' is ['151.101.1.69', '151.101.65.69', '151.101.129.69', '151.101.193.69']
211.1 11 | 2018-04-06T19:50:35+0000 [stdout#info] Result for b'www.googleapis.com' is ['172.217.3.42', '216.58.217.106', '172.217.15.106', '172.217.15.74', '172.217.13.234', '172.217.13.74', '172.217.12.234', '172.217.9.202', '172.217.8.10', '172.217.7.202', '172.217.7.170', '172.217.7.138', '172.217.5.234']
211.4 11 | 2018-04-06T19:50:36+0000 [stdout#info] A query: b'cdn.sstatic.net'
211.4 11 | 2018-04-06T19:50:36+0000 [stdout#info] A query: b'i.stack.imgur.com'
211.4 11 | 2018-04-06T19:50:36+0000 [stdout#info] Result for b'i.stack.imgur.com' is ['104.16.111.18', '104.16.110.18', '104.16.109.18', '104.16.108.18', '104.16.112.18']
211.5 11 | 2018-04-06T19:50:36+0000 [stdout#info] Result for b'cdn.sstatic.net' is ['151.101.65.69', '151.101.129.69', '151.101.193.69', '151.101.1.69']
211.6 11 | 2018-04-06T19:50:36+0000 [stdout#info] A query: b'js-sec.indexww.com'
211.7 11 | 2018-04-06T19:50:36+0000 [stdout#info] A query: b'www.gravatar.com'
211.7 11 | 2018-04-06T19:50:36+0000 [stdout#info] Result for b'www.gravatar.com' is ['192.0.73.2']
211.7 11 | 2018-04-06T19:50:36+0000 [stdout#info] Result for b'js-sec.indexww.com' is ['23.36.33.160']
211.8 11 | 2018-04-06T19:50:36+0000 [stdout#info] A query: b'clients1.google.com'
211.9 11 | 2018-04-06T19:50:36+0000 [stdout#info] A query: b'clc.stackoverflow.com'
211.9 11 | 2018-04-06T19:50:36+0000 [stdout#info] A query: b'sb.scorecardresearch.com'
211.9 11 | 2018-04-06T19:50:36+0000 [stdout#info] A query: b'pixel.quantserve.com'
211.9 11 | 2018-04-06T19:50:36+0000 [stdout#info] Result for b'clients1.google.com' is ['172.217.13.78']
211.9 11 | 2018-04-06T19:50:36+0000 [stdout#info] A query: b'www.google-analytics.com'
211.9 11 | 2018-04-06T19:50:36+0000 [stdout#info] Result for b'clc.stackoverflow.com' is ['151.101.129.69', '151.101.65.69', '151.101.1.69', '151.101.193.69']
211.9 11 | 2018-04-06T19:50:36+0000 [stdout#info] Result for b'www.google-analytics.com' is ['172.217.15.110']
212.0 11 | 2018-04-06T19:50:36+0000 [stdout#info] Result for b'sb.scorecardresearch.com' is ['96.16.79.82']
212.0 11 | 2018-04-06T19:50:36+0000 [stdout#info] Result for b'pixel.quantserve.com' is ['66.150.118.33', '66.150.118.29', '66.150.118.26', '66.150.118.22', '66.150.118.60', '66.150.118.56', '66.150.118.50', '66.150.118.45']
212.2 11 | 2018-04-06T19:50:36+0000 [stdout#info] A query: b'stats.g.doubleclick.net'
212.2 11 | 2018-04-06T19:50:36+0000 [stdout#info] Result for b'stats.g.doubleclick.net' is ['173.194.204.154', '173.194.204.155', '173.194.204.156', '173.194.204.157']
217.9 11 | 2018-04-06T19:50:42+0000 [stdout#info] A query: b'clients4.google.com'
217.9 11 | 2018-04-06T19:50:42+0000 [stdout#info] Result for b'clients4.google.com' is ['172.217.13.78']
238.0 11 | 2018-04-06T19:51:02+0000 [stdout#info] A query: b'play.google.com'
238.0 11 | 2018-04-06T19:51:02+0000 [stdout#info] A query: b'clients6.google.com'
238.1 11 | 2018-04-06T19:51:02+0000 [stdout#info] Result for b'clients6.google.com' is ['172.217.13.78']
238.1 11 | 2018-04-06T19:51:02+0000 [stdout#info] Result for b'play.google.com' is ['172.217.15.78']
258.9 11 | 2018-04-06T19:51:23+0000 [stdout#info] A query: b'calendar.google.com'
258.9 11 | 2018-04-06T19:51:23+0000 [stdout#info] Result for b'calendar.google.com' is ['172.217.15.110']
282.9 11 | 2018-04-06T19:51:47+0000 [stdout#info] A query: b'lh3.googleusercontent.com'
282.9 11 | 2018-04-06T19:51:47+0000 [stdout#info] Result for b'lh3.googleusercontent.com' is ['172.217.15.97']
308.9 11 | 2018-04-06T19:52:13+0000 [stdout#info] A query: b'tasks.google.com'
308.9 11 | 2018-04-06T19:52:13+0000 [stdout#info] Result for b'tasks.google.com' is ['172.217.15.110']
418.8 11 | 2018-04-06T19:54:03+0000 [stdout#info] A query: b'play.google.com'
418.8 11 | 2018-04-06T19:54:03+0000 [stdout#info] Result for b'play.google.com' is ['172.217.15.110']
436.1 11 | 2018-04-06T19:54:20+0000 [stdout#info] A query: b'github.com'
436.1 11 | 2018-04-06T19:54:20+0000 [stdout#info] Result for b'github.com' is ['192.30.253.113', '192.30.253.112']
436.7 11 | 2018-04-06T19:54:21+0000 [stdout#info] A query: b'collector.githubapp.com'
436.7 11 | 2018-04-06T19:54:21+0000 [stdout#info] A query: b'www.google-analytics.com'
436.7 11 | 2018-04-06T19:54:21+0000 [stdout#info] A query: b'api.github.com'
436.7 11 | 2018-04-06T19:54:21+0000 [stdout#info] Result for b'www.google-analytics.com' is ['216.58.217.174']
436.8 11 | 2018-04-06T19:54:21+0000 [stdout#info] Result for b'collector.githubapp.com' is ['52.22.67.147', '54.236.197.250', '34.203.158.5']
436.8 11 | 2018-04-06T19:54:21+0000 [stdout#info] Result for b'api.github.com' is ['192.30.253.117', '192.30.253.116']
438.8 11 | 2018-04-06T19:54:23+0000 [stdout#info] A query: b'clients4.google.com'
438.8 11 | 2018-04-06T19:54:23+0000 [stdout#info] Result for b'clients4.google.com' is ['172.217.7.238']
466.0 11 | 2018-04-06T19:54:50+0000 [stdout#info] A query: b's-usc1c-nss-225.firebaseio.com'
466.0 11 | 2018-04-06T19:54:50+0000 [stdout#info] Result for b's-usc1c-nss-225.firebaseio.com' is ['35.201.97.85']
478.7 11 | 2018-04-06T19:55:03+0000 [stdout#info] A query: b'www.notion.so'
478.7 11 | 2018-04-06T19:55:03+0000 [stdout#info] Result for b'www.notion.so' is ['104.25.151.102', '104.25.152.102']
500.2 11 | 2018-04-06T19:55:24+0000 [stdout#info] A query: b'kafka-kube-staging-1.us-east-1.iris.internal'
500.3 11 | 2018-04-06T19:55:25+0000 [stdout#info] Result for b'kafka-kube-staging-1.us-east-1.iris.internal' is ['172.30.148.226']
536.8 11 | 2018-04-06T19:56:01+0000 [stdout#info] A query: b'play.google.com'
537.0 11 | 2018-04-06T19:56:01+0000 [stdout#info] Result for b'play.google.com' is ['172.217.7.174']
568.1 11 | 2018-04-06T19:56:32+0000 [stdout#info] A query: b'kafka-kube-staging-1.us-east-1.iris.internal'
568.1 11 | 2018-04-06T19:56:32+0000 [stdout#info] Result for b'kafka-kube-staging-1.us-east-1.iris.internal' is ['172.30.148.226']
568.2 11 | 2018-04-06T19:56:32+0000 [stdout#info] AAAA query, sending back A instead: b'kafka-kube-staging-1.us-east-1.iris.internal'
568.2 11 | 2018-04-06T19:56:32+0000 [stdout#info] A query: b'kafka-kube-staging-1.us-east-1.iris.internal'
568.2 11 | 2018-04-06T19:56:32+0000 [stdout#info] Result for b'kafka-kube-staging-1.us-east-1.iris.internal' is ['172.30.148.226']
568.3 11 | 2018-04-06T19:56:32+0000 [stdout#info] 15 query: b'kafka-kube-staging-1.us-east-1.iris.internal'
568.3 11 | 2018-04-06T19:56:32+0000 [DNSDatagramProtocol (UDP)] DNSDatagramProtocol starting on 20284
568.3 11 | 2018-04-06T19:56:32+0000 [DNSDatagramProtocol (UDP)] Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7fedb075d828>
568.3 11 | 2018-04-06T19:56:32+0000 [-] (UDP Port 20284 Closed)
568.3 11 | 2018-04-06T19:56:32+0000 [-] Stopping protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7fedb075d828>
658.9 11 | 2018-04-06T19:58:03+0000 [stdout#info] A query: b'$'
658.9 11 | 2018-04-06T19:58:03+0000 [stdout#info] AAAA query, sending back A instead: b'$'
658.9 11 | 2018-04-06T19:58:03+0000 [stdout#info] A query: b'$'
658.9 11 | 2018-04-06T19:58:03+0000 [stdout#info] getaddrinfo error: [Errno -2] Name does not resolve
658.9 11 | 2018-04-06T19:58:03+0000 [stdout#info] getaddrinfo error: [Errno -2] Name does not resolve
659.1 11 | 2018-04-06T19:58:03+0000 [stdout#info] A query: b'host'
659.1 11 | 2018-04-06T19:58:03+0000 [stdout#info] AAAA query, sending back A instead: b'host'
659.1 11 | 2018-04-06T19:58:03+0000 [stdout#info] A query: b'host'
659.2 11 | 2018-04-06T19:58:03+0000 [stdout#info] getaddrinfo error: [Errno -2] Name does not resolve
659.2 11 | 2018-04-06T19:58:03+0000 [stdout#info] getaddrinfo error: [Errno -2] Name does not resolve
732.9 11 | 2018-04-06T19:59:17+0000 [stdout#info] A query: b'play.google.com'
732.9 11 | 2018-04-06T19:59:17+0000 [stdout#info] Result for b'play.google.com' is ['172.217.8.14']
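
A long pod log like the one above is easier to scan mechanically. This is a rough sketch, assuming the "A query: b'...'" and "Result for b'...' is" line formats seen in this log. It lists names that were queried but never got a result line. The getaddrinfo error lines do not name the query they belong to, so the sketch infers failures from missing results instead. The name unanswered_names is hypothetical.

```python
import re

# Matches "A query: b'name'" but not "AAAA query, sending back A instead".
QUERY = re.compile(r"(?<![A-Z])A query: b'([^']*)'")
RESULT = re.compile(r"Result for b'([^']*)' is")


def unanswered_names(pod_log):
    """Return names with an A query but no "Result for" line, in first-seen order."""
    asked = []
    answered = set()
    for line in pod_log.splitlines():
        query = QUERY.search(line)
        if query and query.group(1) not in asked:
            asked.append(query.group(1))
        result = RESULT.search(line)
        if result:
            answered.add(result.group(1))
    return [name for name in asked if name not in answered]
```

On the excerpt shown above, this should report kubernetes, $, and host as the names that never resolved in the pod.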
ark3 commented 6 years ago

The curl issue seems to be IPv6-related. Can you try curl -4 to avoid IPv6 name resolution?

alexisvincent commented 6 years ago

I'm getting this same issue with curl and wget, and curl -4 doesn't work either. Insanely awesome tool btw!

alexisvincent commented 6 years ago

It would seem that the FQDN resolves fine: my-custom-service.default.svc.cluster.local works, but my-custom-service does not. Both work fine in dig, but the short name fails everywhere else (e.g. Chrome, JVM, curl, wget).
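
For context on why a short name and a fully qualified name can behave differently: a short name only resolves if something appends a search domain to it. The sketch below shows the conventional resolv.conf search/ndots expansion order. The search list and ndots value are assumptions (typical for a pod in the default namespace), and the macOS system resolver may not follow exactly this order. The name candidate_names is hypothetical.

```python
def candidate_names(name, search, ndots=1):
    """Return the names a conventional stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute; no search list applies.
        return [name]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    # Too few dots: walk the search list first, then try the bare name.
    return expanded + [name]


# Assumed search list for a pod in the "default" namespace.
POD_SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print(candidate_names("my-custom-service", POD_SEARCH, ndots=5))
```

If the local machine has no matching search domain, the short name depends entirely on the remote side performing this expansion, while the fully qualified name does not.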

ark3 commented 6 years ago

Sorry, the curl -4 suggestion was based on a misunderstanding on my part.

Can you send (via gist) the full telepresence.log and command line output for a simple command

telepresence --run curl -svk https://kubernetes/api/

that presumably fails when your computer is in this broken state?

One other thought... Maybe there is an mDNSResponder issue? Can you try the diagnostic portion of this StackOverflow answer?

esorey commented 6 years ago

Hi again,

Thanks for your patience; this issue is intermittent so I can't prod at it as often as I'd like. Here's the log I got from running the above command: https://gist.github.com/esorey/b50b90e9b2ddd50a909bc81d0eaa78a5

alexisvincent commented 6 years ago

And here's mine https://gist.github.com/alexisvincent/8d4e58f6222e2dbf016f98aed2b3d4dd

ark3 commented 6 years ago

@alexisvincent Thanks for the trace. You have found a bug in our fix for #192, released in Telepresence 0.81. Can you try this

TELEPRESENCE_VERSION=0.78 telepresence --run curl -svk https://kubernetes/api/

and send me the trace if it fails? I expect that one will work for you by avoiding the new bug #578.

@esorey Thanks for the trace. You're running 0.77 (I think), which does not have the bug identified above. Oddly, your trace does not have the relevant portion of the kubectl logs output, which is process 10 in that trace. In any case, let me fix the regression and then ask you to try again. Thanks for your patience.

alexisvincent commented 6 years ago

The command you provided now works, however my original problem persists. I can't hit my service.

Here are the logs after running TELEPRESENCE_VERSION=0.78 telepresence --run curl -svk http://retracted-service-name

with the following std out:

Starting proxy with method 'vpn-tcp', which has the following limitations: All processes are affected, only one telepresence can run per machine, and you can't use other VPNs. You may need to add cloud hosts with --also-proxy. For a full list of method limitations see https://telepresence.io/reference/methods.html
Volumes are rooted at $TELEPRESENCE_ROOT. See https://telepresence.io/howto/volumes.html for details.

No traffic is being forwarded from the remote Deployment to your local machine. You can use the --expose option to specify which ports you want to forward.

Password:
* Rebuilt URL to: http://retracted-service-name/
*   Trying 10.63.255.122...
* TCP_NODELAY set
* Connected to retracted-service-name (127.0.0.1) port 80 (#0)
> GET / HTTP/1.1
> Host: retracted-service-name
> User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)
> Accept: */*
> Referer:
>
* Empty reply from server
* Connection #0 to host retracted-service-name left intact
ark3 commented 6 years ago

Perhaps a silly question, but does that work from the cluster?

kubectl run asdf -it --rm --image=fedora --restart=Never -- curl -svk http://retracted-service-name
alexisvincent commented 6 years ago

Yes, sorry, should have mentioned that. Works in the cluster. Also, dig retracted-service-name resolves correctly, and telepresence --run curl -svk http://retracted-service-name.default.svc.cluster.local also works.

ark3 commented 6 years ago

Do I misunderstand? You're saying curl works with the full name but not with the short name? Do they resolve to different IPs?

alexisvincent commented 6 years ago

Exactly. Both resolve to the same IP with dig. But only the long one works with curl, netcat, etc.

alexisvincent commented 6 years ago

If you want to play around with it to reproduce, I'm happy to give you teamviewer access to my machine and cluster.

ark3 commented 6 years ago

That's confusing.

I'd like to release Telepresence to get the fix for #578 out there. That avoids environment shenanigans and version skew. Let's debug after that.

Thanks very much for helping me with this.

alexisvincent commented 6 years ago

Cool :) NP, this is a really awesome project 👍 I'm thinking of using it as the default user experience for folks interacting with our research cluster at Stellenbosch University. Anything we can do to lower the barrier to entry for folks.

Let me know if there's anything I can help with for the debugging process.

ark3 commented 6 years ago

@alexisvincent @esorey Can you please try again with Tel 0.82? Thanks for your help.

alexisvincent commented 6 years ago

working for me 🎉 :) Thanks

esorey commented 6 years ago

Thanks for getting this out! I'm trying it, and I'll let you know if the issue pops back up again.

esorey commented 6 years ago

Unfortunately I'm still getting this issue intermittently. Next time it happens I'll post some logs here.

ark3 commented 6 years ago

@rhs pointed out that this issue might be due to negative DNS caching. If you run your curl without Telepresence and get a DNS failure, macOS caches that failure for a little while. During that time the curl will fail even under Telepresence because of the cache.

Can you try clearing your Mac's DNS cache? sudo killall -HUP mDNSResponder should do it. Then try your Telepresence command again.
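
To illustrate the negative-caching idea, here is a toy model. It is not how macOS implements its cache; the class name, the TTL value, and the injectable clock are all assumptions made for the sketch. It shows why a failure recorded before Telepresence starts can keep blocking lookups until it expires or the cache is flushed.

```python
import time


class NegativeCache:
    """Toy negative DNS cache: remembers failed names for ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expiry = {}  # name -> time at which the cached failure expires

    def record_failure(self, name):
        self._expiry[name] = self.clock() + self.ttl

    def is_blocked(self, name):
        """True while a cached failure would short-circuit a new lookup."""
        expires = self._expiry.get(name)
        if expires is None:
            return False
        if self.clock() >= expires:
            del self._expiry[name]
            return False
        return True

    def flush(self):
        """Rough analogue of flushing the system DNS cache."""
        self._expiry.clear()
```

In this model, a lookup of kubernetes that fails before the session starts stays blocked for the TTL even after the name becomes resolvable, and flush() clears it immediately.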

esorey commented 6 years ago

Still no dice, unfortunately.

el-davo commented 6 years ago

I'm seeing much the same as described here on Ubuntu. I have to restart my computer nearly every time in order to get telepresence to work again.

esorey commented 6 years ago

Hmm, that's surprising @el-davo. My solution was to dual-boot Ubuntu; since I made the switch, this issue has disappeared for me.

amarchen commented 5 years ago

I had the same issue on my Mac running macOS 10.14 (Mojave). When using the vpn-tcp method and communicating with the locally running docker-for-desktop Kubernetes instance, hosts such as myservice.mynamespace.svc.cluster.local resolved just fine, but myservice.mynamespace did not.

Flushing the Mac's own DNS cache helped, per https://help.dreamhost.com/hc/en-us/articles/214981288-Flushing-your-DNS-cache-in-Mac-OS-X-and-Linux, but everything was slow for a while afterwards. On recent macOS (Mojave at least), the command to flush the cache is

sudo killall -HUP mDNSResponder;sudo killall mDNSResponderHelper;sudo dscacheutil -flushcache

I think dscacheutil -flushcache alone was enough at least once, but I didn't keep a record of it.

ark3 commented 5 years ago

@amarchen What is the search line in your computer's /etc/resolv.conf? Or, if you prefer not to reveal that, how many entries are there? I have some ideas around the particular failure mode you described.

amarchen commented 5 years ago

My /etc/resolv.conf is very minimal. Here's the content (I masked the exact values with "x.x" and "mycoworkingplacedomain"):

$ cat /etc/resolv.conf
#
# macOS Notice
#
# This file is not consulted for DNS hostname resolution, address
# resolution, or the DNS query routing mechanism used by most
# processes on this system.
#
# To view the DNS configuration used by this system, use:
#   scutil --dns
#
# SEE ALSO
#   dns-sd(1), scutil(8)
#
# This file is automatically generated.
#
domain mycoworkingplacedomain
nameserver 10.51.x.x
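
For anyone else answering the same question about their search line, a small parser can summarise the file. This is a sketch that handles only the nameserver, search, and domain keywords; the name parse_resolv_conf is hypothetical. As the notice in the file says, macOS does not consult /etc/resolv.conf for most processes, so the output of scutil --dns is the more reliable source there.

```python
def parse_resolv_conf(text):
    """Collect nameserver, search, and domain entries from resolv.conf text."""
    conf = {"nameserver": [], "search": [], "domain": None}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            conf["nameserver"].append(values[0])
        elif keyword == "search":
            conf["search"] = values  # a later search line replaces an earlier one
        elif keyword == "domain" and values:
            conf["domain"] = values[0]
    return conf
```

Applied to the file above, this would show one nameserver, a domain entry, and an empty search list.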
ark3 commented 5 years ago

Thanks. So, not what I was thinking, at least in your case. This needs more thought.

amarchen commented 5 years ago

"Thanks. So, not what I was thinking, at least in your case. This needs more thought."

That's a pity, @ark3. Well, if you happen to figure out what sort of investigation would help you, I'd be glad to try it. Logs, experiments, trials - whatever you need :)