Reverse tunnel discovery behind multiple load balancers

inve1 commented 3 years ago

Summary

I'm trying to set up a cluster across 2 separate AWS regions to provide HA against an AWS regional outage. The machines I'm trying to access are in various separate locations, not in the cloud, so in theory if one AWS region were having issues they'd still be able to connect to the healthy one.

I need to use the reverse tunnel functionality because of the network environment I'm in, and that's what is causing my issue.

Relevant information

I've deployed something very similar to the HA cluster terraform example. Basically, I deploy every resource from it in 2 regions and set up DynamoDB replication to share state.

This works fine, and when I connect machines to the cluster they show up in both places. I can add a route53 latency based record to send users of the webui to the nearest proxy ALB with a health check in case of issues. 👍

The problem I'm having is that I need to use the reverse tunnel feature for all the machines I'm dealing with, since they're deployed in locations where I have no control over the network (no exposing ports to outside traffic, etc.).

Unfortunately, with the current version of teleport (5.1.0) the reverse tunnel discovery will only connect to one tunnel_public_addr, even if there are proxies with different ones in the cluster (different network load balancers in different zones). The one returned by whatever proxy can be reached first through the "global" dns is used.

I've tried configuring the 2 sets of proxies to have different tunnel_public_addr values, since they're behind separate NLBs (I just set tunnel_public_addr to the local NLB DNS name, as the aforementioned example does). The problem is that, the way the discovery protocol works now, it always uses the address it finds here: [1] [2]

The proxy gossip messages do work and I am seeing 4 proxies listed (2 in each region), but the discovery code always connects to the tunnel address that was found in that first API call. That address is a single load balancer, which can only ever reach 2 of the 4 proxies.
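To make the symptom concrete, here is a rough model of it. This is not Teleport code: the proxy names and the `discover` helper are made up for illustration, and the only things taken from the setup above are the two NLB hostnames and the fact that each NLB fronts 2 proxies.

```python
# Hypothetical model, not Teleport code: one NLB per region, two proxies behind each.
NLB_BACKENDS = {
    "proxy-nlb.elb.us-east-2.amazonaws.com:3024": {"proxy-east-1", "proxy-east-2"},
    "proxy-nlb.elb.us-west-2.amazonaws.com:3024": {"proxy-west-1", "proxy-west-2"},
}


def discover(tunnel_addr, gossiped_proxies):
    """Split the gossiped proxies into those an agent can reach and those it
    cannot when every dial goes to the one tunnel address it learned first."""
    behind_lb = NLB_BACKENDS[tunnel_addr]
    connected = sorted(p for p in gossiped_proxies if p in behind_lb)
    unreachable = sorted(p for p in gossiped_proxies if p not in behind_lb)
    return connected, unreachable


# All 4 proxies are known through gossip, but only one address is ever dialed.
all_proxies = set().union(*NLB_BACKENDS.values())
connected, unreachable = discover(
    "proxy-nlb.elb.us-east-2.amazonaws.com:3024", all_proxies
)
print("connected:  ", connected)
print("unreachable:", unreachable)
```

No matter how many times the agent retries, the two proxies behind the other NLB stay in the unreachable list, because the dialed address never changes.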

I can work around this by creating a single DNS record that holds the IPs of both load balancers (they have 2 each, so 4 answers in this record) and setting that as tunnel_public_addr on every proxy. I tried this and it works, but it relies on resolving the record multiple times and getting "lucky" enough to see IPs from both load balancers at some point (so I cannot use a route53 geographical or latency record, since it would always return the same LB to the same client), and then on finding all the proxies in that region each time the load balancer is hit. Also, you can't return the IPs of 2 aliases at the same time in route53, so I'd have to set up something to keep this record up to date 🙃
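As a rough feel for how much "luck" is involved, the sketch below computes the chance of having hit both load balancers after a given number of dials. It assumes each dial independently lands on either load balancer with probability 1/2, which is an idealisation: real resolvers cache and order answers in their own ways, and it says nothing about which proxy the load balancer then picks. The function name is made up for illustration.

```python
from fractions import Fraction


def p_seen_both_lbs(dials: int) -> Fraction:
    """Probability that `dials` independent dials have hit both load balancers,
    assuming each dial lands on either load balancer with probability 1/2."""
    if dials < 2:
        return Fraction(0)
    # All dials land on the same load balancer with probability 2 * (1/2)**dials.
    return 1 - 2 * Fraction(1, 2) ** dials


for n in (1, 2, 5, 10):
    print(n, p_seen_both_lbs(n))
```

Under that assumption the odds improve quickly with retries, but an agent that only dials once or twice has a real chance of never seeing the second region.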

It seems like a way to handle this would be to add tunnel_public_addr to each proxy on the Proxies list in the discoveryRequest and use that when seeking a missing proxy. I think. I went through that code trying to see if there was a way to configure this to do what I want but it doesn't seem like it (please correct me if I'm wrong!)
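A minimal sketch of that idea, in Python for brevity rather than as a patch against the real code. `ProxyEntry`, `dial_plan`, and the proxy names are hypothetical, the per-proxy `tunnel_public_addr` field is the proposed addition and does not exist in the discovery request today, and falling back to the first address when the field is missing is my own assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class ProxyEntry:
    """Hypothetical per-proxy entry in a discovery request."""
    name: str
    tunnel_public_addr: Optional[str] = None  # proposed field; None if a proxy omits it


def dial_plan(
    first_addr: str, proxies: Iterable[ProxyEntry], connected: Set[str]
) -> Dict[str, List[str]]:
    """Group the proxies an agent is still missing by the address to dial for them."""
    plan: Dict[str, List[str]] = {}
    for proxy in proxies:
        if proxy.name in connected:
            continue
        # Assumption: fall back to the address from the first API call
        # when a proxy does not advertise its own tunnel address.
        addr = proxy.tunnel_public_addr or first_addr
        plan.setdefault(addr, []).append(proxy.name)
    return plan


EAST = "proxy-nlb.elb.us-east-2.amazonaws.com:3024"
WEST = "proxy-nlb.elb.us-west-2.amazonaws.com:3024"
gossiped = [
    ProxyEntry("proxy-east-1", EAST),
    ProxyEntry("proxy-east-2", EAST),
    ProxyEntry("proxy-west-1", WEST),
    ProxyEntry("proxy-west-2", WEST),
]
plan = dial_plan(EAST, gossiped, connected={"proxy-east-1"})
print(plan)
```

With the address attached to each proxy, the agent would know to dial the us-west-2 NLB for the two proxies it can never find behind the us-east-2 one.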

I see @awly you worked on this last, and I took a look at this PR: https://github.com/gravitational/teleport/pull/4290. There are comments about multiple proxies, so maybe you can chime in and let me know if this is a bad idea, or give me some pointers on creating a fix for this. I would be interested in contributing that.

Environment

If it helps, the teleport yaml on the client machine/IoT-ish device looks like this:

teleport:
  nodename: "xxx"
  auth_token: "xxxx"
  ca_pin: "xxxx"
  auth_servers:
    - "my-latency-based-record.example.net:443"
auth_service:
  enabled: no
ssh_service:
  enabled: yes
proxy_service:
  enabled: no

and on the proxies:

teleport:
  ca_pin: xxx
  nodename: ip-172-31-2-63-us-east-2-compute-internal
  advertise_ip: 172.31.2.63
  log:
    output: syslog
    severity: DEBUG

  data_dir: /var/lib/teleport
  storage:
    type: dir
    path: /var/lib/teleport/backend
  auth_servers:
    - xxxxx.elb.us-east-2.amazonaws.com:3025 # this is us-west-2 in proxies running in that region

auth_service:
  enabled: no

ssh_service:
  enabled: no

proxy_service:
  enabled: yes
  listen_addr: 0.0.0.0:3023
  tunnel_listen_addr: 0.0.0.0:3024
  web_listen_addr: 0.0.0.0:3080
  public_addr: my-latency-based-record.example.net:443 # this is the same in all regions
  ssh_public_addr: proxy-nlb.elb.us-east-2.amazonaws.com:3023 # this is us-west-2 in proxies running in that region
  tunnel_public_addr: proxy-nlb.elb.us-east-2.amazonaws.com:3024 # this is us-west-2 in proxies running in that region
  # tunnel_public_addr: teleport-proxy-tunnel.example.net:3024  <-- if I use this in all proxies and set it to answer with the IPs of both proxy-nlb.elb.us-east-2.amazonaws.com and proxy-nlb.elb.us-west-2.amazonaws.com, things work.. sort of

Thanks in advance for any help

awly commented 3 years ago

cc @fspmarshall

webvictim commented 3 years ago

Out of interest, what happens if you do this? I'm presuming it will still only connect to the first in the list.

On us-east-2:

tunnel_public_addr: ['proxy-nlb.elb.us-east-2.amazonaws.com:3024', 'proxy-nlb.elb.us-west-2.amazonaws.com:3024']

On us-west-2:

tunnel_public_addr: ['proxy-nlb.elb.us-west-2.amazonaws.com:3024', 'proxy-nlb.elb.us-east-2.amazonaws.com:3024']

inve1 commented 3 years ago

Yup, I tried that and, as you guessed, it only used the first value. I'm guessing this is why: https://github.com/gravitational/teleport/blob/v5.1.0/lib/service/service.go#L2402
