gravitational / teleport

The easiest and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Cannot add node to cluster? #5355

Closed huifanglu2018 closed 3 years ago

huifanglu2018 commented 3 years ago

When I run a standalone cluster it works fine, but I often get errors when I add another node to the cluster. One node runs as auth,node,proxy:

/usr/local/bin/teleport start  --config=/etc/teleport.yaml  --pid-file=/run/teleport.pid --insecure
Then I add another node as below:
/usr/local/bin/teleport start  --roles=node --config=/etc/teleport.yaml  --pid-file=/run/teleport.pid --insecure

it always shows:

DEBU [HTTP:PROX] No valid environment variables found. proxy/proxy.go:222
DEBU [HTTP:PROX] No proxy set in environment, returning direct dialer. proxy/proxy.go:137
ERRO [PROC:1]    Node failed to establish connection to cluster: ssh: handshake failed: no matching keys found. time/sleep.go:148

Can anybody help me get the node to join the cluster?

webvictim commented 3 years ago

Can you share the content of the /etc/teleport.yaml files for both nodes?

huifanglu2018 commented 3 years ago

First node (auth, node, proxy):

teleport:
    data_dir: /var/lib/teleport
    auth_token: f7adb7ccdf04037bcd2b52ec6010fd6f0caec94ba190b765
    auth_servers:
      - 10.10.10.136:3025
    connection_limits:
        max_connections: 1000
        max_users: 250
    log:
        output: stderr
        severity: DEBUG
auth_service:
    enabled: true
    #    session_recording: "proxy"
    proxy_checks_host_keys: no
    cluster_name: "teleportcluster"
    listen_addr: 0.0.0.0:3025
    tokens:
    - proxy,node:f7adb7ccdf04037bcd2b52ec6010fd6f0caec94ba190b765
    authentication:
      type: local
      second_factor: off
ssh_service:
    enabled: true
    labels:
        env: staging
proxy_service:
    enabled: true
    listen_addr: 0.0.0.0:3023
    web_listen_addr: 0.0.0.0:3080
    tunnel_listen_addr: 0.0.0.0:3024
    public_addr: 10.10.10.136:3080

Second node:

teleport:
    data_dir: /var/lib/teleport
    auth_token: f7adb7ccdf04037bcd2b52ec6010fd6f0caec94ba190b765
    auth_servers:
            - 10.10.10.136:3080
    connection_limits:
        max_connections: 1000
        max_users: 250
    log:
        output: stderr
        severity: DEBUG
auth_service:
    enabled: false
ssh_service:
    enabled: true
    labels:
        env: staging
proxy_service:
    enabled: false

@webvictim Thanks for your help.

webvictim commented 3 years ago

@huifanglu2018 I think the reason for the issue is that the token you've set is valid for both the proxy and node roles, but your node is only trying to join with the node role. With Teleport, a token must be used with the full set of roles it was created for.

Change - proxy,node:f7adb7ccdf04037bcd2b52ec6010fd6f0caec94ba190b765 to - node:f7adb7ccdf04037bcd2b52ec6010fd6f0caec94ba190b765, then restart your Teleport auth server and try joining the node again - it should work.
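For clarity, the corrected tokens stanza on the auth server would look like this (a sketch using the token value from the config posted above):

```yaml
auth_service:
    tokens:
    - node:f7adb7ccdf04037bcd2b52ec6010fd6f0caec94ba190b765
```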

huifanglu2018 commented 3 years ago
/home/ubuntu/teleport# /usr/local/bin/teleport start  --roles=node --config=/home/ubuntu/teleport/teleport.yaml --auth-server=https://10.254.9.135:3080  --pid-file=/run/teleport.pid  --insecure
DEBU [SQLITE]    Connected to: file:/var/lib/teleport/proc/sqlite.db?_busy_timeout=10000&_sync=OFF, poll stream period: 1s lite/lite.go:173
DEBU [SQLITE]    Synchronous: 0, busy timeout: 10000 lite/lite.go:218
DEBU [KEYGEN]    SSH cert authority started with no keys pre-compute. native/native.go:107
DEBU [PROC]      Adding service to supervisor. service:register.node service/supervisor.go:181
DEBU [PROC]      Adding service to supervisor. service:ssh.node service/supervisor.go:181
DEBU [PROC]      Adding service to supervisor. service:ssh.shutdown service/supervisor.go:181
DEBU [PROC]      Adding service to supervisor. service:common.rotate service/supervisor.go:181
DEBU [PROC:1]    Service has started. service:register.node service/supervisor.go:242
DEBU [PROC:1]    Service has started. service:ssh.shutdown service/supervisor.go:242
DEBU [PROC:1]    No signal pipe to import, must be first Teleport process. service/service.go:761
DEBU [PROC:1]    Service has started. service:ssh.node service/supervisor.go:242
DEBU [PROC:1]    Service has started. service:common.rotate service/supervisor.go:242
DEBU [PROC:1]    Connected state: never updated. service/connect.go:99
INFO [PROC]      Connecting to the cluster worker7 with TLS client certificate. service/connect.go:127
DEBU [PROC]      Attempting to connect to Auth Server directly. service/connect.go:793
DEBU [PROC]      Attempting to connect to Auth Server through tunnel. service/connect.go:801
DEBU [CLIENT]    HTTPS client init(proxyAddr=10.254.9.135:3080, insecure=true) client/weblogin.go:307
WARNING: You are using insecure connection to SSH proxy https://10.254.9.135:3080
DEBU [PROC]      Discovered address for reverse tunnel server: 10.254.9.135:3024. service/connect.go:881
DEBU [HTTP:PROX] No valid environment variables found. proxy/proxy.go:222
DEBU [HTTP:PROX] No proxy set in environment, returning direct dialer. proxy/proxy.go:137
ERRO [PROC:1]    Node failed to establish connection to cluster: ssh: handshake failed: no matching keys found. time/sleep.go:148

@webvictim Thanks so much for your reply, but sadly there are still issues. I get the same error after changing it to node:f7adb7ccdf04037bcd2b52ec6010fd6f0caec94ba190b765. I also find that if I change the auth server to 10.254.9.135:3025, it shows the log below. Node log:

/home/ubuntu/teleport# /usr/local/bin/teleport start  --roles=node --config=/home/ubuntu/teleport/teleport.yaml --auth-server=https://10.254.9.135:3025  --pid-file=/run/teleport.pid  --insecure --insecure-no-tls
DEBU [SQLITE]    Connected to: file:/var/lib/teleport/proc/sqlite.db?_busy_timeout=10000&_sync=OFF, poll stream period: 1s lite/lite.go:173
DEBU [SQLITE]    Synchronous: 0, busy timeout: 10000 lite/lite.go:218
DEBU [KEYGEN]    SSH cert authority started with no keys pre-compute. native/native.go:107
DEBU [PROC]      Adding service to supervisor. service:register.node service/supervisor.go:181
DEBU [PROC]      Adding service to supervisor. service:ssh.node service/supervisor.go:181
DEBU [PROC]      Adding service to supervisor. service:ssh.shutdown service/supervisor.go:181
DEBU [PROC]      Adding service to supervisor. service:common.rotate service/supervisor.go:181
DEBU [PROC:1]    Service has started. service:ssh.shutdown service/supervisor.go:242
DEBU [PROC:1]    Service has started. service:ssh.node service/supervisor.go:242
DEBU [PROC:1]    Service has started. service:common.rotate service/supervisor.go:242
DEBU [PROC:1]    No signal pipe to import, must be first Teleport process. service/service.go:761
DEBU [PROC:1]    Service has started. service:register.node service/supervisor.go:242
DEBU [PROC:1]    Connected state: never updated. service/connect.go:99
INFO [PROC]      Connecting to the cluster worker7 with TLS client certificate. service/connect.go:127
DEBU [PROC]      Attempting to connect to Auth Server directly. service/connect.go:793
DEBU [PROC]      Attempting to connect to Auth Server through tunnel. service/connect.go:801
DEBU [CLIENT]    HTTPS client init(proxyAddr=10.254.9.135:3025, insecure=true) client/weblogin.go:307
WARNING: You are using insecure connection to SSH proxy https://10.254.9.135:3025
ERRO [PROC:1]    "Node failed to establish connection to cluster: 404 page not found\n." time/sleep.go:148

huifanglu2018 commented 3 years ago

auth log:

ERRO [AUTH:1]    Failed to retrieve client pool. Client cluster worker7, target cluster teleportcluster, error:
ERROR REPORT:
Original Error: *trace.NotFoundError key "/authorities/host/worker7" is not found
Stack Trace:
    /go/src/github.com/gravitational/teleport/lib/backend/memory/memory.go:186 github.com/gravitational/teleport/lib/backend/memory.(*Memory).Get
    /go/src/github.com/gravitational/teleport/lib/backend/report.go:159 github.com/gravitational/teleport/lib/backend.(*Reporter).Get
    /go/src/github.com/gravitational/teleport/lib/backend/wrap.go:89 github.com/gravitational/teleport/lib/backend.(*Wrapper).Get
    /go/src/github.com/gravitational/teleport/lib/services/local/trust.go:207 github.com/gravitational/teleport/lib/services/local.(*CA).GetCertAuthority
    /go/src/github.com/gravitational/teleport/lib/cache/cache.go:892 github.com/gravitational/teleport/lib/cache.(*Cache).GetCertAuthority
    /go/src/github.com/gravitational/teleport/lib/auth/middleware.go:546 github.com/gravitational/teleport/lib/auth.ClientCertPool
    /go/src/github.com/gravitational/teleport/lib/auth/middleware.go:253 github.com/gravitational/teleport/lib/auth.(*TLSServer).GetConfigForClient
    /opt/go/src/crypto/tls/handshake_server.go:141 crypto/tls.(*Conn).readClientHello
    /opt/go/src/crypto/tls/handshake_server.go:40 crypto/tls.(*Conn).serverHandshake
    /opt/go/src/crypto/tls/conn.go:1362 crypto/tls.(*Conn).Handshake
    /go/src/github.com/gravitational/teleport/lib/multiplexer/tls.go:141 github.com/gravitational/teleport/lib/multiplexer.(*TLSListener).detectAndForward
    /opt/go/src/runtime/asm_amd64.s:1375 runtime.goexit
User Message: key "/authorities/host/worker7" is not found. auth/middleware.go:261
WARN [MXTLS:1]   Handshake failed. error:remote error: tls: bad certificate multiplexer/tls.go:143
huifanglu2018 commented 3 years ago

The guide doesn't seem to work for me. I've tried fixing many possible mistakes, but I still cannot add the second node to the cluster... https://goteleport.com/teleport/docs/quickstart/#add-a-node-to-the-cluster Is there an example configuration for me?

webvictim commented 3 years ago

@huifanglu2018 You may have some old credentials cached for some reason. Check ps -ef | grep teleport and make sure there are no other Teleport processes running on the node you're trying to add, remove /var/lib/teleport completely, and then run the teleport start command again.
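A minimal sketch of that reset sequence. The destructive command is exercised here against a throwaway directory so the snippet is safe to run anywhere; on the real node the path is /var/lib/teleport:

```shell
# Stand-in for /var/lib/teleport so this can run without root.
demo_dir="$(mktemp -d)/teleport-data"
mkdir -p "$demo_dir/proc"              # teleport keeps cached state under proc/
rm -rf "$demo_dir"                     # equivalent of: rm -rf /var/lib/teleport
test ! -d "$demo_dir" && echo "stale state removed"

# On the real node, the sequence would be:
#   pgrep -a teleport                  # confirm no other teleport processes remain
#   rm -rf /var/lib/teleport
#   /usr/local/bin/teleport start --roles=node --config=/etc/teleport.yaml
```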

huifanglu2018 commented 3 years ago

@webvictim Thank you so much! It works after removing /var/lib/teleport. 👍

suchisur commented 1 year ago

@webvictim Jul 12 09:18:28 ip-10-25-1-55 teleport[7630]: 2023-07-12T09:18:28Z ERRO [PROC:1] Node failed to establish connection to cluster: Failed to connect to Proxy Server through tunnel: connection error: desc = "transport: Error while dialing: failed to dial: ssh: handshake failed: read tcp 10.25.1.55:56066->10.25.0.212:3024: i/o timeout". pid:7630.1 service/connect.go:123

These are the error logs from a node trying to join the cluster. I tried removing /var/lib/teleport and restarting, but it did not help. What could be the possible solution for this? I'm trying to set Teleport up in an HA environment with load balancers.

webvictim commented 1 year ago

@suchisur It looks like you don't have TLS routing enabled in your cluster, so agents are trying to join over the traditional reverse tunnel port (3024).

You should either:
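For reference, TLS routing (which lets agents join through the proxy's web port instead of the separate 3024 tunnel port) is enabled on the cluster side with config along these lines; this is a sketch, and it requires config version v2 or later on the auth server:

```yaml
version: v2
auth_service:
  proxy_listener_mode: multiplex
```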

longansv commented 8 months ago

Hi @webvictim These are error logs from a node trying to join the cluster. Please help me

[root@jumpserver03 ~]# systemctl status teleport -l
● teleport.service - Teleport Service
   Loaded: loaded (/usr/lib/systemd/system/teleport.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2024-03-11 11:11:50 +07; 4min 55s ago
 Main PID: 3357 (teleport)
   CGroup: /system.slice/teleport.service
           └─3357 /usr/local/bin/teleport start --config /etc/teleport.yaml --pid-file=/run/teleport.pid

Mar 11 11:15:28 jumpserver03.vascloud.vnpt.vn teleport[3357]: 2024-03-11T11:15:28+07:00 INFO [AUTH]      Attempting registration via proxy server. auth/register.go:288
Mar 11 11:15:28 jumpserver03.vascloud.vnpt.vn teleport[3357]: 2024-03-11T11:15:28+07:00 ERRO [PROC:1]    Node failed to establish connection to cluster: Post "https://longhd98.click:443/v1/webapi/host/credentials": tls: failed to verify certificate: x509: certificate signed by unknown authority. pid:3357.1 service/connect.go:91
Mar 11 11:15:48 jumpserver03.vascloud.vnpt.vn teleport[3357]: 2024-03-11T11:15:48+07:00 INFO [PROC:1]    Joining the cluster with a secure token. pid:3357.1 service/connect.go:417
Mar 11 11:15:48 jumpserver03.vascloud.vnpt.vn teleport[3357]: 2024-03-11T11:15:48+07:00 INFO [AUTH]      Attempting registration via proxy server. auth/register.go:288
Mar 11 11:15:48 jumpserver03.vascloud.vnpt.vn teleport[3357]: 2024-03-11T11:15:48+07:00 ERRO [PROC:1]    Instance failed to establish connection to cluster: Post "https://longhd98.click:443/v1/webapi/host/credentials": tls: failed to verify certificate: x509: certificate signed by unknown authority. pid:3357.1 service/connect.go:91
Mar 11 11:15:50 jumpserver03.vascloud.vnpt.vn teleport[3357]: 2024-03-11T11:15:50+07:00 WARN [UPLOAD:1]  The Instance connector is still not available, process-wide services such as session uploading will not function pid:3357.1 service/service.go:2863
Mar 11 11:16:17 jumpserver03.vascloud.vnpt.vn teleport[3357]: 2024-03-11T11:16:17+07:00 INFO [PROC:1]    Joining the cluster with a secure token. pid:3357.1 service/connect.go:417
Mar 11 11:16:17 jumpserver03.vascloud.vnpt.vn teleport[3357]: 2024-03-11T11:16:17+07:00 INFO [AUTH]      Attempting registration via proxy server. auth/register.go:288
Mar 11 11:16:17 jumpserver03.vascloud.vnpt.vn teleport[3357]: 2024-03-11T11:16:17+07:00 ERRO [PROC:1]    Node failed to establish connection to cluster: Post "https://longhd98.click:443/v1/webapi/host/credentials": tls: failed to verify certificate: x509: certificate signed by unknown authority. pid:3357.1 service/connect.go:91
Mar 11 11:16:20 jumpserver03.vascloud.vnpt.vn teleport[3357]: 2024-03-11T11:16:20+07:00 WARN [UPLOAD:1]  The Instance connector is still not available, process-wide services such as session uploading will not function pid:3357.1 service/service.go:2863
webvictim commented 8 months ago

It seems like the certificate being presented by your proxy server is not trusted. If it's not from a trusted CA, you probably want to add this unit file override as root:

mkdir -p /etc/systemd/system/teleport.service.d
cat <<EOF > /etc/systemd/system/teleport.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/local/bin/teleport start --config /etc/teleport.yaml --pid-file=/run/teleport.pid --insecure
EOF
systemctl daemon-reload
systemctl restart teleport

Note that this isn't secure and shouldn't be used in production; it's just for testing. The correct way to fix this is to get a TLS certificate from a trusted CA onto your Teleport proxy.

longansv commented 8 months ago

That's right, I'm using a self-signed certificate. Do I have to add the above configuration to the Teleport server, the remote node, or both?

webvictim commented 8 months ago

This configuration would be on any remote node/agent joining the Teleport cluster.

The more secure alternative is to install the public key of the issuing CA or self-signed certificate onto each of these joining servers: https://ubuntu.com/server/docs/security-trust-store
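A sketch of that approach on Debian/Ubuntu, per the linked trust-store doc. Here proxy-ca.crt is a hypothetical file holding the proxy's CA or self-signed certificate, and a temp directory stands in for the real trust store so the snippet runs without root:

```shell
# Real destination is /usr/local/share/ca-certificates (filename must end in .crt).
dest="$(mktemp -d)"
printf '%s\n' '-----BEGIN CERTIFICATE-----' '...' '-----END CERTIFICATE-----' \
    > "$dest/teleport-proxy.crt"
ls "$dest"

# On the real node, the equivalent would be:
#   sudo cp proxy-ca.crt /usr/local/share/ca-certificates/teleport-proxy.crt
#   sudo update-ca-certificates
```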

jdeletronica commented 6 months ago

While running a standalone cluster it works fine, but I often get errors when adding another node to that cluster. One node runs as auth,node,proxy:

/usr/local/bin/teleport start  --config=/etc/teleport.yaml  --pid-file=/run/teleport.pid --insecure
add another node as below:
/usr/local/bin/teleport start  --roles=node --config=/etc/teleport.yaml  --pid-file=/run/teleport.pid --insecure

it always shows:

DEBU [HTTP:PROX] No valid environment variables found. proxy/proxy.go:222
DEBU [HTTP:PROX] No proxy set in environment, returning direct dialer. proxy/proxy.go:137
ERRO [PROC:1]    Node failed to establish connection to cluster: ssh: handshake failed: no matching keys found. time/sleep.go:148

Can anybody help me get the node into the cluster?

Type this command on Proxmox:

command: nano /etc/hostname and /etc/hosts

A file will open, and inside that file the name will be pve.

Change it to pve2, pve3, or pve4 as you prefer, because it conflicts with the master pve server.

wszychta commented 4 months ago

@webvictim I know that this is an old topic, but I have an issue with a single instance and I'm out of options for self-debugging. The whole cluster is on version 15.4.4. This is the log from the Teleport node:

2024-06-18T11:56:32Z INFO [PROC:1]    Generating new host UUID pid:1310.1 host_uuid:fd675f54-6dff-4d57-92d8-39e1e02f2233 service/service.go:6216
2024-06-18T11:56:33Z INFO [PROC:1]    Service is creating new listener. pid:1310.1 type:diag address:127.0.0.1:3000 service/signals.go:249
2024-06-18T11:56:33Z INFO [DIAG:1]    Starting diagnostic service. pid:1310.1 listen_address:127.0.0.1:3000 service/service.go:3364
2024-06-18T11:56:33Z INFO [PROC:1]    Service is creating new listener. pid:1310.1 type:debug address:/var/lib/teleport/debug.sock service/signals.go:249
2024-06-18T11:56:33Z INFO [PROC:1]    Joining the cluster with a secure token. pid:1310.1 service/connect.go:464
2024-06-18T11:56:33Z INFO [PROC:1]    Joining the cluster with a secure token. pid:1310.1 service/connect.go:464
2024-06-18T11:56:33Z INFO             Attempting registration via proxy server. join/join.go:253
2024-06-18T11:56:33Z INFO             Attempting registration via proxy server. join/join.go:253
2024-06-18T11:56:33Z INFO             Successfully registered via proxy server. join/join.go:260
2024-06-18T11:56:33Z INFO [PROC:1]    Successfully obtained credentials to connect to the cluster. pid:1310.1 identity:App service/connect.go:524
2024-06-18T11:57:03Z WARN [UPLOAD:1]  The Instance connector is still not available, process-wide services such as session uploading will not function pid:1310.1 service/service.go:3024
2024-06-18T11:57:33Z WARN [UPLOAD:1]  The Instance connector is still not available, process-wide services such as session uploading will not function pid:1310.1 service/service.go:3024
2024-06-18T11:58:03Z WARN [UPLOAD:1]  The Instance connector is still not available, process-wide services such as session uploading will not function pid:1310.1 service/service.go:3024

And that log goes on forever. I have replaced this VM, and I have also checked for configuration and token mistakes. I replaced the token with a new one, and I also removed the /var/lib/teleport directory and started the service again. Other instances with the same token are working fine. The main difference is in the app configuration. Here is the node configuration:

version: v3
teleport:
  nodename: sdffsdgfsfdgdf
  join_params:
    token_name: !!!correct_token!!!
    method: token
  cache:
    enabled: yes
    max_backoff: 5m
  proxy_server: X.X.X.X:443
  data_dir: /var/lib/teleport
  log:
    output: /var/log/teleport/teleport.log
    severity: INFO
    format:
      output: text
  ca_pin: ""
  diag_addr: "127.0.0.1:3000"
proxy_service:
  enabled: False
auth_service:
  enabled: False
app_service:
  apps:
  - insecure_skip_verify: true
    labels:
      app: grafana
      env: internal
    name: internal-grafana
    public_addr: grafana.XXXXX.xyz
    rewrite:
      headers:
      - 'Origin: https://grafana.XXXXX.xyz'
      - 'Host: grafana.XXXXX.xyz'
      jwt_claims: roles
      redirect:
      - localhost
      - 10.109.0.3
    uri: http://10.109.0.3:8081
  enabled: true
db_service:
  enabled: false
ssh_service:
  enabled: false

Systemd service:

[Unit]
Description=Teleport Service
After=network.target

[Service]
Type=simple
Restart=on-failure
EnvironmentFile=-/etc/default/teleport
ExecStart=/usr/local/bin/teleport start --insecure --config /etc/teleport.yaml --pid-file=/run/teleport.pid
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/run/teleport.pid
LimitNOFILE=524288

[Install]
WantedBy=multi-user.target

Can you help me with that? Do you have any suggestions on where to look for the issue with this VM? Maybe there is an issue in the auth service, or some strange record in the Postgres database we are using for auth? I have also noticed that the other nodes have a directory called /var/log/teleport/log/upload, which is not created on the failed node after a restart.

webvictim commented 4 months ago

@wszychta Can you share the listing of tctl get token/!!!correct_token!!!? You can redact the token itself, I just want to see the extra information associated with it.

wszychta commented 4 months ago

@webvictim I can provide it, but I have found a working solution. Instead of using the IP address with --insecure, we have configured the agents to go via the public Teleport endpoint.

The funny thing is that when I switched back (to the IP address and --insecure), it was still working fine. When I face the same issue again, I will provide what you asked for.

webvictim commented 4 months ago

That sounds like a bug to me; I'll see if I can reproduce it.

If your goal is to join agents via a private address rather than the public address, you could try one of the workarounds detailed in this comment: https://github.com/gravitational/teleport/issues/27885#issue-1758924661

Using --insecure is never a good idea!

wszychta commented 4 months ago

@webvictim I was able to reproduce this issue. As suggested, I ran tctl get token/!!!correct_token!!! with the result below:

tctl get token/!!!correct_token!!!
ERROR: provisioning token(************************token!!!) not found

I'm also including part of the output of tctl tokens ls:

❯ tctl tokens ls
Token                            Type        Labels Expiry Time (UTC)                   
-------------------------------- ----------- ------ ----------------------------------- 
!!!correct_token!!! Node,App,Db        01 Jan 70 00:00 UTC (-477630h56m4s) 

This token is correct and is created in the auth server config file. Again, changing the public IP address of the load balancer to the public hostname solved the issue of connecting the node to the proxy server. As a reminder, we are using Teleport version 15.4.4.
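For context, a static never-expiring token with those Node,App,Db types is typically defined in the auth server's teleport.yaml along these lines (a sketch, with the token value redacted as in the listing above):

```yaml
auth_service:
  tokens:
  - node,app,db:!!!correct_token!!!
```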