gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Unable to join cluster over port 3024 #22741

Closed programmerq closed 1 year ago

programmerq commented 1 year ago

Expected behavior:

After https://github.com/gravitational/teleport/pull/13598 was merged, I would expect a tunnel mode agent to be able to join when the port 3024 tunnel service is specified as the proxy_server. (I did my test with ssh_service, but the issue is seen with any other tunnel mode service, including kube_service, db_service, etc.)

Current behavior:

When attempting to join with the following teleport.yaml:

version: v3
teleport:
  join_params:
    token_name: mytoken
    method: token
  proxy_server: other.proving.cf:3024
auth_service:
  enabled: "no"
ssh_service:
  enabled: "yes"

The node is unable to join with the following error:

2023-03-07T21:01:02Z INFO [PROC:1]    Joining the cluster with a secure token. pid:7.1 service/connect.go:585
2023-03-07T21:01:02Z INFO [AUTH]      Attempting registration via proxy server. auth/register.go:251
2023-03-07T21:01:02Z ERRO [PROC:1]    Instance failed to establish connection to cluster: Post "https://teleport.example.com:3024/v1/webapi/host/credentials": tls: first record does not look like a TLS handshake. pid:7.1 service/connect.go:119
2023-03-07T21:01:12Z INFO [PROC:1]    Joining the cluster with a secure token. pid:7.1 service/connect.go:585
2023-03-07T21:01:12Z INFO [AUTH]      Attempting registration via proxy server. auth/register.go:251
2023-03-07T21:01:12Z ERRO [PROC:1]    Node failed to establish connection to cluster: Post "https://teleport.example.com:3024/v1/webapi/host/credentials": tls: first record does not look like a TLS handshake. pid:7.1 service/connect.go:119

I believe what might be happening is that the server sends the SSH banner to the TLS client before it sees the TLS Client Hello. In other cases where SSH and TLS run on the same port, it is usually necessary to send the SSH banner after a small delay, usually 500-2000 ms. That gives a TLS client a chance to send its TLS Client Hello. If it is an SSH client that is trying to connect, the SSH client simply sees a short delay before the server sends the banner.

Wireshark packet capture info

I ran a packet capture while trying to do a join. This is the exported view of the packets in my Wireshark user interface. In particular, packet 13 shows the TLS Client Hello, and then packet 20 (39 ms later) shows the SSH banner coming from the remote side.

|No.|Time (s)|Timestamp|Source|Destination|Protocol|Length|TCP stream|TTL|Info|SNI|ALPN|cert name|
|---|--------|---------|------|-----------|--------|------|----------|---|----|---|----|---------|
|10|0.839476|21:11:19.348592|172.17.0.3|teleport.example.com|TCP|74|2|64|58240 > 3024 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM TSval=1032280152 TSecr=0 WS=128| | | |
|11|0.880791|21:11:19.389907|teleport.example.com|172.17.0.3|TCP|74|2|63|3024 > 58240 [SYN, ACK] Seq=0 Ack=1 Win=65160 Len=0 MSS=1460 SACK_PERM TSval=3549586368 TSecr=1032280152 WS=128| | | |
|12|0.880867|21:11:19.389983|172.17.0.3|teleport.example.com|TCP|66|2|64|58240 > 3024 [ACK] Seq=1 Ack=1 Win=64256 Len=0 TSval=1032280193 TSecr=3549586368| | | |
|13|0.881420|21:11:19.390536|172.17.0.3|teleport.example.com|TLSv1|340|2|64|Client Hello|teleport.example.com| | |
|14|0.881451|21:11:19.390567|teleport.example.com|172.17.0.3|TCP|66|2|63|3024 > 58240 [ACK] Seq=1 Ack=275 Win=64896 Len=0 TSval=3549586369 TSecr=1032280194| | | |
|20|0.920580|21:11:19.429696|teleport.example.com|172.17.0.3|SSH|84|2|63|Server: Protocol (SSH-2.0-Teleport)| | | |
|21|0.920587|21:11:19.429703|172.17.0.3|teleport.example.com|TCP|66|2|64|58240 > 3024 [ACK] Seq=275 Ack=19 Win=64256 Len=0 TSval=1032280233 TSecr=3549586408| | | |
|22|0.920856|21:11:19.429972|172.17.0.3|teleport.example.com|TCP|66|2|64|58240 > 3024 [FIN, ACK] Seq=275 Ack=19 Win=64256 Len=0 TSval=1032280233 TSecr=3549586408| | | |
|23|0.922187|21:11:19.431303|teleport.example.com|172.17.0.3|TCP|66|2|63|3024 > 58240 [FIN, ACK] Seq=19 Ack=276 Win=64896 Len=0 TSval=3549586409 TSecr=1032280233| | | |
|24|0.922194|21:11:19.431310|172.17.0.3|teleport.example.com|TCP|66|2|64|58240 > 3024 [ACK] Seq=276 Ack=20 Win=64256 Len=0 TSval=1032280234 TSecr=3549586409| | | |

A packet capture taken on the remote side (different event, same result) shows that the remote end sends the SSH banner immediately after the TCP connection is established, and only sees the TLS Client Hello after it has already sent the banner (see packet 108 and then 110):

|No.|Time (s)|Timestamp|Source|Destination|Protocol|Length|TCP stream|TTL|Info|SNI|ALPN|cert name|
|---|--------|---------|------|-----------|--------|------|----------|---|----|---|----|---------|
|103|2.433731|21:24:32.979021|192.168.68.200|192.168.75.200|TCP|74|7|253|55376 > 3024 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM TSval=2959330896 TSecr=0 WS=256| | | |
|104|2.433747|21:24:32.979037|192.168.75.200|192.168.68.200|TCP|74|7|255|3024 > 55376 [SYN, ACK] Seq=0 Ack=1 Win=62643 Len=0 MSS=8961 SACK_PERM TSval=3469493350 TSecr=2959330896 WS=128| | | |
|105|2.434550|21:24:32.979840|192.168.68.200|192.168.75.200|TCP|66|7|253|55376 > 3024 [ACK] Seq=1 Ack=1 Win=27136 Len=0 TSval=2959330897 TSecr=3469493350| | | |
|106|2.434602|21:24:32.979892|192.168.68.200|192.168.75.200|PROXYv1|117|7|253|55376 > 3024 [PSH, ACK] Seq=1 Ack=1 Win=27136 Len=51 TSval=2959330897 TSecr=3469493350| | | |
|107|2.434607|21:24:32.979897|192.168.75.200|192.168.68.200|TCP|66|7|255|3024 > 55376 [ACK] Seq=1 Ack=52 Win=62592 Len=0 TSval=3469493351 TSecr=2959330897| | | |
|108|2.434765|21:24:32.980055|192.168.75.200|192.168.68.200|SSH|84|7|255|Server: Protocol (SSH-2.0-Teleport)| | | |
|109|2.435494|21:24:32.980784|192.168.68.200|192.168.75.200|TCP|66|7|253|55376 > 3024 [ACK] Seq=52 Ack=19 Win=27136 Len=0 TSval=2959330898 TSecr=3469493351| | | |
|110|2.436997|21:24:32.982287|192.168.68.200|192.168.75.200|TLSv1|340|7|253|Client Hello|teleport.example.com| | |
|111|2.437315|21:24:32.982605|192.168.75.200|192.168.68.200|TCP|66|7|255|3024 > 55376 [RST, ACK] Seq=19 Ack=326 Win=62336 Len=0 TSval=3469493354 TSecr=2959330900| | | |
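The TLS error in the join logs is what Go's `crypto/tls` client reports when the first bytes it reads are not a TLS record. The following is a rough sketch that reproduces the message outside of Teleport, assuming an in-memory `net.Pipe` as a stand-in for the port 3024 listener; `handshakeAgainstBanner` is a hypothetical helper, not part of Teleport.

```go
package main

import (
	"crypto/tls"
	"io"
	"net"
)

// handshakeAgainstBanner (hypothetical helper) runs a TLS client handshake
// against a peer that immediately sends banner instead of a TLS record,
// and returns the resulting handshake error.
func handshakeAgainstBanner(banner string) error {
	clientSide, serverSide := net.Pipe()
	defer clientSide.Close()
	defer serverSide.Close()

	// Stand-in server: discard whatever the client writes (its Client Hello)
	// and send the banner right away, like the SSH packet in the capture.
	go io.Copy(io.Discard, serverSide)
	go serverSide.Write([]byte(banner))

	// ServerName is required by crypto/tls unless verification is skipped.
	client := tls.Client(clientSide, &tls.Config{ServerName: "teleport.example.com"})
	return client.Handshake()
}
```

Calling `handshakeAgainstBanner("SSH-2.0-Teleport\r\n")` should return the same "first record does not look like a TLS handshake" error seen in the logs, because the byte `S` is not a valid TLS record type.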

I don't know that much about how the integration test for this feature works, but I'm afraid that the test at https://github.com/gravitational/teleport/blob/0b3a67f69fc371ac85880ec4655a623b0e44aa13/integration/integration_test.go#L7209-L7237 isn't bringing up a full-blown TCP stack. If it is, I wonder if this is simply a race condition that is difficult to hit with the node and proxy service both running as objects in memory, but is easy to hit with normal network latency.

Bug details:

atburke commented 1 year ago

Does the proxy service have proxy_protocol: on set? If it does, this might be a duplicate of #21353.

programmerq commented 1 year ago

I did have proxy_protocol: on set in the teleport.yaml for proxy_service. I agree this is a duplicate of #21353. I'll close this in favor of that one, and the PR set to resolve it!