linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

MariaDB Galera gcomm protocol does not work with opaque linkerd-edge-21.1.3 #5648

Closed barkardk closed 3 years ago

barkardk commented 3 years ago

What is the issue?

MariaDB client connections over port 3306 and the Galera SST port 4444 are meshed perfectly using the new opaque-ports annotation. Galera connectivity breaks when ports 4567 and 4568 are meshed: 4567 is reserved for Galera Cluster replication traffic and 4568 is the port for Incremental State Transfer.

How can it be reproduced?

Set up a vanilla MariaDB Galera cluster using the Bitnami Helm chart, then mesh the cluster with the following annotation:

     config.linkerd.io/opaque-ports: "4568, 4444,4567,3306"
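
For reference, one way to get these annotations onto the pods at install time is through the chart's values; this is only a sketch and assumes the Bitnami mariadb-galera chart exposes a podAnnotations value (release name chosen to match the pod names in the logs below):

    # values.yaml (sketch; assumes the chart supports podAnnotations)
    podAnnotations:
      linkerd.io/inject: enabled
      config.linkerd.io/opaque-ports: "3306,4444,4567,4568"

    helm install maria-mb bitnami/mariadb-galera -f values.yaml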

Logs, error output, etc

MariaDB galera

2021-02-01 23:42:58 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
     at gcomm/src/pc.cpp:connect():160
2021-02-01 23:42:58 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)
2021-02-01 23:42:58 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1632: Failed to open channel 'galera' at 'gcomm://maria-mb-mariadb-galera-headless.default.svc.cluster.local': -110 (Connection timed out)
2021-02-01 23:42:58 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2021-02-01 23:42:58 0 [ERROR] WSREP: wsrep::connect(gcomm://maria-mb-mariadb-galera-headless.default.svc.cluster.local) failed: 7
2021-02-01 23:42:58 0 [ERROR] Aborting

linkerd-proxy

[    33.160529s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:54426 target.addr=10.96.0.105:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[    33.160673s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:54426 target.addr=10.96.0.105:4567}: linkerd_detect: Detected protocol=None elapsed=3.78507ms
[    33.160890s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[    33.160976s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.0.105:4567
[    33.161332s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:54432 keepalive=Some(10s)
[    33.161426s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[    33.162021s]  INFO ThreadId(01) outbound:accept{peer.addr=10.96.1.77:54426 target.addr=10.96.0.105:4567}: linkerd_app_core::serve: Connection closed error=server: Transport endpoint is not connected (os error 107)
[    37.209900s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:44964 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=117
[    37.210226s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:44964 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[    38.026182s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:54750 target.addr=10.96.0.105:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[    38.026219s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:54750 target.addr=10.96.0.105:4567}: linkerd_detect: Detected protocol=None elapsed=3.001404001s
[    38.026294s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[    38.026309s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.0.105:4567
[    38.026697s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:56448 keepalive=Some(10s)
[    38.026741s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[    38.027264s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:54750 target.addr=10.96.0.105:4567}: linkerd_app_core::serve: Connection closed
[    41.011166s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:47324 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=118
[    41.011369s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:47324 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[    42.028795s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:56804 target.addr=10.96.0.105:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[    42.028832s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:56804 target.addr=10.96.0.105:4567}: linkerd_detect: Detected protocol=None elapsed=3.002261786s
[    42.029022s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[    42.029039s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.0.105:4567
[    42.030778s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:58150 keepalive=Some(10s)
[   352.706744s] DEBUG ThreadId(01) poll_profile: linkerd_service_profiles::client: profile received: DestinationProfile { fully_qualified_name: "", opaque_protocol: false, routes: [], retry_budget: Some(RetryBudget { retry_ratio: 0.2, min_retries_per_second: 10, ttl: Some(Duration { seconds: 10, nanos: 0 }) }), dst_overrides: [], endpoint: Some(WeightedAddr { addr: Some(TcpAddress { ip: Some(IpAddress { ip: Some(Ipv4(174064242)) }), port: 4567 }), weight: 10000, metric_labels: {"statefulset": "maria-mb-mariadb-galera", "serviceaccount": "default", "pod": "maria-mb-mariadb-galera-0", "namespace": "default"}, tls_identity: None, protocol_hint: None, authority_override: None }) }
[   352.719642s] DEBUG ThreadId(01) dst: linkerd_dns: resolve_a name=linkerd-dst-headless.linkerd.svc.cluster.local
[   352.741682s] DEBUG ThreadId(01) dst: linkerd_proxy_dns_resolve: addrs=[10.96.2.99:8086]
[   355.704747s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[   355.704794s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}: linkerd_detect: Detected protocol=None elapsed=2.997837612s
[   355.704811s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}: linkerd_cache: Caching new service target=(None, Logical { orig_dst: 10.96.2.114:4567, protocol: (), profile: Some(..) })
[   355.704958s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[   355.704971s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.2.114:4567
[   355.706766s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.79:54284 keepalive=Some(10s)
[   355.706801s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[   355.707402s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}: linkerd_app_core::serve: Connection closed
[   355.707431s] DEBUG ThreadId(01) evict{target=(None, Logical { orig_dst: 10.96.2.114:4567, protocol: (), profile: Some(..) })}: linkerd_cache: Awaiting idleness
[   355.707456s] DEBUG ThreadId(01) evict{target=Accept { orig_dst: 10.96.2.114:4567, protocol: () }}: linkerd_cache: Awaiting idleness
[   356.215757s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:38092 target.addr=10.96.1.79:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=118
[   356.215986s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:38092 target.addr=10.96.1.79:4191}: linkerd_app_core::serve: Connection closed

If I skip port 4567 and keep the rest opaque:

        config.linkerd.io/opaque-ports: 3306,4444,4568
        config.linkerd.io/proxy-log-level: linkerd=debug,warn
        config.linkerd.io/skip-inbound-ports: "4567"
        config.linkerd.io/skip-outbound-ports: "4567"

I get this from linkerd-proxy:

[   358.027677s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:34574 keepalive=Some(10s)
[   358.027734s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[   358.028158s]  INFO ThreadId(01) outbound:accept{peer.addr=10.96.1.77:47134 target.addr=10.96.2.114:4567}: linkerd_app_core::serve: Connection closed error=server: Transport endpoint is not connected (os error 107)
[   361.011275s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:48768 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=118
[   361.011473s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:48768 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[   362.027728s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:47702 target.addr=10.96.2.114:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[   362.027767s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:47702 target.addr=10.96.2.114:4567}: linkerd_detect: Detected protocol=None elapsed=3.000316309s
[   362.027837s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[   362.027858s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.2.114:4567
[   362.028277s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:37212 keepalive=Some(10s)
[   362.028301s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[   362.028867s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:47702 target.addr=10.96.2.114:4567}: linkerd_app_core::serve: Connection closed
[   366.028201s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:48944 target.addr=10.96.2.114:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[   366.028238s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:48944 target.addr=10.96.2.114:4567}: linkerd_detect: Detected protocol=None elapsed=3.000246505s
[   366.028309s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[   366.028339s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.2.114:4567
[   366.028774s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:38888 keepalive=Some(10s)
[   366.028804s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[   366.029179s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:48944 target.addr=10.96.2.114:4567}: linkerd_app_core::serve: Connection closed
[   367.209485s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:58334 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=117
[   367.209661s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:58334 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[   368.032129s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:53830 target.addr=10.96.2.114:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[   368.032171s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:53830 target.addr=10.96.2.114:4567}: linkerd_detect: Detected protocol=None elapsed=503.559811ms
[   368.032242s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[   368.032260s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.2.114:4567
[   368.032759s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:39062 keepalive=Some(10s)
[   368.032789s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[   368.033258s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:53830 target.addr=10.96.2.114:4567}: linkerd_app_core::serve: Connection closed
[   371.011231s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:48878 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=118
[   371.011416s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:48878 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[   373.034462s] DEBUG ThreadId(01) evict{target=(None, Logical { orig_dst: 10.96.2.114:4567, protocol: (), profile: Some(..) })}: linkerd_cache: Cache entry dropped
[   373.034555s] DEBUG ThreadId(01) evict{target=Accept { orig_dst: 10.96.2.114:4567, protocol: () }}: linkerd_cache: Cache entry dropped
[   377.209477s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:51376 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=117
[   377.209679s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:51376 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[   381.011413s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:52776 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=118
[   381.012072s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:52776 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[   387.209370s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:40766 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=117
[   387.209625s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:40766 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[   391.011323s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:40774 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=118
[   391.011661s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:40774 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed

MariaDB log

2021-02-01 23:49:01 0 [Note] WSREP: (0f041dd0-a4ae, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2021-02-01 23:49:01 0 [Note] WSREP: (0f041dd0-a4ae, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2021-02-01 23:49:01 0 [Note] WSREP: EVS version 1
2021-02-01 23:49:01 0 [Note] WSREP: gcomm: connecting to group 'galera', peer 'maria-mb-mariadb-galera-headless.default.svc.cluster.local:'
2021-02-01 23:49:04 0 [Note] WSREP: (0f041dd0-a4ae, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.96.0.105:4567 timed out, no messages seen in PT3S, socket stats: rtt: 31 rttvar: 15 rto: 200000 lost: 0 last_data_recv: 3001 cwnd: 10 last_queued_since: 3000339972 last_delivered_since: 3000339972 send_queue_length: 0 send_queue_bytes: 0
2021-02-01 23:49:04 0 [Note] WSREP: EVS version upgrade 0 -> 1
2021-02-01 23:49:04 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2021-02-01 23:49:04 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2021-02-01 23:49:04 0 [Note] WSREP: view(view_id(NON_PRIM,0f041dd0-a4ae,1) memb {
    0f041dd0-a4ae,0
} joined {
} left {
} partitioned {
})
2021-02-01 23:49:04 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50149S), skipping check
2021-02-01 23:49:08 0 [Note] WSREP: (0f041dd0-a4ae, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.96.0.105:4567 timed out, no messages seen in PT3S, socket stats: rtt: 27 rttvar: 13 rto: 201000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000068129 last_delivered_since: 3000068129 send_queue_length: 0 send_queue_bytes: 0
2021-02-01 23:49:12 0 [Note] WSREP: (0f041dd0-a4ae, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.96.0.105:4567 timed out, no messages seen in PT3S, socket stats: rtt: 36 rttvar: 18 rto: 200000 lost: 0 last_data_recv: 3001 cwnd: 10 last_queued_since: 3000115008 last_delivered_since: 3000115008 send_queue_length: 0 send_queue_bytes: 0
00000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000083006 last_delivered_since: 3000083006 send_queue_length: 0 send_queue_bytes: 0
2021-02-01 23:49:33 0 [Note] WSREP: (0f041dd0-a4ae, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.96.0.105:4567 timed out, no messages seen in PT3S, socket stats: rtt: 41 rttvar: 20 rto: 200000 lost: 0 last_data_recv: 3001 cwnd: 10 last_queued_since: 3000936789 last_delivered_since: 3000936789 send_queue_length: 0 send_queue_bytes: 0
2021-02-01 23:49:34 0 [Note] WSREP: PC protocol downgrade 1 -> 0
2021-02-01 23:49:34 0 [Note] WSREP: view((empty))
2021-02-01 23:49:34 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
     at gcomm/src/pc.cpp:connect():160
2021-02-01 23:49:34 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)
2021-02-01 23:49:34 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1632: Failed to open channel 'galera' at 'gcomm://maria-mb-mariadb-galera-headless.default.svc.cluster.local': -110 (Connection timed out)
2021-02-01 23:49:34 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2021-02-01 23:49:34 0 [ERROR] WSREP: wsrep::connect(gcomm://maria-mb-mariadb-galera-headless.default.svc.cluster.local) failed: 7
2021-02-01 23:49:34 0 [ERROR] Aborting

Environment

Possible solution

Anything using the gcomm protocol must be bypassed in the proxy-init iptables rules via the skip-inbound-ports and skip-outbound-ports annotations.
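
A sketch of what that workaround would look like as pod-template annotations, using the port numbers from this report (both gcomm ports skipped, the remaining ports kept opaque):

    annotations:
      config.linkerd.io/opaque-ports: "3306,4444"
      config.linkerd.io/skip-inbound-ports: "4567,4568"
      config.linkerd.io/skip-outbound-ports: "4567,4568"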

adleong commented 3 years ago

Hi @barkardk!

Based on this log line from the Linkerd proxy:

..target.addr=10.96.0.105:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0

it looks like Linkerd is failing outbound protocol detection when sending to port 4567. I notice that this port is missing from your list of opaque ports. Is that intentional?

I also want to confirm exactly what you mean when you say that you configure the cluster with the config.linkerd.io/opaque-ports annotation. That annotation should be set in the pod spec template of your workload (such as a deployment) or on a namespace. If set on a namespace, it will only apply to pods created in that namespace after the annotation was added.
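
As a sketch (placeholder names), on a StatefulSet that means the annotation lives under spec.template.metadata, not on the top-level object metadata:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: maria-mb-mariadb-galera
    spec:
      template:
        metadata:
          annotations:
            config.linkerd.io/opaque-ports: "3306,4444,4567,4568"

or, namespace-wide:

    kubectl annotate namespace default config.linkerd.io/opaque-ports="3306,4444,4567,4568"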

barkardk commented 3 years ago

Hi @adleong, looks like I made a typo when I created the issue; 4567 was absolutely part of my original setting. I have edited the issue and changed it to the correct port.

adleong commented 3 years ago

Thanks @barkardk! This is surprising because if 4567 is configured as an opaque port, the proxy should not be attempting protocol detection. In my previous comment I asked about exactly where you have the config.linkerd.io/opaque-ports annotation defined. Can you clarify?

barkardk commented 3 years ago

@adleong I have it set in the statefulset pod spec, where I would put the skip-inbound and skip-outbound port settings. For example, the 3306 port is marked opaque using this setting and works perfectly. Port 4444 actually does not need any specific configuration; it's an rsync protocol there and works out of the box. It's only the 2 gcomm ports, 4567 and 4568, that do not work unless I set them to skip the proxy specifically.

olix0r commented 3 years ago

@barkardk are you able to share your manifests (redacted if necessary)?

In your initial report you show a log line like:

[   352.706744s] DEBUG ThreadId(01) poll_profile: linkerd_service_profiles::client: profile received: DestinationProfile { fully_qualified_name: "", opaque_protocol: false, routes: [], retry_budget: Some(RetryBudget { retry_ratio: 0.2, min_retries_per_second: 10, ttl: Some(Duration { seconds: 10, nanos: 0 }) }), dst_overrides: [], endpoint: Some(WeightedAddr { addr: Some(TcpAddress { ip: Some(IpAddress { ip: Some(Ipv4(174064242)) }), port: 4567 }), weight: 10000, metric_labels: {"statefulset": "maria-mb-mariadb-galera", "serviceaccount": "default", "pod": "maria-mb-mariadb-galera-0", "namespace": "default"}, tls_identity: None, protocol_hint: None, authority_override: None }) }

What's notable here is that we see opaque_protocol: false... does the maria-mb-mariadb-galera-0 pod have an opaque-ports annotation set?
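
A quick way to check, sketched with the pod name from the metric labels above, is to read the annotations straight off the running pod:

    kubectl get pod maria-mb-mariadb-galera-0 -n default -o jsonpath='{.metadata.annotations}'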

barkardk commented 3 years ago

@olix0r I must have made a copy/paste error there, as I am sometimes testing more than one cluster at a time. Apologies for the confusion. I have now double-checked, and these are my annotations in the mariadb statefulset. If I don't set the inbound and outbound ports to "" then the iptables rules will set the 3306 port to be bypassed by default.

annotations:
        config.linkerd.io/opaque-ports: 3306,4444,4568,4567
        config.linkerd.io/proxy-log-level: linkerd=debug,warn
        config.linkerd.io/skip-inbound-ports: ""
        config.linkerd.io/skip-outbound-ports: ""
        linkerd.io/inject: enabled

These are the galera logs

mysql-2 mysql 2021-02-24 19:33:24 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
mysql-2 mysql 2021-02-24 19:33:24 0 [Note] WSREP: view(view_id(NON_PRIM,27691bf0,1) memb {
mysql-2 mysql   27691bf0,0
mysql-2 mysql } joined {
mysql-2 mysql } left {
mysql-2 mysql } partitioned {
mysql-2 mysql })
mysql-2 mysql 2021-02-24 19:33:25 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50161S), skipping check
mysql-2 mysql 2021-02-24 19:33:29 0 [Note] WSREP: (27691bf0, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.0.100.69:4567 timed out, no messages seen in PT3S, socket stats: rtt: 55 rttvar: 27 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000127870 last_delivered_since: 3000127870 send_queue_length: 0 send_queue_bytes: 0
mysql-2 mysql 2021-02-24 19:33:32 0 [Note] WSREP: (27691bf0, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.0.101.38:4567 timed out, no messages seen in PT3S, socket stats: rtt: 40 rttvar: 20 rto: 200000 lost: 0 last_data_recv: 3500 cwnd: 10 last_queued_since: 3499895422 last_delivered_since: 3499895422 send_queue_length: 0 send_queue_bytes: 0
mysql-2 mysql 2021-02-24 19:33:36 0 [Note] WSREP: (27691bf0, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.0.100.69:4567 timed out, no messages seen in PT3S, socket stats: rtt: 42 rttvar: 21 rto: 200000 lost: 0 last_data_recv: 3500 cwnd: 10 last_queued_since: 3499978813 last_delivered_since: 3499978813 send_queue_length: 0 send_queue_bytes: 0
mysql-2 mysql 2021-02-24 19:33:39 0 [Note] WSREP: (27691bf0, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.0.101.38:4567 timed out, no messages seen in PT3S, socket stats: rtt: 43 rttvar: 21 rto: 200000 lost: 0 last_data_recv: 3500 cwnd: 10 last_queued_since: 3499958234 last_delivered_since: 3499958234 send_queue_length: 0 send_queue_bytes: 0
mysql-2 mysql 2021-02-24 19:33:42 0 [Note] WSREP: (27691bf0, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.0.100.69:4567 timed out, no messages seen in PT3S, socket stats: rtt: 33 rttvar: 16 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000143820 last_delivered_since: 3000143820 send_queue_length: 0 send_queue_bytes: 0
mysql-2 mysql 2021-02-24 19:33:45 0 [Note] WSREP: (27691bf0, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.0.101.38:4567 timed out, no messages seen in PT3S, socket stats: rtt: 36 rttvar: 18 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000123190 last_delivered_since: 3000123190 send_queue_length: 0 send_queue_bytes: 0
mysql-2 mysql 2021-02-24 19:33:49 0 [Note] WSREP: (27691bf0, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.0.100.69:4567 timed out, no messages seen in PT3S, socket stats: rtt: 64 rttvar: 32 rto: 200000 lost: 0 last_data_recv: 3500 cwnd: 10 last_queued_since: 3499895122 last_delivered_since: 3499895122 send_queue_length: 0 send_queue_bytes: 0
mysql-2 mysql 2021-02-24 19:33:52 0 [Note] WSREP: (27691bf0, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.0.101.38:4567 timed out, no messages seen in PT3S, socket stats: rtt: 44 rttvar: 22 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000030454 last_delivered_since: 3000030454 send_queue_length: 0 send_queue_bytes: 0
mysql-2 mysql 2021-02-24 19:33:54 0 [Note] WSREP: view((empty))
mysql-2 mysql 2021-02-24 19:33:54 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
mysql-2 mysql    at gcomm/src/pc.cpp:connect():160
mysql-2 mysql 2021-02-24 19:33:54 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
mysql-2 mysql 2021-02-24 19:33:54 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1474: Failed to open channel 'mysql' at 'gcomm://mysql-0.mysql.default.svc.cluster.local,mysql-1.mysql.default.svc.cluster.local': -110 (Connection timed out)
mysql-2 mysql 2021-02-24 19:33:54 0 [ERROR] WSREP: gcs connect failed: Connection timed out
mysql-2 mysql 2021-02-24 19:33:54 0 [ERROR] WSREP: wsrep::connect(gcomm://mysql-0.mysql.default.svc.cluster.local,mysql-1.mysql.default.svc.cluster.local) failed: 7
mysql-2 mysql 2021-02-24 19:33:54 0 [ERROR] Aborting
mysql-2 mysql

and the proxy logs

mysql-2 linkerd-proxy [   216.635829s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39250, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Cache entry dropped
mysql-2 linkerd-proxy [   218.676064s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39472 target.addr=10.0.100.244:9104}: linkerd_tls::server: Peeked bytes from TCP stream sz=115
mysql-2 linkerd-proxy [   218.676111s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39472 target.addr=10.0.100.244:9104}: linkerd_app_inbound::require_identity_for_ports: port=9104 peer.id=None(NoTlsFromRemote) id_required=false
mysql-2 linkerd-proxy [   218.676148s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39472 target.addr=10.0.100.244:9104}: linkerd_detect: Detected protocol=Some(Http1) elapsed=9.071µs
mysql-2 linkerd-proxy [   218.676158s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39472 target.addr=10.0.100.244:9104}: linkerd_cache: Caching new service target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39472, client_id: None(NoTlsFromRemote) })
mysql-2 linkerd-proxy [   218.676167s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39472 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: Creating HTTP service version=Http1
mysql-2 linkerd-proxy [   218.676197s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39472 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: Handling as HTTP version=Http1
mysql-2 linkerd-proxy [   218.676308s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:37592 target.addr=10.0.100.244:9104}:http{v=1.x}:http1:profile:http1: linkerd_proxy_http::client: method=GET uri=http://10.0.100.244:9104/ version=HTTP/1.1
mysql-2 linkerd-proxy [   218.676321s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:37592 target.addr=10.0.100.244:9104}:http{v=1.x}:http1:profile:http1: linkerd_proxy_http::client: headers={"host": "10.0.100.244:9104", "user-agent": "kube-probe/1.16+", "accept-encoding": "gzip"}
mysql-2 linkerd-proxy [   218.677032s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39472 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
mysql-2 linkerd-proxy [   218.677127s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39472 target.addr=10.0.100.244:9104}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   218.677149s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39472, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Awaiting idleness
mysql-2 linkerd-proxy [   218.678255s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39272, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Cache entry dropped
mysql-2 linkerd-proxy [   223.324698s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.0.100.70:33918 target.addr=10.0.100.244:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=119
mysql-2 linkerd-proxy [   223.324891s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.0.100.70:33918 target.addr=10.0.100.244:4191}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   226.582815s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.0.100.70:33944 target.addr=10.0.100.244:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=120
mysql-2 linkerd-proxy [   226.583175s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.0.100.70:33944 target.addr=10.0.100.244:4191}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   226.633841s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39534 target.addr=10.0.100.244:9104}: linkerd_tls::server: Peeked bytes from TCP stream sz=115
mysql-2 linkerd-proxy [   226.633880s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39534 target.addr=10.0.100.244:9104}: linkerd_app_inbound::require_identity_for_ports: port=9104 peer.id=None(NoTlsFromRemote) id_required=false
mysql-2 linkerd-proxy [   226.633916s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39534 target.addr=10.0.100.244:9104}: linkerd_detect: Detected protocol=Some(Http1) elapsed=9.12µs
mysql-2 linkerd-proxy [   226.633926s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39534 target.addr=10.0.100.244:9104}: linkerd_cache: Caching new service target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39534, client_id: None(NoTlsFromRemote) })
mysql-2 linkerd-proxy [   226.633936s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39534 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: Creating HTTP service version=Http1
mysql-2 linkerd-proxy [   226.633966s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39534 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: Handling as HTTP version=Http1
mysql-2 linkerd-proxy [   226.634096s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:37592 target.addr=10.0.100.244:9104}:http{v=1.x}:http1:profile:http1: linkerd_proxy_http::client: method=GET uri=http://10.0.100.244:9104/ version=HTTP/1.1
mysql-2 linkerd-proxy [   226.634109s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:37592 target.addr=10.0.100.244:9104}:http{v=1.x}:http1:profile:http1: linkerd_proxy_http::client: headers={"host": "10.0.100.244:9104", "user-agent": "kube-probe/1.16+", "accept-encoding": "gzip"}
mysql-2 linkerd-proxy [   226.634779s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39534 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
mysql-2 linkerd-proxy [   226.634936s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39534 target.addr=10.0.100.244:9104}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   226.634962s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39534, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Awaiting idleness
mysql-2 linkerd-proxy [   226.636098s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39348, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Cache entry dropped
mysql-2 linkerd-proxy [   228.676041s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39560 target.addr=10.0.100.244:9104}: linkerd_tls::server: Peeked bytes from TCP stream sz=115
mysql-2 linkerd-proxy [   228.676069s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39560 target.addr=10.0.100.244:9104}: linkerd_app_inbound::require_identity_for_ports: port=9104 peer.id=None(NoTlsFromRemote) id_required=false
mysql-2 linkerd-proxy [   228.676108s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39560 target.addr=10.0.100.244:9104}: linkerd_detect: Detected protocol=Some(Http1) elapsed=10.88µs
mysql-2 linkerd-proxy [   228.676120s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39560 target.addr=10.0.100.244:9104}: linkerd_cache: Caching new service target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39560, client_id: None(NoTlsFromRemote) })
mysql-2 linkerd-proxy [   228.676134s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39560 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: Creating HTTP service version=Http1
mysql-2 linkerd-proxy [   228.676174s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39560 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: Handling as HTTP version=Http1
mysql-2 linkerd-proxy [   228.676316s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:37592 target.addr=10.0.100.244:9104}:http{v=1.x}:http1:profile:http1: linkerd_proxy_http::client: method=GET uri=http://10.0.100.244:9104/ version=HTTP/1.1
mysql-2 linkerd-proxy [   228.676331s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:37592 target.addr=10.0.100.244:9104}:http{v=1.x}:http1:profile:http1: linkerd_proxy_http::client: headers={"host": "10.0.100.244:9104", "user-agent": "kube-probe/1.16+", "accept-encoding": "gzip"}
mysql-2 linkerd-proxy [   228.676875s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39560 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
mysql-2 linkerd-proxy [   228.676939s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39560 target.addr=10.0.100.244:9104}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   228.676961s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39560, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Awaiting idleness
mysql-2 linkerd-proxy [   228.678076s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39364, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Cache entry dropped
mysql-2 linkerd-proxy [   233.324838s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.0.100.70:34006 target.addr=10.0.100.244:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=119
mysql-2 linkerd-proxy [   233.325071s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.0.100.70:34006 target.addr=10.0.100.244:4191}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   236.582808s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.0.100.70:34026 target.addr=10.0.100.244:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=120
mysql-2 linkerd-proxy [   236.583042s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.0.100.70:34026 target.addr=10.0.100.244:4191}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   236.633903s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39616 target.addr=10.0.100.244:9104}: linkerd_tls::server: Peeked bytes from TCP stream sz=115
mysql-2 linkerd-proxy [   236.633940s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39616 target.addr=10.0.100.244:9104}: linkerd_app_inbound::require_identity_for_ports: port=9104 peer.id=None(NoTlsFromRemote) id_required=false
mysql-2 linkerd-proxy [   236.633970s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39616 target.addr=10.0.100.244:9104}: linkerd_detect: Detected protocol=Some(Http1) elapsed=8.559µs
mysql-2 linkerd-proxy [   236.633979s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39616 target.addr=10.0.100.244:9104}: linkerd_cache: Caching new service target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39616, client_id: None(NoTlsFromRemote) })
mysql-2 linkerd-proxy [   236.633989s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39616 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: Creating HTTP service version=Http1
mysql-2 linkerd-proxy [   236.634017s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39616 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: Handling as HTTP version=Http1
mysql-2 linkerd-proxy [   236.634127s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:37592 target.addr=10.0.100.244:9104}:http{v=1.x}:http1:profile:http1: linkerd_proxy_http::client: method=GET uri=http://10.0.100.244:9104/ version=HTTP/1.1
mysql-2 linkerd-proxy [   236.634151s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:37592 target.addr=10.0.100.244:9104}:http{v=1.x}:http1:profile:http1: linkerd_proxy_http::client: headers={"host": "10.0.100.244:9104", "user-agent": "kube-probe/1.16+", "accept-encoding": "gzip"}
mysql-2 linkerd-proxy [   236.634621s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39616 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
mysql-2 linkerd-proxy [   236.634702s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39616 target.addr=10.0.100.244:9104}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   236.634723s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39616, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Awaiting idleness
mysql-2 linkerd-proxy [   236.635830s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39454, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Cache entry dropped
mysql-2 linkerd-proxy [   238.676100s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39656 target.addr=10.0.100.244:9104}: linkerd_tls::server: Peeked bytes from TCP stream sz=115
mysql-2 linkerd-proxy [   238.676141s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39656 target.addr=10.0.100.244:9104}: linkerd_app_inbound::require_identity_for_ports: port=9104 peer.id=None(NoTlsFromRemote) id_required=false
mysql-2 linkerd-proxy [   238.676182s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39656 target.addr=10.0.100.244:9104}: linkerd_detect: Detected protocol=Some(Http1) elapsed=13.081µs
mysql-2 linkerd-proxy [   238.676205s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39656 target.addr=10.0.100.244:9104}: linkerd_cache: Caching new service target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39656, client_id: None(NoTlsFromRemote) })
mysql-2 linkerd-proxy [   238.676219s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39656 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: Creating HTTP service version=Http1
mysql-2 linkerd-proxy [   238.676253s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39656 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: Handling as HTTP version=Http1
mysql-2 linkerd-proxy [   238.676357s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:37592 target.addr=10.0.100.244:9104}:http{v=1.x}:http1:profile:http1: linkerd_proxy_http::client: method=GET uri=http://10.0.100.244:9104/ version=HTTP/1.1
mysql-2 linkerd-proxy [   238.676369s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:37592 target.addr=10.0.100.244:9104}:http{v=1.x}:http1:profile:http1: linkerd_proxy_http::client: headers={"host": "10.0.100.244:9104", "user-agent": "kube-probe/1.16+", "accept-encoding": "gzip"}
mysql-2 linkerd-proxy [   238.676948s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39656 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
mysql-2 linkerd-proxy [   238.677081s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39656 target.addr=10.0.100.244:9104}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   238.677101s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39656, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Awaiting idleness
mysql-2 linkerd-proxy [   238.678316s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39472, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Cache entry dropped
mysql-2 linkerd-proxy [   243.035263s] DEBUG ThreadId(01) outbound: linkerd_cache: Caching new service target=Accept { orig_dst: 10.0.100.69:4567, protocol: () }
mysql-2 linkerd-proxy [   243.035373s] DEBUG ThreadId(01) outbound: linkerd_cache: Caching new service target=Accept { orig_dst: 10.0.101.38:4567, protocol: () }
mysql-2 linkerd-proxy [   243.035486s] DEBUG ThreadId(01) outbound: linkerd_timeout::failfast: TCP Server service has become unavailable
mysql-2 linkerd-proxy [   243.035532s] DEBUG ThreadId(01) outbound: linkerd_timeout::failfast: TCP Server service has become unavailable
mysql-2 linkerd-proxy [   243.035590s] DEBUG ThreadId(01) dst: linkerd_dns: resolve_srv name=linkerd-dst-headless.linkerd.svc.cluster.local
mysql-2 linkerd-proxy [   243.038418s] DEBUG ThreadId(01) poll_profile: linkerd_service_profiles::client: profile received: DestinationProfile { fully_qualified_name: "", opaque_protocol: false, routes: [], retry_budget: Some(RetryBudget { retry_ratio: 0.2, min_retries_per_second: 10, ttl: Some(Duration { seconds: 10, nanos: 0 }) }), dst_overrides: [], endpoint: Some(WeightedAddr { addr: Some(TcpAddress { ip: Some(IpAddress { ip: Some(Ipv4(167798054)) }), port: 4567 }), weight: 10000, metric_labels: {"statefulset": "mysql", "namespace": "default", "serviceaccount": "mysql", "pod": "mysql-1"}, tls_identity: None, protocol_hint: None, authority_override: None }) }
mysql-2 linkerd-proxy [   243.038512s] DEBUG ThreadId(01) poll_profile: linkerd_service_profiles::client: profile received: DestinationProfile { fully_qualified_name: "", opaque_protocol: false, routes: [], retry_budget: Some(RetryBudget { retry_ratio: 0.2, min_retries_per_second: 10, ttl: Some(Duration { seconds: 10, nanos: 0 }) }), dst_overrides: [], endpoint: Some(WeightedAddr { addr: Some(TcpAddress { ip: Some(IpAddress { ip: Some(Ipv4(167797829)) }), port: 4567 }), weight: 10000, metric_labels: {"statefulset": "mysql", "pod": "mysql-0", "serviceaccount": "mysql", "namespace": "default"}, tls_identity: None, protocol_hint: None, authority_override: None }) }
mysql-2 linkerd-proxy [   243.049627s] DEBUG ThreadId(01) dst: linkerd_dns: addrs=[10.0.100.219:8086]
mysql-2 linkerd-proxy [   243.049662s] DEBUG ThreadId(01) dst: linkerd_proxy_dns_resolve: addrs=[10.0.100.219:8086]
mysql-2 linkerd-proxy [   243.324756s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.0.100.70:34108 target.addr=10.0.100.244:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=119
mysql-2 linkerd-proxy [   243.324999s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.0.100.70:34108 target.addr=10.0.100.244:4191}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   246.035946s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:44906 target.addr=10.0.100.69:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
mysql-2 linkerd-proxy [   246.035977s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:44906 target.addr=10.0.100.69:4567}: linkerd_detect: Detected protocol=None elapsed=2.997389447s
mysql-2 linkerd-proxy [   246.035989s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:44906 target.addr=10.0.100.69:4567}: linkerd_cache: Caching new service target=(None, Logical { orig_dst: 10.0.100.69:4567, protocol: (), profile: Some(..) })
mysql-2 linkerd-proxy [   246.036097s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:44906 target.addr=10.0.100.69:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
mysql-2 linkerd-proxy [   246.036124s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:44906 target.addr=10.0.100.69:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.0.100.69:4567
mysql-2 linkerd-proxy [   246.036264s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:49554 target.addr=10.0.101.38:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
mysql-2 linkerd-proxy [   246.036275s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:49554 target.addr=10.0.101.38:4567}: linkerd_detect: Detected protocol=None elapsed=2.997693181s
mysql-2 linkerd-proxy [   246.036282s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:49554 target.addr=10.0.101.38:4567}: linkerd_cache: Caching new service target=(None, Logical { orig_dst: 10.0.101.38:4567, protocol: (), profile: Some(..) })
mysql-2 linkerd-proxy [   246.036336s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:49554 target.addr=10.0.101.38:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
mysql-2 linkerd-proxy [   246.036361s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:49554 target.addr=10.0.101.38:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.0.101.38:4567
mysql-2 linkerd-proxy [   246.036719s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:44906 target.addr=10.0.100.69:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.0.100.244:44928 keepalive=Some(10s)
mysql-2 linkerd-proxy [   246.036735s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:44906 target.addr=10.0.100.69:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
mysql-2 linkerd-proxy [   246.037115s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:49554 target.addr=10.0.101.38:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.0.100.244:49576 keepalive=Some(10s)
mysql-2 linkerd-proxy [   246.037132s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:49554 target.addr=10.0.101.38:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
mysql-2 linkerd-proxy [   246.037486s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:44906 target.addr=10.0.100.69:4567}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   246.037528s] DEBUG ThreadId(01) evict{target=(None, Logical { orig_dst: 10.0.100.69:4567, protocol: (), profile: Some(..) })}: linkerd_cache: Awaiting idleness
mysql-2 linkerd-proxy [   246.037539s] DEBUG ThreadId(01) evict{target=Accept { orig_dst: 10.0.100.69:4567, protocol: () }}: linkerd_cache: Awaiting idleness
mysql-2 linkerd-proxy [   246.038266s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.0.100.244:49554 target.addr=10.0.101.38:4567}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   246.038291s] DEBUG ThreadId(01) evict{target=(None, Logical { orig_dst: 10.0.101.38:4567, protocol: (), profile: Some(..) })}: linkerd_cache: Awaiting idleness
mysql-2 linkerd-proxy [   246.038298s] DEBUG ThreadId(01) evict{target=Accept { orig_dst: 10.0.101.38:4567, protocol: () }}: linkerd_cache: Awaiting idleness
mysql-2 linkerd-proxy [   246.582927s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.0.100.70:34130 target.addr=10.0.100.244:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=120
mysql-2 linkerd-proxy [   246.583173s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.0.100.70:34130 target.addr=10.0.100.244:4191}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   246.634372s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39720 target.addr=10.0.100.244:9104}: linkerd_tls::server: Peeked bytes from TCP stream sz=115
mysql-2 linkerd-proxy [   246.634413s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39720 target.addr=10.0.100.244:9104}: linkerd_app_inbound::require_identity_for_ports: port=9104 peer.id=None(NoTlsFromRemote) id_required=false
mysql-2 linkerd-proxy [   246.634453s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39720 target.addr=10.0.100.244:9104}: linkerd_detect: Detected protocol=Some(Http1) elapsed=11.43µs
mysql-2 linkerd-proxy [   246.634464s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39720 target.addr=10.0.100.244:9104}: linkerd_cache: Caching new service target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39720, client_id: None(NoTlsFromRemote) })
mysql-2 linkerd-proxy [   246.634478s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39720 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: Creating HTTP service version=Http1
mysql-2 linkerd-proxy [   246.634518s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39720 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: Handling as HTTP version=Http1
mysql-2 linkerd-proxy [   246.634671s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:37592 target.addr=10.0.100.244:9104}:http{v=1.x}:http1:profile:http1: linkerd_proxy_http::client: method=GET uri=http://10.0.100.244:9104/ version=HTTP/1.1
mysql-2 linkerd-proxy [   246.634686s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:37592 target.addr=10.0.100.244:9104}:http{v=1.x}:http1:profile:http1: linkerd_proxy_http::client: headers={"host": "10.0.100.244:9104", "user-agent": "kube-probe/1.16+", "accept-encoding": "gzip"}
mysql-2 linkerd-proxy [   246.635494s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39720 target.addr=10.0.100.244:9104}: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
mysql-2 linkerd-proxy [   246.635585s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.0.100.70:39720 target.addr=10.0.100.244:9104}: linkerd_app_core::serve: Connection closed
mysql-2 linkerd-proxy [   246.635606s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39720, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Awaiting idleness
mysql-2 linkerd-proxy [   246.636807s] DEBUG ThreadId(01) evict{target=(Some(Http1), TcpAccept { target_addr: 10.0.100.244:9104, client_addr: 10.0.100.70:39534, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Cache entry dropped

The iptables rules:

-A PREROUTING -m comment --comment "proxy-init/install-proxy-init-prerouting/1614195038" -j PROXY_INIT_REDIRECT
-A OUTPUT -m comment --comment "proxy-init/install-proxy-init-output/1614195038" -j PROXY_INIT_OUTPUT
-A PROXY_INIT_OUTPUT ! -d 127.0.0.1/32 -o lo -m owner --uid-owner 2102 -m comment --comment "proxy-init/redirect-non-loopback-local-traffic/1614195038" -j PROXY_INIT_REDIRECT
-A PROXY_INIT_OUTPUT -m owner --uid-owner 2102 -m comment --comment "proxy-init/ignore-proxy-user-id/1614195038" -j RETURN
-A PROXY_INIT_OUTPUT -o lo -m comment --comment "proxy-init/ignore-loopback/1614195038" -j RETURN
-A PROXY_INIT_OUTPUT -p tcp -m comment --comment "proxy-init/redirect-all-outgoing-to-proxy-port/1614195038" -j REDIRECT --to-ports 4140
-A PROXY_INIT_REDIRECT -p tcp -m multiport --dports 4190,4191 -m comment --comment "proxy-init/ignore-port-4190,4191/1614195038" -j RETURN
-A PROXY_INIT_REDIRECT -p tcp -m comment --comment "proxy-init/redirect-all-incoming-to-proxy-port/1614195038" -j REDIRECT --to-ports 4143
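
For context, the rules above redirect all inbound TCP to the proxy's inbound port 4143 and all outbound TCP to its outbound port 4140; a port only bypasses the proxy if proxy-init adds RETURN rules for it from the skip-inbound/skip-outbound options. A sketch (comments omitted) of what such rules would look like if port 4567 were skipped:

    -A PROXY_INIT_REDIRECT -p tcp -m multiport --dports 4567 -j RETURN
    -A PROXY_INIT_OUTPUT -p tcp -m multiport --dports 4567 -j RETURN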
barkardk commented 3 years ago

My steps are as follows:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install mysql bitnami/mariadb-galera
linkerd-edge-21.1.3 install | kubectl apply -f -
k get statefulset <mygaleracluster>  -o yaml | linkerd-edge-21.1.3 inject - | kubectl apply -f -

and then I manually edit the annotations and add this:

annotations:
        config.linkerd.io/opaque-ports: 3306,4444,4568,4567
        config.linkerd.io/proxy-log-level: linkerd=debug,warn
        config.linkerd.io/skip-inbound-ports: ""
        config.linkerd.io/skip-outbound-ports: ""
        linkerd.io/inject: enabled

and then I always verify that the iptables rules look correct via k logs mysql-2 -c linkerd-init
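
An equivalent way to add the same pod-template annotations without hand-editing the manifest, sketched with the statefulset name used in these logs, would be a merge patch (the statefulset then rolls its pods to pick up the new template):

    kubectl patch statefulset mysql --type merge -p \
      '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/opaque-ports":"3306,4444,4568,4567","linkerd.io/inject":"enabled"}}}}}'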

barkardk commented 3 years ago

Also, with linkerd-edge-21.2.3 the MariaDB logs are pretty much identical, with the exact same timeouts. The linkerd-proxy logs:

time="2021-02-24T19:45:24Z" level=info msg="running version edge-21.2.3"
[     0.002456s] DEBUG ThreadId(01) linkerd_app::env: Adding 4143 to LINKERD2_PROXY_INBOUND_PORTS_REQUIRE_IDENTITY
[     0.002503s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.002586s] DEBUG ThreadId(01) linkerd_app: building app
[     0.003008s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.003031s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.003035s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.003037s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.003040s]  INFO ThreadId(01) linkerd2_proxy: Local identity is mysql.default.serviceaccount.identity.linkerd.cluster.local
[     0.003043s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.003046s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.003049s] DEBUG ThreadId(01) linkerd_app: spawning daemon thread
[     0.004202s] DEBUG ThreadId(02) daemon: linkerd_app: running admin thread
[     0.004268s] DEBUG ThreadId(02) identity:identity_daemon{peer.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_proxy_identity::certify: Identity daemon running
[     0.004460s] DEBUG ThreadId(01) identity:identity_daemon{peer.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_dns: resolve_srv name=linkerd-identity-headless.linkerd.svc.cluster.local.
[     0.005197s] DEBUG ThreadId(01) identity:identity_daemon{peer.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_dns: addrs=[10.0.100.118:8080]
[     0.005256s] DEBUG ThreadId(01) identity:identity_daemon{peer.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_tls::client: Initiating TLS connection server.id=linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.005264s] DEBUG ThreadId(01) identity:identity_daemon{peer.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_proxy_transport::connect: Connecting server.addr=10.0.100.118:8080
[     0.006516s] DEBUG ThreadId(01) identity:identity_daemon{peer.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:h2: linkerd_proxy_transport::connect: Connected local.addr=10.0.100.235:38572 keepalive=Some(10s)
[     0.020022s] DEBUG ThreadId(02) identity:identity_daemon{peer.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_identity: certified mysql.default.serviceaccount.identity.linkerd.cluster.local
[     0.020051s] DEBUG ThreadId(02) identity:identity_daemon{peer.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_proxy_identity::certify: daemon certified until SystemTime { tv_sec: 1614282344, tv_nsec: 0 }
[     0.020066s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: mysql.default.serviceaccount.identity.linkerd.cluster.local
[     0.470816s] DEBUG ThreadId(01) outbound: linkerd_cache: Caching new service target=Accept { orig_dst: 10.0.100.69:4567, protocol: () }
[     0.471015s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56514 target.addr=10.0.100.69:4567}: linkerd_timeout::failfast: TCP Server service has become unavailable
[     0.471046s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56514 target.addr=10.0.100.69:4567}:poll_profile: linkerd_dns: resolve_srv name=linkerd-dst-headless.linkerd.svc.cluster.local.
[     0.472442s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56514 target.addr=10.0.100.69:4567}:poll_profile: linkerd_dns: addrs=[10.0.100.130:8086]

[     9.973497s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56560 target.addr=10.0.100.69:4567}: linkerd_app_core::serve: Connection closed
[    10.007735s] DEBUG ThreadId(01) identity:identity_daemon{peer.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_dns: resolve_srv name=linkerd-identity-headless.linkerd.svc.cluster.local.
[    10.009058s] DEBUG ThreadId(01) identity:identity_daemon{peer.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_dns: addrs=[10.0.100.118:8080]
[    10.009084s] DEBUG ThreadId(01) identity:identity_daemon{peer.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_proxy_dns_resolve: addrs=[10.0.100.118:8080]
[    10.475571s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56514 target.addr=10.0.100.69:4567}:poll_profile: linkerd_dns: resolve_srv name=linkerd-dst-headless.linkerd.svc.cluster.local.
[    10.477003s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56514 target.addr=10.0.100.69:4567}:poll_profile: linkerd_dns: addrs=[10.0.100.130:8086]
[    10.477031s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56514 target.addr=10.0.100.69:4567}:poll_profile: linkerd_proxy_dns_resolve: addrs=[10.0.100.130:8086]
[    13.472339s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:53942 target.addr=10.0.101.38:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[    13.472382s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:53942 target.addr=10.0.101.38:4567}: linkerd_detect: Detected protocol=None elapsed=3.499967622s
[    13.472446s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:53942 target.addr=10.0.101.38:4567}:tcp:tcp.forward{server.addr=10.0.101.38:4567}: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[    20.475342s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:54022 target.addr=10.0.101.38:4567}: linkerd_app_core::serve: Connection closed
[    23.974090s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56708 target.addr=10.0.100.69:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[    23.974169s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56708 target.addr=10.0.100.69:4567}: linkerd_detect: Detected protocol=None elapsed=3.500641556s
[    23.974234s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56708 target.addr=10.0.100.69:4567}:tcp:tcp.forward{server.addr=10.0.100.69:4567}: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[    23.974252s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56708 target.addr=10.0.100.69:4567}:tcp:tcp.forward{server.addr=10.0.100.69:4567}: linkerd_proxy_transport::connect: Connecting server.addr=10.0.100.69:4567
[    23.974824s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56708 target.addr=10.0.100.69:4567}:tcp:tcp.forward{server.addr=10.0.100.69:4567}: linkerd_proxy_transport::connect: Connected local.addr=10.0.100.235:56742 keepalive=Some(10s)
[    23.974863s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56708 target.addr=10.0.100.69:4567}:tcp:tcp.forward{server.addr=10.0.100.69:4567}: linkerd_proxy_transport::metrics: client connection open
[    23.975804s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.235:56708 target.addr=10.0.100.69:4567}: linkerd_app_core::serve: Connection closed
[    25.005842s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.0.100.70:35788 target.addr=10.0.100.235:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=119
[    25.005899s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.0.100.70:35788 target.addr=10.0.100.235:4191}: linkerd_detect: Detected protocol=Some(Http1) elapsed=8.72µs
[    25.005912s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.0.100.70:35788 target.addr=10.0.100.235:4191}: linkerd_proxy_http::server: Creating HTTP service version=Http1
[    25.005928s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.0.100.70:35788 target.addr=10.0.100.235:4191}: linkerd_proxy_http::server: Handling as HTTP version=Http1
[    25.006101s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.0.100.70:35788 target.addr=10.0.100.235:4191}: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
[    25.006179s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.0.100.70:35788 target.addr=10.0.100.235:4191}: linkerd_app_core::serve: Connection closed
olix0r commented 3 years ago

@barkardk I was able to get things working, though I took a slightly different approach. It would be good to confirm that my approach works for you and then we should figure out how they differ.

# Using last week's edge release:
:; linkerd version
Client version: edge-21.2.3
Server version: edge-21.2.3

# First save the YAML output to be edited
:; helm template mysql bitnami/mariadb-galera >mysql.yaml && cp mysql.yaml{,.orig}

# Edit the YAML to have the correct annotations
:; diff -u mysql.yaml.orig mysql.yaml
--- mysql.yaml.orig 2021-02-24 22:06:40.275459745 +0000
+++ mysql.yaml  2021-02-24 22:08:05.645092833 +0000
@@ -227,6 +227,12 @@
         helm.sh/chart: mariadb-galera-5.6.2
         app.kubernetes.io/instance: mysql
         app.kubernetes.io/managed-by: Helm
+      annotations:
+        linkerd.io/inject: enabled
+        config.linkerd.io/opaque-ports: 3306,4444,4567,4568
+        config.linkerd.io/proxy-log-level: linkerd=debug,warn
+        config.linkerd.io/skip-inbound-ports: ""
+        config.linkerd.io/skip-outbound-ports: ""
     spec:

       serviceAccountName: default

# Apply the edited YAML
:; k apply -f mysql.yaml

# Wait 60-120s for everything to come up:
:; kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
mysql-mariadb-galera-0   2/2     Running   0          5m57s
mysql-mariadb-galera-1   2/2     Running   0          5m19s
mysql-mariadb-galera-2   2/2     Running   0          4m37s

Spot-checking the proxy logs, we see opaque_protocol: true in the discovery response:

:; k logs po/mysql-mariadb-galera-1  -c linkerd-proxy |head -n 100 | grep opaque_protocol
[     0.919824s] DEBUG ThreadId(01) outbound:accept{client.addr=10.42.0.124:51294 target.addr=10.42.0.122:4567}:poll_profile: linkerd_service_profiles::client: profile received: DestinationProfile { fully_qualified_name: "", opaque_protocol: true, routes: [], retry_budget: Some(RetryBudget { retry_ratio: 0.2, min_retries_per_second: 10, ttl: Some(Duration { seconds: 10, nanos: 0 }) }), dst_overrides: [], endpoint: Some(WeightedAddr { addr: Some(TcpAddress { ip: Some(IpAddress { ip: Some(Ipv4(170524794)) }), port: 4567 }), weight: 10000, metric_labels: {"pod": "mysql-mariadb-galera-0", "serviceaccount": "default", "control_plane_ns": "linkerd", "namespace": "default", "statefulset": "mysql-mariadb-galera"}, tls_identity: Some(TlsIdentity { strategy: Some(DnsLikeIdentity(DnsLikeIdentity { name: "default.default.serviceaccount.identity.linkerd.cluster.local" })) }), protocol_hint: Some(ProtocolHint { opaque_transport: Some(OpaqueTransport { inbound_port: 4143 }), protocol: Some(H2(H2)) }), authority_override: None }) }

I did it this way (saving and editing the YAML) because I ran into errors when trying to restart the mysql instances as they kept failing to start due to the data volume already existing... or something.

barkardk commented 3 years ago

@olix0r
If I run the steps you suggested using the latest edge release, I get the same results:

mariadbx-mariadb-galera-2 linkerd-proxy [     0.973298s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.100.69:38648 target.addr=10.0.100.31:4567}:poll_profile: linkerd_service_profiles::client: profile received: DestinationProfile { fully_qualified_name: "", opaque_protocol: true, routes: [], retry_budget: Some(RetryBudget { retry_ratio: 0.2, min_retries_per_second: 10, ttl: Some(Duration { seconds: 10, nanos: 0 }) }), dst_overrides: [], endpoint: Some(WeightedAddr { addr: Some(TcpAddress { ip: Some(IpAddress { ip: Some(Ipv4(167797791)) }), port: 4567 }), weight: 10000, metric_labels: {"serviceaccount": "default", "namespace": "default", "control_plane_ns": "linkerd", "pod": "mariadbx-mariadb-galera-0", "statefulset": "mariadbx-mariadb-galera"}, tls_identity: Some(TlsIdentity { strategy: Some(DnsLikeIdentity(DnsLikeIdentity { name: "default.default.serviceaccount.identity.linkerd.cluster.local" })) }), protocol_hint: Some(ProtocolHint { opaque_transport: Some(OpaqueTransport { inbound_port: 4143 }), protocol: Some(H2(H2)) }), authority_override: None }) }

All three pods looked successfully injected, and I was able to query the cluster after injection:

kubectl exec -it mysql-client -- mysql -h mariadbx-mariadb-galera -uroot -p<password> -e 'SHOW databases;' -c mysql
+--------------------+
| Database           |
+--------------------+
| information_schema |
| my_database        |
| mysql              |
| performance_schema |
| test               |
+--------------------+

I then ran a restart (no pv issues)

k rollout restart statefulset mariadbx-mariadb-galera

and had no issues connecting to the cluster after the pod restarts.
linkerd edges look like this:

linkerd-edge-21.2.3 viz edges po
SRC                                  DST                         SRC_NS        DST_NS    SECURED
mariadbx-mariadb-galera-0            mariadbx-mariadb-galera-1   default       default   √
mariadbx-mariadb-galera-0            mariadbx-mariadb-galera-2   default       default   √
mariadbx-mariadb-galera-1            mariadbx-mariadb-galera-0   default       default   √
mariadbx-mariadb-galera-2            mariadbx-mariadb-galera-0   default       default   √
mariadbx-mariadb-galera-2            mariadbx-mariadb-galera-1   default       default   √
linkerd-prometheus-dc746f9cb-rl89d   mariadbx-mariadb-galera-0   linkerd-viz   default   √
linkerd-prometheus-dc746f9cb-rl89d   mariadbx-mariadb-galera-1   linkerd-viz   default   √
linkerd-prometheus-dc746f9cb-rl89d   mariadbx-mariadb-galera-2   linkerd-viz   default   √

So this is great news for fresh galera installs.

olix0r commented 3 years ago

@barkardk great! I'm going to close this for now but I'm happy to reopen it if you run into any more issues like this. Thanks for the helpful report.

barkardk commented 3 years ago

@olix0r I don't agree with this issue being closed. We cannot delete galera clusters and set them up from scratch just to inject linkerd. This solution only works for brand-new clusters; already-running ones will still have the connection issue.

olix0r commented 3 years ago

@barkardk Are you referring to clusters running prior edge releases? Or do you have concerns about the behavior once stable-2.10 is released? We can't do much to support older edge releases -- they're experimental by nature. But, if you have concerns about how things will work with stable-2.10, can you elaborate? Is this about upgrading unmeshed Galera clusters to use Linkerd? Or what use case should we be thinking through?

barkardk commented 3 years ago

Ah I see, apologies for the misunderstanding. I tried injecting linkerd using the latest edge release into already running clusters, which is what we would do in our prod envs, and that resulted in the exact same error as initially reported.
I seem to only be able to mesh brand new galera clusters, not already running ones.
So even if the opaque port setting works on new clusters, I cannot find a way to inject it into already running ones without connectivity issues.

olix0r commented 3 years ago

@barkardk I would set the opaque-ports annotation on the workloads before injecting Linkerd. This annotation being set on the unmeshed resources should allow newly meshed instances to communicate properly. It would be great if you could help confirm this!
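Roughly, something like this (StatefulSet name taken from your output above, and linkerd here is a current edge CLI); the patch targets the pod template so the annotation is already in place before the proxy is added:

# 1) set opaque-ports on the still-unmeshed workload's pod template
kubectl patch statefulset mariadbx-mariadb-galera --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/opaque-ports":"3306,4444,4567,4568"}}}}}'

# 2) then inject the proxy and let the rollout pick up both changes
kubectl get statefulset mariadbx-mariadb-galera -o yaml | linkerd inject - | kubectl apply -f -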

barkardk commented 3 years ago

@olix0r Ah, I did not get that. I will run a few tests.

barkardk commented 3 years ago

@olix0r I really cannot get this to work properly. I have tried this six ways to Sunday with pretty poor outcomes.
Now let's say I test just the 4444 SST (state transfer) port.
The galera wsrep on that port is invoked when a brand new node joins a galera cluster. We have our cluster set up to use rsync as our state transfer method.
So what happens when I join a new node to galera is that the new node notifies the master that it wants to join the cluster.
It then sets up an rsync server that listens on port 4444.

The master should initiate a tables dump (this is a blocking method), then connect to the rsync server that the client (JOINER) has set up for it and send the database files over rsync (acting as the DONOR). It knows when to start syncing once galera has written a small file called tables_dumped in /var/lib/mysql. This file never gets written; galera never seems to know it should trigger the dump when the ports are meshed as opaque using the latest linkerd RC release 21.2.4. The communication gets blocked.

rsync per se works fine with the opaque port setting. Sadly, though, it looks like the dump part of the session join never happens.
What this means for SST is that the client will wait indefinitely for the tables_dumped file to be written.
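A rough way to watch for that hand-off, assuming a bitnami-style pod and container name (mysql-1, mariadb-galera) and the path mentioned above:

# watch for the tables_dumped marker; if it never appears while 4444 is
# meshed as opaque, the SST hand-off is stuck
kubectl exec mysql-1 -c mariadb-galera -- ls -l /var/lib/mysql/tables_dumped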

And another scenario: if database nodes get out of sync they use port 4568 and IST (incremental state transfer) instead. That is a gcomm protocol. With 4568 set to opaque I have a similar issue as with 4444.
Galera will hang indefinitely waiting for the sync, even though the port communication per se is actually open.

mysql-1 mysql 2021-03-02 17:13:20 17 [Warning] WSREP: gcs_caused() returned -107 (Transport endpoint is not connected)
mysql-1 mysql 2021-03-02 17:13:20 17 [Note] WSREP: BEGIN failed, MDL released: 17
mysql-1 mysql 2021-03-02 17:13:20 19 [Note] WSREP: wsrep_sync_wait: thd->variables.wsrep_sync_wait = 7, mask = 1

I can temporarily make it look like galera works by first injecting linkerd with inject set to disabled, restarting the pods, and then enabling it afterwards. But my applications still have connection errors, and once I need to restart the pods and invoke either SST or IST over ports 4568 and/or 4444, my nodes will not report correctly to galera and will hang.

I did try the linkerd-await app, to see if galera perhaps tried to send messages before linkerd was ready to serve, but that did not change anything.
So I still can't say that the opaque protocol works with the galera replication plugin.
Exactly why, though, I have not been able to pinpoint.
With port 4444 meshed as opaque I can see that galera never finishes the table dump, although it does start the rsync server and client.
With port 4568 the IST never happens; the node that tries to sync will wait indefinitely. I haven't fully tested port 4567 (as in meshing that with 4568 and 4444 skipped) but I am about to start that.
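For that 4567-only test, the pod-template annotations would look roughly like this (same keys as earlier in the thread, purely illustrative):

annotations:
  linkerd.io/inject: enabled
  config.linkerd.io/opaque-ports: "3306,4567"
  config.linkerd.io/skip-inbound-ports: "4444,4568"
  config.linkerd.io/skip-outbound-ports: "4444,4568"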

olix0r commented 3 years ago

@barkardk Sorry for the radio silence.

We've just fixed an issue that seems likely to be at least somewhat related to this: https://github.com/linkerd/linkerd2/issues/5539. It could cause forwarded TCP streams to hang unexpectedly. I'm not 100% sure this is what you were seeing, though. I'll try to take a deeper look at this setup in the next few days. I'd encourage you to try edge-21.3.2 if you have time to test this before I do.
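Roughly, assuming the edge-21.3.2 CLI is on your path as linkerd and the workload is already annotated for injection, picking up the new edge could look like:

# regenerate and apply the control plane manifests for the new edge
linkerd upgrade | kubectl apply -f -

# restart the meshed workload so its pods pick up the new proxy version
kubectl rollout restart statefulset <your-galera-statefulset>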

barkardk commented 3 years ago

@olix0r I tested using the latest edge as you suggested.
This is what I did (a rough shell version of b and c follows the list):
a) Injected linkerd into a running galera cluster and waited for the rolling restart to complete.
This resulted in hanging nodes and an inconsistent galera cluster state. I do not recommend this.
b) Scaled replicas down to 1, injected linkerd into the statefulset, restarted the pod and waited for it to come up (this means a little downtime, of course), then scaled up to 3 replicas. This exercises the incremental state transfer (IST) port 4568 communication.
Worked perfectly, the new pods joined without any issues.
c) Scaled down to 1 pod, force deleted the PVCs for the other 2 pods and then scaled up to 3 again. This forces the rsync SST (the state transfer method we have selected) to start, as the nodes are now brand new. That is the port 4444 communication.
Had no issues.
d) Randomly restarted pods and forced a downscale of one pod for a while to verify that galera successfully healed while meshed. No issues.
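A rough shell rendering of tests b) and c); the statefulset is named mysql here as in the viz output below, and the PVC names are assumed to follow the data-mysql-N pattern:

# b) IST path: shrink to one node, inject, then grow back
kubectl scale statefulset mysql --replicas=1
kubectl get statefulset mysql -o yaml | linkerd inject - | kubectl apply -f -
kubectl rollout status statefulset mysql
kubectl scale statefulset mysql --replicas=3

# c) SST path: drop the other nodes' data so they rejoin as brand-new JOINERs over port 4444
kubectl scale statefulset mysql --replicas=1
kubectl delete pvc data-mysql-1 data-mysql-2
kubectl scale statefulset mysql --replicas=3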

linkerd viz shows this

linkerd-edge-21.3.2 viz edges po
SRC                           DST       SRC_NS        DST_NS    SECURED
mysql-0                       mysql-1   default       default   √
mysql-2                       mysql-0   default       default   √
mysql-2                       mysql-1   default       default   √
prometheus-54574df8b8-nvm86   mysql-0   linkerd-viz   default   √
prometheus-54574df8b8-nvm86   mysql-1   linkerd-viz   default   √
prometheus-54574df8b8-nvm86   mysql-2   linkerd-viz   default   √

So I am going to state that the latest edge solves our galera problem and this issue may be closed.
Thank you so much for all the hard work @olix0r

olix0r commented 3 years ago

@barkardk That's great news! I'm so glad that we've finally got this working for you :) I'd expect stable-2.10.0 shortly (hopefully tomorrow).

Closing this for now but, as always, let us know if you see anything unexpected.