ValeriiVozniuk opened 4 months ago
I've run some tests with the previous 1.15.6, and it behaves a lot better. From the start it forms 5+1 node groups:
01:~$ kubectl -n vault exec -it vault-1 -- vault operator members
Host Name API Address Cluster Address Active Node Version Upgrade Version Redundancy Zone Last Echo
--------- ----------- --------------- ----------- ------- --------------- --------------- ---------
vault-1 http://10.42.0.9:8200 https://vault-1.vault-internal:8201 true 1.15.6 n/a n/a n/a
01:~$ kubectl -n vault exec -it vault-3 -- vault operator members
Host Name API Address Cluster Address Active Node Version Upgrade Version Redundancy Zone Last Echo
--------- ----------- --------------- ----------- ------- --------------- --------------- ---------
vault-3 http://10.42.1.12:8200 https://vault-3.vault-internal:8201 true 1.15.6 n/a n/a n/a
vault-2 http://10.42.2.12:8200 https://vault-2.vault-internal:8201 false 1.15.6 n/a n/a 2024-06-04T09:33:22Z
vault-4 http://10.42.3.13:8200 https://vault-4.vault-internal:8201 false 1.15.6 n/a n/a 2024-06-04T09:33:22Z
vault-5 http://10.42.4.10:8200 https://vault-5.vault-internal:8201 false 1.15.6 n/a n/a 2024-06-04T09:33:21Z
vault-0 http://10.42.5.13:8200 https://vault-0.vault-internal:8201 false 1.15.6 n/a n/a 2024-06-04T09:33:26Z
And it seems that even with the database backend, Vault tends to form uneven groups, just like with the raft backend. 1.15.6 handles pod restarts better, at some point having all 6 pods in a single cluster, but then it splits into 5+1 again.
Any ideas why it behaves this way and cannot keep all nodes in a single group?
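For reference, the membership view of every pod can be compared in one pass with something like the following; a minimal sketch, assuming a 6-replica StatefulSet named vault in the vault namespace, as in the outputs above:

# print each pod's view of cluster membership
for i in 0 1 2 3 4 5; do
  echo "--- vault-$i ---"
  kubectl -n vault exec vault-$i -- vault operator members
done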
Same with the fresh 1.17.1 release:
01:~$ k exec -it vault-0 -- vault operator members
Host Name API Address Cluster Address Active Node Version Upgrade Version Redundancy Zone Last Echo
--------- ----------- --------------- ----------- ------- --------------- --------------- ---------
vault-0 http://10.42.3.31:8200 https://vault-0.vault-internal:8201 true 1.17.1 n/a n/a n/a
01:~$ k exec -it vault-1 -- vault operator members
Host Name API Address Cluster Address Active Node Version Upgrade Version Redundancy Zone Last Echo
--------- ----------- --------------- ----------- ------- --------------- --------------- ---------
vault-1 http://10.42.4.31:8200 https://vault-1.vault-internal:8201 true 1.17.1 n/a n/a n/a
01:~$ k exec -it vault-2 -- vault operator members
Host Name API Address Cluster Address Active Node Version Upgrade Version Redundancy Zone Last Echo
--------- ----------- --------------- ----------- ------- --------------- --------------- ---------
vault-2 http://10.42.0.37:8200 https://vault-2.vault-internal:8201 false 1.17.1 n/a n/a 2024-06-27T08:45:27Z
vault-3 http://10.42.1.38:8200 https://vault-3.vault-internal:8201 false 1.17.1 n/a n/a 2024-06-27T08:45:27Z
vault-5 http://10.42.2.32:8200 https://vault-5.vault-internal:8201 false 1.17.1 n/a n/a 2024-06-27T08:45:27Z
vault-4 http://10.42.5.34:8200 https://vault-4.vault-internal:8201 true 1.17.1 n/a n/a n/a
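Each pod's idea of the active node can also be checked without a token via the sys/leader endpoint; a sketch, assuming the same pod/namespace names and that the busybox wget in the official image is available:

# ask each pod which node it considers the leader (unauthenticated endpoint)
for i in 0 1 2 3 4 5; do
  echo "--- vault-$i ---"
  kubectl -n vault exec vault-$i -- wget -qO- http://127.0.0.1:8200/v1/sys/leader
done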
Describe the bug
Running vault operator members in the pods shows that the pods are in independent groups, and sometimes they are not able to find the active cluster.

To Reproduce
Steps to reproduce the behavior:
- Run vault login and provide the root token upon request.
- Run vault operator members (a concrete sequence is sketched below).
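Concretely, the sequence looks roughly like this (a sketch; any of the vault-N pods shown above can be used):

kubectl -n vault exec -it vault-0 -- sh
# inside the pod:
vault login              # paste the root token when prompted
vault operator members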
Expected behavior
No errors while running the commands above, and vault operator members shows all 6 "nodes".
Actual behavior
vault login sometimes produces an error:
URL: GET http://127.0.0.1:8200/v1/sys/ha-status
Code: 500. Errors:
and vault operator members run on different pods shows the pods split into separate groups:
Host Name API Address Cluster Address Active Node Version Upgrade Version Redundancy Zone Last Echo
vault-3 http://10.42.2.24:8200 https://vault-3.vault-internal:8201 false 1.16.3 n/a n/a 2024-05-31T10:01:35Z
vault-0 http://10.42.4.21:8200 https://vault-0.vault-internal:8201 false 1.16.3 n/a n/a 2024-05-31T10:01:31Z
vault-1 http://10.42.5.22:8200 https://vault-1.vault-internal:8201 true 1.16.3 n/a n/a n/a
Host Name API Address Cluster Address Active Node Version Upgrade Version Redundancy Zone Last Echo
vault-2 http://10.42.0.27:8200 https://vault-2.vault-internal:8201 true 1.16.3 n/a n/a n/a
Host Name API Address Cluster Address Active Node Version Upgrade Version Redundancy Zone Last Echo
vault-4 http://10.42.1.27:8200 https://vault-4.vault-internal:8201 true 1.16.3 n/a n/a n/a
Host Name API Address Cluster Address Active Node Version Upgrade Version Redundancy Zone Last Echo
vault-5 http://10.42.3.21:8200 https://vault-5.vault-internal:8201 true 1.16.3 n/a n/a n/a
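The failing call from the error above can also be reproduced against the HTTP API directly; a sketch, run from inside one of the pods, assuming a valid token is exported in VAULT_TOKEN and the image's busybox wget is used:

# query the same sys/ha-status endpoint that returned the 500 (requires authentication)
wget -qO- --header "X-Vault-Token: $VAULT_TOKEN" http://127.0.0.1:8200/v1/sys/ha-status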
Environment:
Vault Server Version (retrieve with vault status): 1.16.3
Vault CLI Version (retrieve with vault version): 1.16.3
Vault server configuration file(s):
Additional context
The problem started to appear after we updated from 1.15.6 to 1.16.2: sometimes Vault pods start but produce errors and are not able to serve secrets to clients, or they serve stale data, not seeing newly enabled auth methods, updated access policy rules, etc. Upon looking into the pod logs, we saw different errors, for example pods like vault-2 trying to find vault-5. We also noted the issue above with the vault login / vault operator members commands. Before updating to 1.16.2 we didn't have any of these issues.
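A simple way to gather those log errors across the whole set of pods, as a sketch (the namespace, pod names, and the grep pattern are assumptions to adjust):

# collect recent error lines from every pod's log
for i in 0 1 2 3 4 5; do
  echo "--- vault-$i ---"
  kubectl -n vault logs vault-$i --tail=500 | grep -i error
done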
More details about our architecture:
vault status on all nodes shows the correct Cluster Name/Cluster ID.
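That last check can be scripted the same way as the others; a sketch comparing the cluster identity and HA fields reported by vault status on each pod (pod/namespace names assumed as above):

# compare cluster identity and HA view across the pods
for i in 0 1 2 3 4 5; do
  echo "--- vault-$i ---"
  kubectl -n vault exec vault-$i -- vault status | grep -E 'Cluster Name|Cluster ID|HA Mode|Active Node Address'
done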