Closed by Ardun21 4 years ago
How can we reproduce a performance-standby node?
We tested this against a Vault Enterprise v1.2.3 HA cluster (using Consul as the back-end). Our cluster consists of 3 Vault Enterprise nodes and 5 Consul OSS nodes, but as long as you have at least two Vault Enterprise nodes running in some sort of HA cluster, I believe you should see this issue.
I just pointed my test app directly at one of the HA standby nodes (which by default run in "performance standby" mode on Vault Enterprise) and I was able to produce the health check failure.
You can verify that a given Vault Enterprise node is in performance standby mode by checking the response body of the sys/health API:
curl -k $VAULT_ADDR/v1/sys/health | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 297 100 297 0 0 461 0 --:--:-- --:--:-- --:--:-- 461
{
"initialized": true,
"sealed": false,
"standby": true,
"performance_standby": true,
"replication_performance_mode": "disabled",
"replication_dr_mode": "disabled",
"server_time_utc": 1585758353,
"version": "1.2.3+prem",
"cluster_name": "vault-cluster-cd10e9e9",
"cluster_id": "ae4fa9b2-d2ca-33d5-2d80-53255cdbdd55"
}
If you check the response headers, you'll also see that performance standby nodes return a distinct HTTP status code, 473, which is referenced both in the above error and in the sys/health docs.
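To make the status-code behaviour concrete, here is a minimal sketch in Java that maps the default sys/health status codes to a node state. The class name `VaultHealthStatus` and the `isServing` policy are hypothetical and only for illustration; the code table reflects the defaults listed in the Vault sys/health docs, which can be overridden with query parameters.

```java
import java.util.Map;

public class VaultHealthStatus {

    // Default sys/health status codes as listed in the Vault API docs.
    private static final Map<Integer, String> DEFAULT_CODES = Map.of(
            200, "initialized, unsealed, active",
            429, "unsealed, standby",
            472, "disaster recovery secondary, active",
            473, "performance standby",
            501, "not initialized",
            503, "sealed");

    // Returns a human-readable state for a sys/health status code.
    public static String describe(int statusCode) {
        return DEFAULT_CODES.getOrDefault(statusCode, "unknown");
    }

    // Hypothetical policy (an assumption, not Vault's or Spring's):
    // treat active nodes and performance standbys as able to serve requests.
    public static boolean isServing(int statusCode) {
        return statusCode == 200 || statusCode == 473;
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 429, 473, 503}) {
            System.out.println(code + " -> " + describe(code)
                    + ", serving=" + isServing(code));
        }
    }
}
```

Because 473 is not a registered HTTP status code, a client that only recognises standard codes will treat it as unknown, which is how the exception above arises.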
Thanks. The issue occurs only when using the synchronous API; the reactive API is not affected. The underlying cause is that, prior to Spring Framework 5.2.5, Spring Framework's RestTemplate (DefaultResponseErrorHandler) consumes the error response body twice: once to assemble the error message and once for the responseBody in UnknownHttpStatusCodeException.
The issue is already addressed in Spring Framework 5.2.5 (see https://github.com/spring-projects/spring-framework/pull/24595), so the fix only requires an upgrade on your side.
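For anyone wondering why reading the body twice is a problem, here is a simplified illustration using only the JDK. It is not Spring's actual code: the class `BodyReadTwice` and its methods are hypothetical, and a `ByteArrayInputStream` stands in for the one-shot network stream of an HTTP response.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BodyReadTwice {

    // Drains whatever is left in the stream and decodes it as UTF-8.
    private static String readAll(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Naive approach: read the same stream twice.
    // The first read drains it, so the second read yields an empty string.
    public static String[] readStreamTwice(String body) {
        InputStream stream =
                new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
        String forMessage = readAll(stream);
        String forResponseBody = readAll(stream);
        return new String[] {forMessage, forResponseBody};
    }

    // Buffered approach: read the stream once and reuse the buffered content.
    public static String[] readBufferedTwice(String body) {
        InputStream stream =
                new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
        String buffered = readAll(stream);
        return new String[] {buffered, buffered};
    }

    public static void main(String[] args) {
        String body = "{\"standby\":true,\"performance_standby\":true}";
        String[] naive = readStreamTwice(body);
        String[] buffered = readBufferedTwice(body);
        System.out.println("naive second read empty: " + naive[1].isEmpty());
        System.out.println("buffered reads equal: " + buffered[0].equals(buffered[1]));
    }
}
```

The second method shows the general remedy of buffering the body once and reusing it; how Spring Framework 5.2.5 implements its fix is described in the linked pull request.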
After updating to spring-cloud-starter-vault-config version 2.2.2.RELEASE, I'm still seeing the health check report a "DOWN" status for a Vault Enterprise node that is running in performance standby mode. Looking through the history of this project, it appears that the commit to address this was added in 2.2.0, but I've tried 2.2.0, 2.2.1, and 2.2.2, and each time I get the same result when I hit the actuator/health endpoint:
org.springframework.web.client.UnknownHttpStatusCodeException: 473 status code 473: [{"initialized":true,"sealed":false,"standby":true,"performance_standby":true,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":1585688437,"version":"1.2.3+prem","cluster_name":"vault-cluster-f21cff50","cluster_id":"a5137abb-6a0c-058f-2679-00dceb119c1a"}