Performance Standby still causing health check to fail

Ardun21 commented 4 years ago

After updating to spring-cloud-starter-vault-config version 2.2.2.RELEASE, I'm still seeing the health check report a "DOWN" status for a Vault Enterprise node that is running in performance standby mode. Looking through the history of this project, it appears the commit to address this was added in 2.2.0, but I've tried 2.2.0, 2.2.1, and 2.2.2, and each time I get the same result when I hit the actuator/health endpoint:

org.springframework.web.client.UnknownHttpStatusCodeException: 473 status code 473: [{"initialized":true,"sealed":false,"standby":true,"performance_standby":true,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":1585688437,"version":"1.2.3+prem","cluster_name":"vault-cluster-f21cff50","cluster_id":"a5137abb-6a0c-058f-2679-00dceb119c1a"}

mp911de commented 4 years ago

How can we reproduce a performance-standby node?

Ardun21 commented 4 years ago

We tested this against a Vault Enterprise v1.2.3 HA cluster (using Consul as the back-end). Our cluster consists of 3 Vault Enterprise nodes and 5 Consul OSS nodes, but as long as you have at least two Vault Enterprise nodes running in some sort of HA cluster, I believe you should see this issue.

I just pointed my test app directly at one of the HA standby nodes (which by default run in "performance standby" mode on Vault Enterprise) and I was able to reproduce the health check failure.

You can verify that a given Vault Enterprise node is in performance standby mode by checking the response body of the sys/health API:

curl -k $VAULT_ADDR/v1/sys/health | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   297  100   297    0     0    461      0 --:--:-- --:--:-- --:--:--   461
{
  "initialized": true,
  "sealed": false,
  "standby": true,
  "performance_standby": true,
  "replication_performance_mode": "disabled",
  "replication_dr_mode": "disabled",
  "server_time_utc": 1585758353,
  "version": "1.2.3+prem",
  "cluster_name": "vault-cluster-cd10e9e9",
  "cluster_id": "ae4fa9b2-d2ca-33d5-2d80-53255cdbdd55"
}

If you check the response headers, you'll also see that performance standby nodes return a distinct HTTP status code, 473, which is referenced both in the above error and in the sys/health docs.
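
For reference, here is a minimal Java sketch of how I read the default sys/health status codes. The class and method names (VaultHealthCodes, describe, isServing) are hypothetical and are not part of spring-vault or the Vault client. The code-to-state table reflects my understanding of the sys/health docs, and treating standby nodes as "serving" is my own assumption about what a health check would want:

```java
import java.util.Map;

/** Hypothetical helper: maps default sys/health status codes to a readable node state. */
final class VaultHealthCodes {

    // Default status codes as I understand them from the sys/health docs.
    private static final Map<Integer, String> STATES = Map.of(
            200, "initialized, unsealed, active",
            429, "unsealed, standby",
            472, "disaster recovery secondary",
            473, "performance standby",
            501, "not initialized",
            503, "sealed");

    /** Returns a short description for a status code, or "unknown" if it is not in the table. */
    static String describe(int statusCode) {
        return STATES.getOrDefault(statusCode, "unknown");
    }

    /** Assumption: active, standby, and performance standby nodes all count as serving. */
    static boolean isServing(int statusCode) {
        return statusCode == 200 || statusCode == 429 || statusCode == 473;
    }

    public static void main(String[] args) {
        System.out.println(describe(473));
        System.out.println(isServing(473));
    }
}
```

Under that assumption a 473 response would be a healthy node in performance standby mode rather than a failure, which is the behavior I expected from the health check.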

mp911de commented 4 years ago

Thanks. The issue occurs only when using the synchronous API. The reactive API is not affected. The underlying cause is that, prior to Spring Framework 5.2.5, RestTemplate's DefaultResponseErrorHandler consumes the error response body twice: once to assemble the error message and once for the responseBody in UnknownHttpStatusCodeException.
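
To illustrate the general problem of consuming a body twice, here is a minimal sketch using only the JDK. It is not the actual Spring Framework code; BodyReadTwice and its methods are hypothetical names, and a ByteArrayInputStream stands in for the HTTP response body:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: a one-shot stream yields nothing when it is read a second time. */
final class BodyReadTwice {

    private static String readAll(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the same stream twice; the second element is empty because the stream is exhausted. */
    static String[] readTwice(String body) {
        InputStream in = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
        String forMessage = readAll(in);
        String forResponseBody = readAll(in);
        return new String[] {forMessage, forResponseBody};
    }

    /** Reads the stream once into a buffer and reuses the buffered bytes for both purposes. */
    static String[] readOnceAndReuse(String body) {
        InputStream in = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
        String buffered = readAll(in);
        return new String[] {buffered, buffered};
    }

    public static void main(String[] args) {
        String json = "{\"standby\":true,\"performance_standby\":true}";
        System.out.println("second read empty: " + readTwice(json)[1].isEmpty());
        System.out.println("buffered copy intact: " + readOnceAndReuse(json)[1].equals(json));
    }
}
```

Buffering the body once and reusing the bytes avoids depending on a stream that has already been drained.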

The issue is already addressed in Spring Framework 5.2.5 (see https://github.com/spring-projects/spring-framework/pull/24595), so this only requires an upgrade on your side.

spring-cloud / spring-cloud-vault

Performance Standby still causing health check to fail #397