openservicemesh / osm

Open Service Mesh (OSM) is a lightweight, extensible, cloud native service mesh that allows users to uniformly manage, secure, and get out-of-the-box observability features for highly dynamic microservice environments.
https://openservicemesh.io/
Apache License 2.0

Kubernetes probes fail for an extra ~60s when Envoy health checks are enabled #3829

Closed · nojnhuh closed this issue 1 year ago

nojnhuh commented 3 years ago

Bug description:

When Envoy active health checks are enabled, configured Kubernetes probes (of any type: startup, liveness, or readiness) fail for an extra 60s when the Pod starts up compared to when health checks are not enabled. After that, they succeed normally.
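For reference, the probes in question are the ones OSM's sidecar injector rewrites to point at the Envoy sidecar. Roughly, an affected startup probe ends up looking like the sketch below; the `/osm-startup-probe` path and port 15903 are taken from the kube-probe request logged further down in this report, and the container name and timing are made up for illustration.

```yaml
# Sketch of a probe after OSM's injector rewrites it to target the Envoy sidecar.
# The path and port match the kube-probe request seen in the Envoy logs below;
# everything else here is illustrative, not the demo's exact manifest.
containers:
- name: bookstore                  # illustrative container name
  startupProbe:
    httpGet:
      path: /osm-startup-probe     # rewritten from the app's original probe path
      port: 15903                  # Envoy listener that serves the probe
    periodSeconds: 1               # illustrative timing
```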

I've already dug into this a bit and here is what I've found so far:

Click me for a lot more details

I've pushed the changes I used to help debug this to [my fork](https://github.com/nojnhuh/osm/tree/healthcheck-probe-debug). To summarize, I added a `/ok` endpoint on the bookwarehouse that responds with HTTP 200, configured bookstore-v1's health check for the bookwarehouse to use that endpoint, and removed the HTTP health check filter from the bookwarehouse, all to try to get down to the simplest possible health check configuration.

Running the demo with these changes (and Prometheus enabled) yields results I wouldn't expect. The first thing I noticed is that Prometheus doesn't even seem able to scrape metrics from bookstore-v1 until it's been up for one minute.

![envoy-uptime](https://user-images.githubusercontent.com/16093815/126565318-3bed8151-be8f-4846-94ca-77ed217c2aa4.jpeg)

> `envoy_server_uptime{source_namespace="bookstore"}`

Looking at the health check statistics, even though I configured health checks to run once per second, I see a one-minute gap where no health checks occur, even after one was recorded as successful. The following graphs show the one-minute gap in which 2 attempts were recorded, 1 of which was successful, after which health checks run and succeed once per second:

![envoy-health-checks](https://user-images.githubusercontent.com/16093815/126565436-34e87405-666b-48ed-97f8-ac6900f19e66.jpeg)

> `envoy_cluster_health_check_attempt`
> `envoy_cluster_health_check_success`

The Envoy container logs for the bookstore-v1 Pod match the results from these graphs. The first three health checks I see in the logs are each one minute apart:

[2021-07-20 21:00:46.827][1][debug][hc] [source/common/upstream/health_checker_impl.cc:295] [C1] connection/stream error health_flags=/failed_active_hc
[2021-07-20 21:01:46.752][1][debug][hc] [source/common/upstream/health_checker_impl.cc:342] [C3] hc response=200 health_flags=/failed_active_hc
[2021-07-20 21:02:46.683][1][debug][hc] [source/common/upstream/health_checker_impl.cc:342] [C3] hc response=200 health_flags=healthy

(full logs and envoy configs for each bookstore: https://gist.github.com/nojnhuh/dbc3072c0ca9f2c3a554b424b50cbfa3)

Each health check log after those is the same as the third, where `health_flags=healthy`, and each occurs 1 second after the previous one. It seems odd to me that the second health check has `response=200` but `health_flags=/failed_active_hc`.

When comparing Envoy logs with bookstore-v2, which is not configured to health check the bookwarehouse, it's pretty clear where the differences start. The following messages appear in the bookstore-v2 logs:

[2021-07-20 21:00:49.790][1][debug][init] [source/common/init/watcher_impl.cc:14] init manager Cluster bookwarehouse/bookwarehouse/local initialized, notifying ClusterImplBase
[2021-07-20 21:00:49.790][1][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:968] adding TLS cluster bookwarehouse/bookwarehouse/local

bookstore-v1's Envoy has the same first message, but then they differ. Eliding some messages:

[2021-07-20 21:00:47.148][1][debug][init] [source/common/init/watcher_impl.cc:14] init manager Cluster bookwarehouse/bookwarehouse/local initialized, notifying ClusterImplBase
[2021-07-20 21:00:50.957][1][debug][main] [source/server/server.cc:219] Envoy is not fully initialized, skipping histogram merge and flushing stats

A group of repeated messages throughout the logs seems to indicate the first one-minute gap of no health checks. This group includes messages like:

[2021-07-20 21:00:55.958][1][debug][main] [source/server/server.cc:219] Envoy is not fully initialized, skipping histogram merge and flushing stats
[2021-07-20 21:00:55.975][1][debug][upstream] [source/common/upstream/logical_dns_cluster.cc:159] DNS refresh rate reset for osm-controller.osm-system.svc.cluster.local, refresh rate 5000 ms
[2021-07-20 21:00:56.899][1][debug][upstream] [source/common/upstream/upstream_impl.cc:279] transport socket match, socket default selected for host with address 0.0.0.0:14001
[2021-07-20 21:00:56.899][1][debug][upstream] [source/common/upstream/strict_dns_cluster.cc:170] DNS refresh rate reset for 0.0.0.0, refresh rate 5000 ms

Following 12 of these groups (which reference a 5000 ms DNS refresh rate), the second health check occurs and the `adding TLS cluster` message from bookstore-v2 appears:

[2021-07-20 21:01:46.752][1][debug][hc] [source/common/upstream/health_checker_impl.cc:342] [C3] hc response=200 health_flags=/failed_active_hc
[2021-07-20 21:01:46.752][1][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:968] adding TLS cluster bookwarehouse/bookwarehouse/local

Then, Envoy begins receiving Kubernetes probes and responding successfully:

[2021-07-20 21:01:49.282][23][debug][http] [source/common/http/conn_manager_impl.cc:882] [C4][S15199497057362930131] request headers complete (end_stream=true):
':authority', '10.244.0.10:15903'
':path', '/osm-startup-probe'
':method', 'GET'
'user-agent', 'kube-probe/1.21'
'accept', '*/*'
'connection', 'close'

Other requests, like those from Prometheus to scrape metrics, from the bookbuyer, and to the bookwarehouse, also start being logged and succeeding at the same time. However, no health checks occur for another minute. Comparing the logs with bookstore-v2, it's not clear why health checks aren't running every second as they're configured to.

These results are very similar even when bookstore-v1 defines no Kubernetes probes at all. The main difference is that the Pod is marked Ready just a few seconds after it starts up vs. a minute later. The Envoy logs and Prometheus metrics show the same patterns though.

One hint I found in the [Envoy source code](https://github.com/envoyproxy/envoy/blob/98c1c9e9a40804b93b074badad1cdf284b47d58b/source/server/server.cc#L210) is a comment saying that Envoy's workers will not start until Envoy has been fully initialized, which the bookstore-v1 logs do indicate following the initial failing health check. That would also explain why none of the other HTTP traffic comes through until the health check passes and Envoy is initialized.

Overall, I'm most curious as to why the health checks don't run once per second as they're configured to for the first two minutes. If they did, then I would expect to see Kubernetes probes passing at about the same time as when no health checks are defined, at least within a health check period.
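For reference, the health check I gave bookstore-v1's Envoy for the bookwarehouse cluster boils down to something like the sketch below. This is not the exact generated config (see the gist above for the real ones); the cluster name is copied from the log lines, and the timeout and threshold values are illustrative.

```yaml
# Rough equivalent of the active health check in the debug fork: an HTTP check
# against the /ok endpoint added on the bookwarehouse, once per second.
clusters:
- name: bookwarehouse/bookwarehouse/local   # cluster name as it appears in the logs
  health_checks:
  - timeout: 1s                # illustrative
    interval: 1s               # checks configured to run once per second
    unhealthy_threshold: 1     # illustrative
    healthy_threshold: 1       # illustrative
    http_health_check:
      path: /ok                # debug endpoint that responds with HTTP 200
```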

Affected area (please mark with X where applicable):

Expected behavior: Kubernetes probes succeed at about the same time as when health checks are not configured.

Steps to reproduce the bug (as precisely as possible): Run the automated demo with health checks enabled (OSM installed with `--set OpenServiceMesh.featureFlags.enableEnvoyActiveHealthChecks=true`).
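That `--set` flag corresponds to a Helm values override along these lines (a sketch; the key path is taken verbatim from the flag above):

```yaml
# values.yaml equivalent of the --set flag used for the repro
OpenServiceMesh:
  featureFlags:
    enableEnvoyActiveHealthChecks: true
```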

How was OSM installed?: Automated demo

Anything else we need to know?:

Environment:

allenlsy commented 3 years ago

I encountered the same issue before. Then I found this:

https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/health_check.proto.html#envoy-v3-api-field-config-core-v3-healthcheck-no-traffic-healthy-interval

When a cluster has received no traffic, Envoy performs active HC according to `no_traffic_interval` (and, once the host is healthy, `no_traffic_healthy_interval`), which defaults to 60s.

I just did a local Envoy test confirming that, if there is traffic between Envoy and the upstream, the HC follows the regular `interval`.

I think the books demo might be following this scenario if permissive mode is disabled initially.

This shouldn't be a big problem for services that have traffic from the beginning. One problem I can see is that, since active HC effectively has an initial interval of 60s, if an endpoint becomes unhealthy within that first 60s, Envoy will not mark it as unhealthy immediately; it will wait until the next HC.
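If anyone wants to experiment, shortening that interval in the generated cluster config would look roughly like the sketch below (an illustrative Envoy v3 `health_checks` entry, not something OSM exposes as a setting today as far as I know; the path and values are placeholders):

```yaml
health_checks:
- timeout: 1s
  interval: 1s                 # used once the cluster has seen traffic
  no_traffic_interval: 5s      # defaults to 60s; shorten so checks start soon
                               # even before any traffic reaches the cluster
  unhealthy_threshold: 1
  healthy_threshold: 1
  http_health_check:
    path: /ok                  # placeholder health check endpoint
```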

github-actions[bot] commented 2 years ago

Added default label size/needed. Please consider re-labeling this issue appropriately.

keithmattix commented 2 years ago

Are we still encountering this bug?

github-actions[bot] commented 1 year ago

This issue will be closed due to a long period of inactivity. If you would like this issue to remain open then please comment or update.

github-actions[bot] commented 1 year ago

Issue closed due to inactivity.