Marathon app instances receive traffic during health check grace period

rogoman commented 6 years ago

Issue Type:

[x] Bug report
[ ] Feature request

What happened: I have linkerd configured with the useHealthCheck flag set to true. I scaled up one of the apps in my DC/OS cluster. Its health-check grace period is set to 60 seconds because the app takes a long time to start. Unfortunately, the useHealthCheck switch only excludes instances with a failed health check, not those whose health state is still unknown, so linkerd routed requests to the new instance of my app even though it wasn't ready.

What you expected to happen: No requests should be routed to a service until Marathon marks it as healthy. Instances in a health check grace period should not receive any requests.

How to reproduce it (as minimally and precisely as possible): Configure a Marathon namer with useHealthCheck: true. Run a Marathon app with a long health-check grace period. Keep sending requests to the app via linkerd. Scale up the app and observe that the new instance receives requests before Marathon reports it as healthy.
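For anyone reproducing this, a Marathon app definition with a long grace period might look roughly like the sketch below. The app id, command, resource sizes, and health check path are hypothetical placeholders rather than values from my cluster; only the 60-second `gracePeriodSeconds` matches the setup described above.

```python
import json


def slow_start_app(app_id="/slow-start-demo", grace_period_seconds=60):
    """Build a Marathon app definition whose health check has a long grace period.

    All concrete values here (id, cmd, cpus, mem, path) are illustrative
    assumptions; adjust them for a real app.
    """
    return {
        "id": app_id,
        # Hypothetical slow start: wait 45 s before the HTTP server listens.
        "cmd": "sleep 45 && python3 -m http.server $PORT0",
        "instances": 1,
        "cpus": 0.1,
        "mem": 64,
        "portDefinitions": [{"port": 0, "name": "http"}],
        "healthChecks": [
            {
                "protocol": "HTTP",
                "path": "/",
                "portIndex": 0,
                # Failures during this window are ignored by Marathon.
                "gracePeriodSeconds": grace_period_seconds,
                "intervalSeconds": 10,
                "timeoutSeconds": 5,
                "maxConsecutiveFailures": 3,
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON body that could be POSTed to Marathon's v2/apps endpoint.
    print(json.dumps(slow_start_app(), indent=2))
```

After deploying something like this, scaling `instances` from 1 to 2 while sending requests through linkerd should show the behavior described above.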

Anything else we need to know?:

Environment:

linkerd/namerd version, config files: linkerd 1.3.1
Platform, version, and config files (Kubernetes, DC/OS, etc): DC/OS Enterprise 1.11

rogoman commented 6 years ago

One possible fix would be for the marathon.v2.Api object to also look at the app/healthChecks property in the JSON returned from v2/apps/[appId] to determine if an app has any health checks configured at all. If so, a task should be excluded from the load-balancing pool if the healthCheckResults property is an empty array.
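To make the idea concrete, here is a minimal sketch of the filtering rule in Python. It is not the linkerd implementation (which is Scala); the function name `routable_tasks` and its parameters are made up for illustration, and it assumes the app JSON carries `healthChecks` and `tasks`, with each task optionally carrying `healthCheckResults` entries that have an `alive` flag.

```python
from typing import Any, Dict, List


def routable_tasks(app: Dict[str, Any], use_health_check: bool = True) -> List[Dict[str, Any]]:
    """Return the tasks of a Marathon app that should receive traffic.

    Hypothetical helper: `app` is assumed to be the "app" object from
    v2/apps/[appId].
    """
    tasks = app.get("tasks", [])
    if not use_health_check:
        return list(tasks)

    # If the app defines no health checks at all, there is nothing to wait for.
    if not app.get("healthChecks"):
        return list(tasks)

    eligible = []
    for task in tasks:
        results = task.get("healthCheckResults") or []
        if not results:
            # Health checks are configured but no result exists yet:
            # the task is still in an unknown state (e.g. grace period).
            continue
        if all(r.get("alive") is True for r in results):
            eligible.append(task)
    return eligible
```

With this rule, a task is routable only once every reported health check result is alive; tasks with a missing or empty `healthCheckResults` are skipped whenever the app has health checks configured.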

dadjeibaah commented 6 years ago

Thanks for filing this @rogoman! If you are up for it, we would be happy to review a PR with the change you described above. We love receiving PRs from the community.

rogoman commented 6 years ago

@dadjeibaah PR ready: https://github.com/linkerd/linkerd/pull/2099

Marathon app instances receive traffic during health check grace period #2098