mesosphere / mesos-dns

DNS-based service discovery for Mesos.
https://mesosphere.github.com/mesos-dns
Apache License 2.0

Health awareness #310

Open tsenart opened 8 years ago

tsenart commented 8 years ago

Mesos-DNS as a service discovery system should be health-aware. This doesn't mean that it can guarantee healthiness of the returned service instances, only that it does its best to direct clients to capable ones.

With that in mind, we should take into consideration the TaskStatus.healthy field and work with the Marathon and Mesos teams to promote the use of Mesos native health checks.

imriz commented 6 years ago

Is this still true? Will Mesos-DNS publish unhealthy instances even if they use Mesos native health checks in Marathon (MESOS_HTTP(S))?

jdef commented 6 years ago

I don't think anyone is working on this.


imriz commented 5 years ago

This is a really needed feature. Currently, mesos-dns will happily announce unhealthy instances, which puts the burden of figuring out health on the client (which might need a few retries to reach a healthy instance).

Looking at https://github.com/mesosphere/mesos-dns/blob/master/records/state/state.go#L193 this seems to be quite simple? The statuses hash in the state JSON (the same place where the task state is) will contain the healthy boolean if the task has a health check configured and running. If not, the field is omitted.

So it seems a really easy fix would be to omit the record if the healthy field is there, and is set to false.

Any thoughts about it?

jdef commented 5 years ago

Related: https://lists.apache.org/thread.html/f79dbb92a0a43c00548ee503a0abbe3e1dd983511747ee77f2fd7966@%3Cdev.mesos.apache.org%3E

imriz commented 5 years ago

I would also like to be able to distinguish between "grace period has not passed yet" and "no health check defined", but for now, having no health awareness at all is worse than not distinguishing these two scenarios. If we treat a missing healthy field as "healthy" (and publish the record), we keep backward compatibility by not affecting tasks without health checks, with the trade-off of publishing unhealthy instances during their grace period (which is already happening today anyway).

Bottom line is that this feature request has gone unanswered for years, and I bet a lot of this project's users would like to see it implemented, even if it does not cover every scenario today (maybe add a config flag to enable or disable it).
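A rough sketch of how the opt-in behaviour could look, under stated assumptions: `Config.EnforceHealth` is a hypothetical flag name (mesos-dns has no such option today, as far as this thread shows), and `Health`, `classify`, and `shouldPublish` are illustrative helpers. A missing healthy field maps to an "unknown" state that is still published, which covers both "no health check defined" and "grace period has not passed yet".

```go
package main

import "fmt"

// Health is a tri-state view of the optional healthy boolean.
type Health int

const (
	HealthUnknown Health = iota // field missing: no check defined, or no result yet
	HealthPassing               // healthy == true
	HealthFailing               // healthy == false
)

// classify maps the optional boolean onto the tri-state.
func classify(healthy *bool) Health {
	switch {
	case healthy == nil:
		return HealthUnknown
	case *healthy:
		return HealthPassing
	default:
		return HealthFailing
	}
}

// Config carries the hypothetical opt-in flag; the name is an assumption.
type Config struct {
	EnforceHealth bool
}

// shouldPublish keeps today's behaviour when the flag is off and
// drops only tasks that explicitly report healthy == false when it is on.
func shouldPublish(cfg Config, healthy *bool) bool {
	if !cfg.EnforceHealth {
		return true
	}
	return classify(healthy) != HealthFailing
}

func main() {
	yes, no := true, false
	cases := map[string]*bool{"missing": nil, "true": &yes, "false": &no}
	for _, enforce := range []bool{false, true} {
		cfg := Config{EnforceHealth: enforce}
		for _, name := range []string{"missing", "true", "false"} {
			fmt.Printf("enforce=%v healthy=%s publish=%v\n",
				enforce, name, shouldPublish(cfg, cases[name]))
		}
	}
}
```

With the flag off nothing changes for existing users; with it on, the only records removed are those for tasks whose status explicitly says `healthy: false`.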