mesosphere / mesos-dns

DNS-based service discovery for Mesos.
https://mesosphere.github.com/mesos-dns
Apache License 2.0

Ignore inactive frameworks #517

Closed drewkerrigan closed 6 years ago

drewkerrigan commented 6 years ago

The current behavior of the record generator is to create A records for all frameworks, even those with "active": false. One specific scenario where this is problematic:

  1. Install marathon on marathon (MoM) DC/OS package with serviceID: marathon
  2. nslookup marathon.mesos and note that there are now 2 A records for marathon.mesos
  3. Uninstall MoM
  4. nslookup marathon.mesos and note that there are still 2 A records for marathon.mesos, one of them invalid

This PR addresses this by ignoring inactive frameworks.
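The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the actual mesos-dns record generator: the `Framework` and `State` types and the `activeFrameworks` helper are hypothetical names, and the sketch assumes only that each framework entry in the master state carries a `name` and an `active` field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Framework is a hypothetical, trimmed-down view of one entry in the
// "frameworks" list of the Mesos master state. It is not the mesos-dns type.
type Framework struct {
	Name   string `json:"name"`
	Active bool   `json:"active"`
}

// State holds only the part of the state snapshot this sketch needs.
type State struct {
	Frameworks []Framework `json:"frameworks"`
}

// activeFrameworks returns the frameworks whose "active" flag is true,
// so that only those would be considered for record generation.
func activeFrameworks(s State) []Framework {
	var out []Framework
	for _, f := range s.Frameworks {
		if f.Active {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	// Two frameworks share the name "marathon"; one is inactive.
	raw := []byte(`{"frameworks":[
		{"name":"marathon","active":true},
		{"name":"marathon","active":false}]}`)

	var s State
	if err := json.Unmarshal(raw, &s); err != nil {
		panic(err)
	}
	for _, f := range activeFrameworks(s) {
		fmt.Println("would generate records for:", f.Name)
	}
}
```

With this filter, the duplicate-name scenario above would leave a single set of records for `marathon`, provided the uninstalled framework is reported as inactive.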

jdef commented 6 years ago

I think this is a bit more nuanced than the solution proposed here. Consider a similar scenario, root Marathon + MoM. The root Marathon fails over, during which time mesos-dns takes a snapshot of Mesos state: the result of such a race might be that mesos-dns only reflects an entry for MoM, right? I don't think this is very desirable. I'm happy to be wrong. How does Spartan deal with something like this?

jdef commented 6 years ago

"active" means something very specific to Mesos: https://github.com/apache/mesos/blob/49c642e98a7dac911a7d21aea1c429e979def0ab/src/master/master.hpp#L2167

urbanserj commented 6 years ago

@jdef Spartan doesn't have records for frameworks. A proper fix would be to add a validation that framework names are unique, but it's not possible.

jdef commented 6 years ago

If MoM is uninstalled properly (via Mesos teardown), then how is it even showing up in the frameworks list of the state.json?

drewkerrigan commented 6 years ago

@jdef That is actually another problem altogether, related to the DC/OS UI I believe. I noticed an error / notice in the UI that said something to the effect of "Could not complete teardown of framework marathon because multiple frameworkIDs exist for that name".

If the user actually runs teardown manually with the correct frameworkID, then this problem doesn't exist because that framework gets moved to completed_frameworks.

jdef commented 6 years ago

It's not an orthogonal problem. The scenario you're solving for in the description here shouldn't happen if the MoM is being uninstalled properly.

drewkerrigan commented 6 years ago

Valid point. I still wonder whether we should be creating records for inactive frameworks, but considering the default refresh period is 60 seconds, it's probably best to leave the records there.

jdef commented 6 years ago

Well, like I said earlier, "active" means something very specific to Mesos. It does not distinguish between DC/OS "installed" vs. "uninstalled". What other use case(s) are you trying to solve for? If none, then I think we should cancel this PR.

justinrlee commented 6 years ago

I don't think there's a requirement at all that framework names be unique in Mesos. FrameworkIDs, yes, but not names.

I think this calls for a larger discussion w.r.t DC/OS - if a framework is inactive but has active tasks, should we be routing to tasks on it (and/or providing DNS records for those tasks)? How about inactive with no active tasks? In my opinion, "yes" to the former, because the underlying task health indicates that stuff is "still going on", but "no" to the latter, because it indicates that a framework was not properly deregistered.

I'm very open to being wrong here, but I was hoping we could have a quick discussion before we list this as a 'do not do'.
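The rule proposed in this comment could be sketched as a small predicate. This is only an illustration of the suggestion under discussion, not behavior that mesos-dns implements; `shouldPublish` and its parameters are hypothetical names.

```go
package main

import "fmt"

// shouldPublish is a hypothetical helper encoding the proposed rule:
//   - active framework: publish records
//   - inactive framework with running tasks: still publish
//   - inactive framework with no running tasks: do not publish
func shouldPublish(active bool, runningTasks int) bool {
	if active {
		return true
	}
	return runningTasks > 0
}

func main() {
	fmt.Println(shouldPublish(true, 0))  // active, no tasks
	fmt.Println(shouldPublish(false, 3)) // inactive, tasks still running
	fmt.Println(shouldPublish(false, 0)) // inactive, nothing running
}
```

Note that the third case cannot tell a framework that was never deregistered apart from one that legitimately has zero tasks at the moment the state is read.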

jdef commented 6 years ago

It's a big problem for mesos-dns that it doesn't watch the master for events, and instead relies on periodic snapshots. But I don't think there's any alternative, given that events emitted by the master don't actually contain all the task metadata needed to generate appropriate records. So we're stuck w/ state snapshots for now. Trying to make decisions based on state that's possibly quite stale is error-prone, and we'll never satisfy everyone's use cases. We (mesos-dns) need to pick a reasonable middle ground and try not to over-fit a particular problematic scenario. Luckily DC/OS ships w/ an additional service lookup/network load balancing solution that's much more responsive to cluster dynamics than mesos-dns is.

Framework names ARE NOT guaranteed to be unique and this has actually proved to be a problem in practice for many clusters. I'm pretty sure that Mesos is going to ship w/ a flag, at some point, that allows someone to enforce a name uniqueness constraint - and that we'll end up incorporating that into DC/OS. The possibility of duplicate names presents several problems for DC/OS clusters. https://issues.apache.org/jira/browse/MESOS-1719?focusedCommentId=16126633&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16126633

Also, your latter case doesn't guarantee that the framework was not properly deregistered: a framework could legitimately have zero tasks and be in the process of failing over.

