mesos / mesos-go

Go language bindings for Apache Mesos
Apache License 2.0

support multiple master candidates #352

Closed huahuiyang closed 5 years ago

huahuiyang commented 6 years ago

With this PR, if a Mesos framework specifies a candidate selector (which can be a dynamic func), the framework gains the ability to fail over across multiple Mesos masters. Otherwise, it relies only on the URL given as its initial configuration.

coveralls commented 6 years ago

Coverage decreased (-0.3%) to 57.547% when pulling 12603e0bbdb146dce93992a53d04c2c6d6fdaeb1 on huahuiyang:master into 5a67a247eead1fa87fa3d499fd05884b2f5f1188 on mesos:master.

huahuiyang commented 6 years ago

This adds support for multiple master API URLs when starting the framework: the client fails over across the Mesos API URLs on successive attempts. It does not break the current c.url logic.

huahuiyang commented 6 years ago

related https://github.com/mesos/mesos-go/issues/339

huahuiyang commented 6 years ago

@jdef would you like to take a look when you get a chance?

huahuiyang commented 6 years ago

@tsenart @vladimirvivien @jdef any ideas? Is this repo still being maintained? It has been a long time since this PR was created.

jdef commented 6 years ago

Thanks for the PR! This repo is maintained, and I'm basically the only maintainer at this point, so sometimes it takes me a while to get around to reviewing PRs. Thanks for being patient.

It looks like the use case you're trying to address is this:

  1. A cluster has multiple master candidates
  2. The default HTTP client is expected to always hit some "primary" master candidate (the first one in the list); if this master is down then calls will fail.
  3. The scheduler HTTP client will round-robin across all candidates if an API call results in a non-redirection error; for redirects, continue to use the candidate suggested by the result returned by Mesos.

Does that sound about right?
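
As a rough illustration of the three points above, the endpoint-selection rule could be sketched in Go as follows. This is only a sketch under stated assumptions: callResult, nextEndpoint, and errUnreachable are hypothetical names that are not part of mesos-go, and CandidateSelector simply follows the func() string shape discussed in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// CandidateSelector returns the next master candidate URL to try.
type CandidateSelector func() string

// errUnreachable is a hypothetical error standing in for any non-redirect failure.
var errUnreachable = errors.New("master unreachable")

// callResult is a hypothetical summary of one API call outcome.
type callResult struct {
	redirectTo string // non-empty when the master answered with a redirect
	err        error  // non-nil for any non-redirect failure
}

// nextEndpoint picks the endpoint for the next attempt:
//   - on a redirect, follow the location suggested by Mesos;
//   - on any other error, ask the selector for another candidate;
//   - otherwise (or without a selector), keep the current endpoint.
func nextEndpoint(current string, r callResult, sel CandidateSelector) string {
	switch {
	case r.redirectTo != "":
		return r.redirectTo
	case r.err != nil && sel != nil:
		if c := sel(); c != "" {
			return c
		}
	}
	return current
}

func main() {
	sel := CandidateSelector(func() string { return "http://master2:5050/api/v1/scheduler" })
	cur := "http://master1:5050/api/v1/scheduler"

	// A non-redirect failure moves to the candidate offered by the selector.
	cur = nextEndpoint(cur, callResult{err: errUnreachable}, sel)
	fmt.Println(cur)

	// A redirect always wins over the selector.
	cur = nextEndpoint(cur, callResult{redirectTo: "http://master3:5050/api/v1/scheduler"}, sel)
	fmt.Println(cur)
}
```

Without a selector (nil), the sketch keeps the single configured URL, which matches the "primary master only" behavior in point 2.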

These candidates are more like "bootstrap" endpoints, right? Because in a cloudy environment where servers come and go, it's possible that the initial nodes might cycle out (unless you were using floating IPs that remain the same across recycled instances).

I'm interested in the specifics of your use case. Please elaborate in the PR description.

huahuiyang commented 6 years ago

@jdef Thanks for looking at this PR; the items you listed in the comment above are pretty much correct! Setting up a load balancer in front of the Mesos masters to provide a stable endpoint is one solution, but it has some drawbacks:

  1. A load balancer becomes a required component in that scenario (DNS-based service discovery takes time to sync, so we assume people use nginx/haproxy/lvs etc. as the load balancer). Without this PR, developers are stuck if their organization has no load balancer.
  2. A load balancer that proxies to the active master introduces an extra hop, which is worth avoiding if we care a lot about performance.

In our case, we chose to use the Mesos masters' raw IP/DNS addresses directly (a.k.a. candidates in this PR). We are aware that the Mesos master addresses could change entirely, but in a highly available, fault-tolerant Mesos scheduler, the active/standby schedulers can easily do a rolling restart to load the latest master candidates if that happens. So this PR proposes an additional candidates option for initializing the Mesos framework, which does not break the original single-URL registration path and lets developers choose according to their use case.

jdef commented 6 years ago

OK. Let me then suggest the following changes to this PR:

  1. don't change the httpcli package at all, since what you really want are changes to the httpsched client; create an Option func in httpsched that handles candidate selection
  2. loosen up the candidate specification: instead of some encoded string, maybe a candidate selector option is some kind of a func() string. then you could define a "static candidate selector" that cycles over some fixed slice of candidate strings. but a user could invent a new implementation that was more dynamic and not limited to an initial, fixed set of strings. let me know if you need/want me to elaborate on this idea more.

jdef commented 6 years ago

e.g.

type CandidateSelector func() string

func FixedCandidateSelector(s []string) CandidateSelector { ... } // cycles over s
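
One possible way to fill in the elided body is sketched below. This is an illustration only, not the code that was merged: the mutex, the defensive copy of the slice, and the empty-string return for an empty list are all assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// CandidateSelector returns the next master candidate URL to try.
type CandidateSelector func() string

// FixedCandidateSelector returns a selector that cycles over a fixed
// set of candidates in order, wrapping around after the last one.
// It returns "" on every call when no candidates were supplied.
func FixedCandidateSelector(s []string) CandidateSelector {
	// Copy the input so later changes by the caller do not affect the selector.
	candidates := append([]string(nil), s...)
	var (
		mu   sync.Mutex // guards next; assumed useful if callers are concurrent
		next int
	)
	return func() string {
		if len(candidates) == 0 {
			return ""
		}
		mu.Lock()
		defer mu.Unlock()
		c := candidates[next]
		next = (next + 1) % len(candidates)
		return c
	}
}

func main() {
	sel := FixedCandidateSelector([]string{
		"http://master1:5050/api/v1/scheduler",
		"http://master2:5050/api/v1/scheduler",
	})
	// Three calls over two candidates: the third wraps back to the first.
	for i := 0; i < 3; i++ {
		fmt.Println(sel())
	}
}
```

A more dynamic selector would keep the same func() string shape but consult some external source (for example, service discovery) on each call instead of a fixed slice.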

huahuiyang commented 6 years ago

@jdef Fair enough, I've changed the PR according to your comments. With this PR, if a Mesos framework specifies a candidate selector (which can be a dynamic func), the framework gains the ability to fail over across multiple Mesos masters. Otherwise, it relies only on the URL given as its initial configuration.

an example might be as follows:

// Requires the standard "strings" package.
masters := "http://master1:5050/api/v1/scheduler," +
    "http://master2:5050/api/v1/scheduler," +
    "http://master3:5050/api/v1/scheduler"
// Split once up front instead of on every call.
candidates := strings.Split(masters, ",")
candidateIndex := 0
// Returns the next candidate on each call, wrapping around at the end.
// Not safe for concurrent use without additional locking.
candidatesRoundRobinSelector := func() string {
    if masters == "" {
        return ""
    }
    if candidateIndex >= len(candidates) {
        candidateIndex = 0
    }
    res := candidates[candidateIndex]
    candidateIndex++
    return res
}

httpsched.NewCaller(cli,
    httpsched.AllowReconnection(true),
    httpsched.MasterCandidates(candidatesRoundRobinSelector))

huahuiyang commented 6 years ago

Great. Once this change is upstream, our production systems will no longer need to maintain a forked mesos-go repo just for multiple candidates. I will handle the unit tests in another PR.

jdef commented 5 years ago

@huahuiyang please rebase so that I can merge this

huahuiyang commented 5 years ago

@jdef rebased.

huahuiyang commented 5 years ago

ping @jdef

huahuiyang commented 5 years ago

@jdef any chance for you to take a look and get this pr merged? thanks

jdef commented 5 years ago

LGTM, thanks