uber / ringpop-go

Scalable, fault-tolerant application-layer sharding for Go applications
http://www.uber.com
MIT License

Rolling update problem #175

Open · teodor-pripoae opened 8 years ago

teodor-pripoae commented 8 years ago

Hi,

I'm experiencing some issues when doing a rolling update over my ringpop cluster.

I'm running the cluster on top of Kubernetes with a headless service for peer communication. Every DNS query to this service returns a list of all ringpop IPs in the cluster.

I implemented the Kubernetes host provider like this:

// Imports assumed for this snippet ("errors" is taken to be
// github.com/juju/errors, which provides Trace):
import (
    "fmt"
    "net"

    "github.com/juju/errors"
    "github.com/uber/ringpop-go/discovery"
)

// KubeProvider returns a list of hosts for a Kubernetes headless service
type KubeProvider struct {
    svc  string
    port int
}

func NewKubeProvider(svc string, port int) discovery.DiscoverProvider {
    provider := &KubeProvider{
        svc:  svc,
        port: port,
    }
    return provider
}

func (k *KubeProvider) Hosts() ([]string, error) {
    addrs, err := net.LookupHost(k.svc)
    if err != nil {
        return nil, errors.Trace(err)
    }

    for i := range addrs {
        addrs[i] = fmt.Sprintf("%s:%d", addrs[i], k.port)
    }
    return addrs, nil
}

During a rolling update, old ringpop services are stopped one by one and new ringpop services are created with different IPs. When a new ringpop service starts, it may see a mix of old and new IPs in the hosts list.

I'm running 2 instances in the cluster right now; one simply fails to start:

{"level":"info","msg":"GossipAddr: 10.244.3.5:18080","time":"2016-08-18T17:43:46Z"}
{"level":"error","msg":"unable to count members of the ring for statting: \"ringpop is not bootstrapped\"","time":"2016-08-18T17:43:46Z"}
{"cappedDelay":60000,"initialDelay":100000000,"jitteredDelay":58434,"level":"warning","local":"10.244.3.5:18080","maxDelay":60000000000,"minDelay":51200,"msg":"ringpop join attempt delay reached max","numDelays":10,"time":"2016-08-18T17:45:01Z","uncappedDelay":102400}
{"joinDuration":134374138254,"level":"warning","local":"10.244.3.5:18080","maxJoinDuration":120000000000,"msg":"max join duration exceeded","numFailed":12,"numJoined":0,"startTime":"2016-08-18T17:43:46.377647091Z","time":"2016-08-18T17:46:00Z"}
{"err":"join duration of 2m14.374138254s exceeded max 2m0s","level":"error","local":"10.244.3.5:18080","msg":"bootstrap failed","time":"2016-08-18T17:46:00Z"}
{"error":"join duration of 2m14.374138254s exceeded max 2m0s","level":"info","msg":"bootstrap failed","time":"2016-08-18T17:46:00Z"}
{"level":"fatal","msg":"[ringpop bootstrap failed: join duration of 2m14.374138254s exceeded max 2m0s]","time":"2016-08-18T17:46:00Z"}

The other one attempts to connect to the first and fails all the periodic health checks.

{"level":"info","msg":"GossipAddr: 10.244.1.7:18080","time":"2016-08-18T17:43:46Z"}
{"level":"error","msg":"unable to count members of the ring for statting: \"ringpop is not bootstrapped\"","time":"2016-08-18T17:43:46Z"}
{"level":"error","msg":"unable to count members of the ring for statting: \"ringpop is not bootstrapped\"","time":"2016-08-18T17:43:46Z"}
{"joined":["10.244.3.5:18080","10.244.3.5:18080"],"level":"info","msg":"bootstrap complete","time":"2016-08-18T17:43:49Z"}
{"level":"info","msg":"Running on :8080 using 1 processes","time":"2016-08-18T17:43:49Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.0.6:18080","time":"2016-08-18T17:43:49Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"ping request target unreachable","target":"10.244.3.5:18080","time":"2016-08-18T17:43:49Z"}
{"error":"join timed out","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:43:50Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"ping request target unreachable","target":"10.244.3.5:18080","time":"2016-08-18T17:43:50Z"}
{"level":"info","local":"10.244.1.7:18080","member":"10.244.3.5:18080","msg":"executing scheduled transition for member","state":"suspect","time":"2016-08-18T17:43:54Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:43:54Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.3.5:18080","time":"2016-08-18T17:44:20Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"reincarnate nodes before we can merge the partitions","target":"10.244.3.5:18080","time":"2016-08-18T17:44:20Z"}
{"error":"JSON call failed: map[type:error message:node is not ready to handle requests]","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:44:20Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:44:20Z"}
{"latency":"1.323232ms","level":"info","method":"GET","msg":"","request_id":"e21bcc3f05fa04449ab4b9f0520e0933","time":"2016-08-18T17:44:30Z","url":"/_internal/cluster-info"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:44:30Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.3.5:18080","time":"2016-08-18T17:44:50Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"reincarnate nodes before we can merge the partitions","target":"10.244.3.5:18080","time":"2016-08-18T17:44:50Z"}
{"error":"JSON call failed: map[message:node is not ready to handle requests type:error]","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:44:50Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:44:50Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.3.5:18080","time":"2016-08-18T17:45:20Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"reincarnate nodes before we can merge the partitions","target":"10.244.3.5:18080","time":"2016-08-18T17:45:20Z"}
{"error":"JSON call failed: map[type:error message:node is not ready to handle requests]","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:45:20Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:45:20Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.3.5:18080","time":"2016-08-18T17:45:50Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"reincarnate nodes before we can merge the partitions","target":"10.244.3.5:18080","time":"2016-08-18T17:45:50Z"}
{"error":"JSON call failed: map[type:error message:node is not ready to handle requests]","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:45:50Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:45:50Z"}

Eventually, the first pod times out, is restarted by the cluster manager, and then successfully connects to the second pod.

Is it related to #146?

dansimau commented 8 years ago

Hi @teodor-pripoae :)

> Is it related to #146?

How are you initialising Ringpop? Are you providing an Identity argument?

If you are, then it's possible you are affected by #146. Otherwise, if you are using the built-in identity provider, then it's not possible for ringpop to bootstrap without TChannel listening (it will throw an error).
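
For illustration, the built-in path looks roughly like this; a minimal sketch with placeholder names and a placeholder address, not code from your application:

// Sketch: with no ringpop.Identity option, ringpop takes its identity from
// the channel's listening address, so ListenAndServe must happen before
// Bootstrap.
ch, err := tchannel.NewChannel("my-app", nil)
if err != nil {
    return err
}
if err := ch.ListenAndServe("10.0.0.1:18080"); err != nil { // placeholder pod IP:port
    return err
}
rp, err := ringpop.New("my-app", ringpop.Channel(ch))
if err != nil {
    return err
}
// ...then rp.Bootstrap(opts) once the channel is listening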

teodor-pripoae commented 8 years ago

Hi,

I'm initializing it using the GossipAddr from our config. It is printed as the first line in the logs above and it is equal to the pod IP.

ch, err := tchannel.NewChannel("avcache-proxy", nil)
if err != nil {
  return nil, errors.Trace(fmt.Errorf("channel did not create successfully: %v", err))
}

rp, err := ringpop.New("avcache",
  ringpop.Channel(ch),
  ringpop.Identity(config.GossipAddr),
  ringpop.Logger(bark.NewLoggerFromLogrus(log.LogrusLogger())),
)

dansimau commented 8 years ago

Are you calling ringpop.Bootstrap before (or without) calling tchannel.ListenAndServe anywhere?

teodor-pripoae commented 8 years ago

Hi,

I'm calling ListenAndServe on the TChannel first. Here is all the code for my server:

type Server struct {
    ringpop      *ringpop.Ringpop
    channel      *tchannel.Channel
    proxy        *proxy.Proxy
    peerProvider discovery.DiscoverProvider
}

func NewPeer(proxyServer *proxy.Proxy) (*Server, error) {
    ch, err := tchannel.NewChannel("avcache-proxy", nil)
    if err != nil {
        return nil, errors.Trace(fmt.Errorf("channel did not create successfully: %v", err))
    }

    rp, err := ringpop.New("avcache",
        ringpop.Channel(ch),
        ringpop.Identity(config.GossipAddr),
        ringpop.Logger(bark.NewLoggerFromLogrus(log.LogrusLogger())),
    )
    if err != nil {
        return nil, errors.Trace(fmt.Errorf("Unable to create Ringpop: %v", err))
    }

    server := &Server{
        channel:      ch,
        ringpop:      rp,
        proxy:        proxyServer,
        peerProvider: config.PeerProvider(),
    }

    if err := server.RegisterProxy(); err != nil {
        return nil, errors.Trace(fmt.Errorf("could not register proxy handler: %v", err))
    }

    if err := server.channel.ListenAndServe(config.GossipAddr); err != nil {
        return nil, errors.Trace(fmt.Errorf("could not listen on given hostport: %v", err))
    }

    opts := new(swim.BootstrapOptions)
    opts.DiscoverProvider = server.peerProvider

    if _, err := server.ringpop.Bootstrap(opts); err != nil {
        return nil, errors.Trace(fmt.Errorf("ringpop bootstrap failed: %v", err))
    }
    return server, nil
}

func (w *Server) RegisterProxy() error {
    hmap := map[string]interface{}{"/proxy": w.ProxyHandler}

    return json.Register(w.channel, hmap, func(ctx context.Context, err error) {
        log.Debugf("error occured: %v", err)
    })
}

dansimau commented 8 years ago

I see. I don't think this is related to #146, then.

However, I have seen something similar before with a two-node cluster (see #30).

Can you try adding a third node and see if you still get the same problem?

teodor-pripoae commented 8 years ago

It behaves the same with 3 nodes after a rolling restart. I think that when one pod is stopped, the Kubernetes DNS still returns its IP, so when a fresh pod starts it will also get a few old IPs in the peer list.

Then it may try to connect to the stopped pods and fail. Am I right?
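
If so, one mitigation I could try on my side (just a sketch, I haven't deployed this) is to probe each resolved address before returning it from the provider, so addresses of stopped pods are dropped:

// Sketch only: filter out addresses that do not accept a TCP connection
// within a short timeout (needs the "time" package in addition to the
// imports above).
func (k *KubeProvider) Hosts() ([]string, error) {
    addrs, err := net.LookupHost(k.svc)
    if err != nil {
        return nil, errors.Trace(err)
    }

    hosts := make([]string, 0, len(addrs))
    for _, addr := range addrs {
        hostPort := fmt.Sprintf("%s:%d", addr, k.port)
        conn, err := net.DialTimeout("tcp", hostPort, 500*time.Millisecond)
        if err != nil {
            continue // assume this entry belongs to a stopped pod
        }
        conn.Close()
        hosts = append(hosts, hostPort)
    }
    return hosts, nil
}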

thanodnl commented 8 years ago

@teodor-pripoae It sounds very interesting to run a ringpop application on Kubernetes, but we haven't done so in the past. To try to reproduce your issues during rolling restarts, I created an example on a separate branch and have a raw README.md of the commands I ran to set up the cluster and execute a rolling upgrade. Unfortunately, I was not able to reproduce the problems you described.

If it is a DNS caching issue, I would not expect it to be a problem in a 3-node setup. From my example and testing in the linked branch, I observed that Kubernetes scales the running application up to a minimum of 4 pods during the upgrade of a replication controller with 3 replicas. Even if the old addresses showed up in the cached DNS results, the new node would still be able to connect to 3 existing nodes.

Does the linked example represent your setup, or do you have a slightly different one? Can you make changes to the example to reproduce the problems you experienced during the rolling restart? That would make debugging a lot easier for me.

For now my best guess is that you are running into a conflict between the number of hosts returned by your DiscoverProvider and the JoinSize in the swim.BootstrapOptions passed to ringpop's Bootstrap function, which defaults to 3. This could cause ringpop to fail its bootstrap when fewer than 3 nodes are online while the discovery provider reports that there should be 3 or more, potentially caused by DNS caches.
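
As a quick check, you could set JoinSize explicitly when bootstrapping. A minimal sketch, reusing the names from your NewPeer function above:

// Sketch: lower JoinSize so the bootstrap can complete even when the
// discovery provider still returns addresses of pods that are already gone.
opts := &swim.BootstrapOptions{
    DiscoverProvider: server.peerProvider,
    JoinSize:         1, // defaults to 3; here a single successful join is enough
}
if _, err := server.ringpop.Bootstrap(opts); err != nil {
    return nil, errors.Trace(fmt.Errorf("ringpop bootstrap failed: %v", err))
}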