mesosphere / marathon-lb

Marathon-lb is a service discovery & load balancing tool for DC/OS
Apache License 2.0

marathon-lb reload bug #602

Open Sisphyus opened 6 years ago

Sisphyus commented 6 years ago

Last week, while updating a core service in our production environment (built on DC/OS), we accidentally misconfigured its health check. External requests returned 503 the whole time, until we corrected the health check configuration and restarted the service. Throughout the incident the old instance was shown as healthy on the Marathon page, so we think something went wrong when marathon-lb reloaded.

Why does the old, healthy instance stop serving traffic after we deploy a bad health check? As far as we know, nothing changes on the old healthy instance when a new, unhealthy instance of the same application is launched.

Test and verification (marathon-lb version 1.12.1)

  1. Launch a new nginx test application (listening on port 80) with the health check on port 80.
  2. Change the health check port to 81. Marathon launches a new instance that never becomes healthy; at this point the nginx backend in haproxy.cfg contains two servers.
  3. Test external access.

haproxy.cfg

before reload

backend nginx-lbl-test_10278
  balance roundrobin
  mode http
  option forwardfor
  http-request set-header X-Forwarded-Port %[dst_port]
  http-request add-header X-Forwarded-Proto https if { ssl_fc }
  server 10_168_0_82_9_0_5_7_80 9.0.5.7:80 check inter 5s fall 4 port 80

after reload

backend nginx-lbl-test_10278
  balance roundrobin
  mode http
  option forwardfor
  http-request set-header X-Forwarded-Port %[dst_port]
  http-request add-header X-Forwarded-Proto https if { ssl_fc }
  server 10_168_0_82_9_0_5_7_80 9.0.5.7:80 check inter 5s fall 4 port 81
  server 10_168_0_82_9_0_5_12_80 9.0.5.12:80 check inter 5s fall 4 port 81

So why was the old instance's health check configuration updated as well?
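
To make the change explicit, here is a small Python sketch that compares the `server` lines of the two snippets above and reports every server whose check port differs between them. It is only an illustration, not part of marathon-lb: the helper names `check_ports` and `changed_checks` are hypothetical, and the regular expression assumes the `server <name> <addr> check ... port <n>` layout shown in this issue.

```python
import re

# The `server` lines copied from the two haproxy.cfg snippets in this issue.
BEFORE = """\
backend nginx-lbl-test_10278
  server 10_168_0_82_9_0_5_7_80 9.0.5.7:80 check inter 5s fall 4 port 80
"""

AFTER = """\
backend nginx-lbl-test_10278
  server 10_168_0_82_9_0_5_7_80 9.0.5.7:80 check inter 5s fall 4 port 81
  server 10_168_0_82_9_0_5_12_80 9.0.5.12:80 check inter 5s fall 4 port 81
"""

# Assumes the layout used above: server <name> <addr> check ... port <n>
SERVER_RE = re.compile(
    r"^\s*server\s+(\S+)\s+(\S+)\s+check\b.*?\bport\s+(\d+)", re.M
)


def check_ports(cfg):
    """Map server name -> (address, check port) for every `server` line."""
    return {name: (addr, int(port)) for name, addr, port in SERVER_RE.findall(cfg)}


def changed_checks(before, after):
    """Return {server name: (old port, new port)} for servers present in
    both configs whose check port differs."""
    old, new = check_ports(before), check_ports(after)
    return {
        name: (old[name][1], new[name][1])
        for name in old
        if name in new and old[name][1] != new[name][1]
    }


if __name__ == "__main__":
    for name, (was, now) in changed_checks(BEFORE, AFTER).items():
        print(f"{name}: check port {was} -> {now}")
```

For the snippets above it flags only the old server `10_168_0_82_9_0_5_7_80`, whose check port moves from 80 to 81 even though the instance itself was not redeployed.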

This is a serious problem when updating applications in production: HAProxy failover stops working as soon as a bad health check is deployed, even though the old healthy instance is still alive.