mesosphere / marathon-lb

Marathon-lb is a service discovery & load balancing tool for DC/OS
Apache License 2.0

[question] How to improve the stability? #441

Closed. robsonpeixoto closed this issue 5 years ago

robsonpeixoto commented 7 years ago

Hi guys! I'm using the Mesos + Marathon + Marathon-lb stack, but we are running into some trouble :(

For some unknown reason, marathon-lb is showing these problems:

And I have some questions:

I'll try to reduce the number of old haproxy processes using the solution below. What's your opinion?

We are using marathon-lb as an L4 and L7 load balancer. For every task change (deploy, task failure, ...), marathon-lb creates a new configuration version and reloads haproxy. However, it keeps the old processes running while they wait for all of their open connections to close.

To avoid this problem I'll create two marathon-lb instances: one for HTTP (L7) services and another for TCP (L4) services (redis/thrift/...).
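One way to picture the split described above: marathon-lb decides which apps an instance exposes by comparing its `--group` setting with each app's `HAPROXY_GROUP` label. The sketch below is only a simplified model of that matching, not marathon-lb's actual code. The group names `l7` and `l4`, the function name, and the sample apps are hypothetical, and the real tool also supports per-port group labels that are ignored here.

```python
def apps_for_group(apps, group):
    """Return the ids of apps whose HAPROXY_GROUP label includes `group`.

    Simplified model: the label is treated as a comma-separated list of
    group names, and apps without the label are exposed by no instance.
    """
    selected = []
    for app in apps:
        label = app.get("labels", {}).get("HAPROXY_GROUP", "")
        groups = [g.strip() for g in label.split(",") if g.strip()]
        if group in groups:
            selected.append(app["id"])
    return selected


# Hypothetical app definitions: HTTP services go to the "l7" instance,
# TCP services (redis, thrift, ...) go to the "l4" instance.
apps = [
    {"id": "/web", "labels": {"HAPROXY_GROUP": "l7"}},
    {"id": "/redis", "labels": {"HAPROXY_GROUP": "l4"}},
    {"id": "/batch", "labels": {}},
]

l7_apps = apps_for_group(apps, "l7")
l4_apps = apps_for_group(apps, "l4")
```

With this kind of split, a deploy of `/web` only triggers a reload on the instance started with the `l7` group, so long-lived TCP connections on the `l4` instance are left alone.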

On the L7 LB (load balancer) I'll run this script from cron every 5 minutes to kill old processes. It will ensure that any old haproxy process kept alive by a websocket connection is killed.
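The linked script is not reproduced in this thread, so the following is only a rough sketch of what such a cron job could look like, not that script. It assumes a Linux `ps` that supports the `etimes` (elapsed seconds) column, that every process named `haproxy` belongs to marathon-lb, and that the youngest one is the process serving the current configuration. The function names and the 300-second threshold are hypothetical.

```python
import os
import signal
import subprocess


def stale_haproxy_pids(ps_output, max_age_seconds):
    """Return PIDs of haproxy processes older than max_age_seconds.

    ps_output is the text printed by `ps -eo pid=,etimes=,comm=`, one
    process per line as "<pid> <elapsed seconds> <command name>".
    The youngest haproxy process is never returned, on the assumption
    that it is the one serving the current configuration.
    """
    procs = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[2].strip() == "haproxy":
            procs.append((int(fields[0]), int(fields[1])))
    if not procs:
        return []
    youngest_pid = min(procs, key=lambda p: p[1])[0]
    return sorted(
        pid for pid, age in procs
        if age > max_age_seconds and pid != youngest_pid
    )


def kill_stale(max_age_seconds=300):
    """Send SIGTERM to every stale haproxy process (call this from cron)."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,etimes=,comm="],
        capture_output=True, text=True, check=True,
    )
    for pid in stale_haproxy_pids(result.stdout, max_age_seconds):
        os.kill(pid, signal.SIGTERM)
```

Killing an old process drops whatever connections it still holds, so this only makes sense on an instance where clients (for example websocket clients) are expected to reconnect on their own.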

As our TCP services are very stable and have few deploys, they will not be affected by problems in other apps.

Any suggestions on how to make it work better?

Some of my server info:

/usr/sbin/mesos-slave --advertise_ip=10.0.0.1 --cgroups_enable_cfs --cgroups_hierarchy=/cgroup --containerizers=docker,mesos --docker_stop_timeout=50secs --executor_registration_timeout=10mins --executor_shutdown_grace_period=60secs --isolation=cgroups/cpu,cgroups/mem --log_dir=/var/log/mesos --logging_level=INFO --master=zk://zk-1:2181,zk-2:2181,zk-3:2181/mesos_20160425 --port=5051 --recover=reconnect --strict --no-switch_user --work_dir=/tmp/mesos

Thanks

JayH5 commented 7 years ago

Hi @robsonpeixoto,

We're only using marathon-lb as an L7 load-balancer with HTTP 1. I've been meaning to try to set up an L4 load-balancer but we haven't got there yet. So I'm not sure I can help that much, but I can maybe answer some of these questions...

Is everyone running marathon in a Docker container?

Not marathon, no. Did you mean marathon-lb? We do run marathon-lb in a container.

Is it possible to run it outside a Docker container?

Running marathon-lb outside a container should work but some functionality may break. The Lua scripts used for some of the API endpoints are designed with the assumption that only one process called "haproxy" is present on the system.

Which Docker version are you using? And which storage driver?

We currently have marathon-lb running on Docker 1.11.2 (overlay), 1.12.1 (aufs), and 1.13.1 (overlay2) on various versions of DC/OS and standalone Marathon/Mesos. Docker hasn't really been an issue for us with marathon-lb.

The only other thing I can point you to is this repo: https://github.com/praekeltfoundation/docker-marathon-lb where we override some of the default templates. Important changes include:

mikeantonelli commented 7 years ago

@JayH5 We were able to upgrade HAProxy's HTTP checks from 1.0 to 1.1 by setting the following labels in our service definitions:

  ...
  "labels": {
    "HAPROXY_0_BACKEND_HTTP_HEALTHCHECK_OPTIONS": "  option httpchk GET {healthCheckPath} HTTP/1.1\\r\\nHost:\\ www\r\n",
    "HAPROXY_0_BACKEND_HTTP_OPTIONS": ""
  },
  ...

A few notes regarding the specific formatting:

Tested With:

FWIW: I need to dig into the marathon-lb code and issue a Pull Request - we have 80 services deployed and this has become a lame patch we add to every service that wants a 1.1 health check.
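Until such a change lands in marathon-lb itself, the per-service patch could be applied by a small helper in whatever tooling generates the app definitions. The sketch below is one hypothetical way to do that: the function name and the `port_index` parameter are assumptions, and the label values are copied verbatim from the JSON snippet above rather than re-derived, so the escaping stays exactly as posted.

```python
import json

# Labels copied verbatim from the snippet in this thread. Kept as raw JSON
# text so that the backslash escaping is exactly what was posted.
LABELS_JSON = r'''
{
  "HAPROXY_0_BACKEND_HTTP_HEALTHCHECK_OPTIONS": "  option httpchk GET {healthCheckPath} HTTP/1.1\\r\\nHost:\\ www\r\n",
  "HAPROXY_0_BACKEND_HTTP_OPTIONS": ""
}
'''


def with_http11_healthcheck(app, port_index=0):
    """Return a copy of a Marathon app definition with the labels merged in.

    The original dict is left untouched. port_index selects which service
    port the labels apply to (HAPROXY_0_..., HAPROXY_1_..., and so on).
    """
    patched = dict(app)
    merged = dict(app.get("labels", {}))
    for key, value in json.loads(LABELS_JSON).items():
        # Retarget the HAPROXY_0_ prefix to the requested port index.
        merged[key.replace("HAPROXY_0_", f"HAPROXY_{port_index}_", 1)] = value
    patched["labels"] = merged
    return patched


# Hypothetical app definition used only to show the call.
app = {"id": "/web", "labels": {"HAPROXY_GROUP": "external"}}
patched = with_http11_healthcheck(app)
```

Existing labels such as `HAPROXY_GROUP` are preserved, and a label with the same key already present on the app is overwritten by the patched value.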

robsonpeixoto commented 7 years ago

Thanks @JayH5

jkoelker commented 5 years ago

As the method of launching and restarting has changed with v1.12, it should be better about random pauses and slow config updates. Although if a client has a long running connection open, that will still block it from reloading (I've seen some instances where it takes up to an hour to drain all the connections from the old processes).