[question] How to improve the stability?

robsonpeixoto commented 7 years ago

Hi guys! I'm running a Mesos + Marathon + marathon-lb stack, but we are having some trouble :(

For some unknown reason, marathon-lb is showing these problems:

A new app is added or updated and marathon-lb does not pick up the service.
I have 2 instances of marathon-lb. When I open the HAProxy status page, the same service shows as healthy on one instance and unhealthy on the other.
It just stops responding, and Docker stops responding as well.
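One way to pin down the second symptom is to compare the HAProxy stats of both instances side by side. Below is a minimal Python sketch, assuming each marathon-lb instance serves HAProxy's CSV stats export at port 9090 under /haproxy?stats;csv (adjust if your setup differs). The function names are made up for illustration and are not part of marathon-lb.

```python
import csv
import io
import urllib.request


def fetch_stats(host, port=9090):
    """Download the CSV stats export from one marathon-lb instance.

    Assumption: the stats endpoint is reachable at this port and path.
    """
    url = "http://{}:{}/haproxy?stats;csv".format(host, port)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_status(stats_csv):
    """Map (proxy name, server name) -> status from a CSV stats export."""
    # The header line of the export starts with "# ", which is stripped here.
    reader = csv.DictReader(io.StringIO(stats_csv.lstrip("# ")))
    return {(row["pxname"], row["svname"]): row["status"] for row in reader}


def diff_status(a, b):
    """Return entries whose status differs, or that only one instance knows."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

Calling diff_status(parse_status(fetch_stats("lb-1")), parse_status(fetch_stats("lb-2"))), with lb-1 and lb-2 as placeholder host names, lists every backend or server that the two instances disagree on. An entry with None on one side means that instance never picked up the service.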

And I have some questions:

Is everyone running marathon in a docker container?
Is it possible to run it outside a Docker container?
How do you deal with persistent connections? Do you kill old haproxy processes?
Which Docker version are you using? And which storage driver?

I'll try to reduce the number of old haproxy processes using the solution below. What's your opinion?

We are using marathon-lb as both an L4 and an L7 load balancer. For each task that is created (deploy, task failure, ...), marathon-lb generates a new configuration version and reloads haproxy. However, it keeps the old processes running, waiting for all of their open connections to close.

To avoid this problem, I'll create two marathon-lb instances: one for HTTP (L7) services and another for TCP (L4) services (redis, thrift, ...).

On the L7 load balancer (LB), I'll use this script to kill old processes from a cron job that runs every 5 minutes. That ensures any haproxy process kept alive by a websocket connection gets killed.
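The cron script itself isn't shown in this thread. As a rough illustration of the idea only (not the script referred to above), here is a Python sketch that terminates haproxy processes older than a threshold while always sparing the youngest one, which is assumed to be the process serving the current configuration. It assumes procps `ps` with the `etimes` column is available and that it runs in the same PID namespace as haproxy (for example, inside the marathon-lb container). The function names and the 300-second threshold are illustrative choices.

```python
import os
import signal
import subprocess

MAX_AGE_SECONDS = 300  # assumption: mirrors the 5-minute cron interval


def select_stale(procs, max_age):
    """Return PIDs older than max_age seconds, sparing the youngest process.

    procs is a list of (pid, age_seconds) tuples.
    """
    if not procs:
        return []
    youngest_pid = min(procs, key=lambda p: p[1])[0]
    return sorted(
        pid for pid, age in procs if age > max_age and pid != youngest_pid
    )


def list_haproxy():
    """List (pid, age_seconds) for every process named haproxy."""
    # Separate -o options with empty headers suppress the header line.
    result = subprocess.run(
        ["ps", "-C", "haproxy", "-o", "pid=", "-o", "etimes="],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    procs = []
    for line in result.stdout.splitlines():
        pid, age = line.split()
        procs.append((int(pid), int(age)))
    return procs


def kill_stale_haproxy(max_age=MAX_AGE_SECONDS):
    """Send SIGTERM to stale haproxy processes; call this from the cron job."""
    for pid in select_stale(list_haproxy(), max_age):
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the process exited between listing and killing
```

Note that this drops whatever connections the old processes still hold, so it only makes sense on an LB where clients can reconnect.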

As our TCP services are very stable and have few deploys, they will not be affected by problems in other apps.

Any suggestions on how to make it work better?

Some of my server info:

Docker:

Server Version: 1.10.3
Storage Driver: aufs
Kernel Version: 3.13.0-112-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 6.826 GiB

Mesos: 0.28.2-2.0.27.ubuntu1404

/usr/sbin/mesos-master --cluster=jusbrasil-mesos-prod --log_dir=/var/log/mesos --logging_level=INFO --port=5050 --quorum=2 --work_dir=/tmp/mesos --zk=zk://zk-1:2181,zk-2:2181,zk-3:2181/mesos_20160425

/usr/sbin/mesos-slave --advertise_ip=10.0.0.1 --cgroups_enable_cfs --cgroups_hierarchy=/cgroup --containerizers=docker,mesos --docker_stop_timeout=50secs --executor_registration_timeout=10mins --executor_shutdown_grace_period=60secs --isolation=cgroups/cpu,cgroups/mem --log_dir=/var/log/mesos --logging_level=INFO --master=zk://zk-1:2181,zk-2:2181,zk-3:2181/mesos_20160425 --port=5051 --recover=reconnect --strict --no-switch_user --work_dir=/tmp/mesos

Marathon: marathon-1.1.7

java -Xms512m -Xmx2048m -server -jar /opt/marathon/marathon-1.1.7/target/scala-2.11/marathon-assembly-1.1.7.jar --enable_features task_killing --event_subscriber http_callback --master zk://zk-1:2181,zk-2:2181,zk-3:2181/mesos_20160425 --task_launch_timeout 600000 --task_lost_expunge_gc 75000 --task_lost_expunge_initial_delay 300000 --task_lost_expunge_interval 30000 --zk zk://zk-1:2181,zk-2:2181,zk-3:2181/marathon_20160425

Marathon-lb: v1.6.0

sse -m http://mesos-1:8080 http://mesos-2:8080 http://mesos-3:8080 --group external --group internal --syslog-socket /dev/log

Thanks

JayH5 commented 7 years ago

Hi @robsonpeixoto,

We're only using marathon-lb as an L7 load balancer with HTTP/1. I've been meaning to try setting up an L4 load balancer, but we haven't got there yet. So I'm not sure I can help that much, but I can maybe answer some of these questions...

Is everyone running marathon in a docker container?

Not marathon, no. Did you mean marathon-lb? marathon-lb we run in a container.

Is it possible to run it outside a Docker container?

Running marathon-lb outside a container should work but some functionality may break. The Lua scripts used for some of the API endpoints are designed with the assumption that only one process called "haproxy" is present on the system.

Which Docker version are you using? And which storage driver?

We currently have marathon-lb running on Docker 1.11.2 (overlay), 1.12.1 (aufs), and 1.13.1 (overlay2) on various versions of DC/OS and standalone Marathon/Mesos. Docker hasn't really been an issue for us with marathon-lb.

The only other thing I can point you to is this repo: https://github.com/praekeltfoundation/docker-marathon-lb where we override some of the default templates. Important changes include:

Disabling HTTP health checks and only using TCP health checks. While Marathon does HTTP/1.1 health checks, HAProxy does HTTP/1.0, which caused some weird inconsistencies with certain apps.
Disabling frontends for service ports. We had issues with clashing port numbers.

mikeantonelli commented 7 years ago

@JayH5 We were able to upgrade HAProxy's HTTP checks from 1.0 to 1.1 by setting the following labels in our service definitions:

  ...
  "labels": {
    "HAPROXY_0_BACKEND_HTTP_HEALTHCHECK_OPTIONS": "  option httpchk GET {healthCheckPath} HTTP/1.1\\r\\nHost:\\ www\r\n",
    "HAPROXY_0_BACKEND_HTTP_OPTIONS": ""
  },
  ...

A few notes regarding the specific formatting:

It's important to add these values using JSON and not the web forms, because the web forms escape characters.
HAPROXY_0_BACKEND_HTTP_HEALTHCHECK_OPTIONS intentionally starts with two leading spaces.
HAPROXY_0_BACKEND_HTTP_OPTIONS with an empty-string value is intentional to fix a newline issue when HAPROXY_0_BACKEND_HTTP_HEALTHCHECK_OPTIONS is overridden.

Tested With:

DC/OS 1.8.7 EE
Marathon-lb 1.4.3, 1.6.0

FWIW: I need to dig into the marathon-lb code and issue a Pull Request - we have 80 services deployed and this has become a lame patch we add to every service that wants a 1.1 health check.
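Until that lands in marathon-lb itself, the escaping can at least be kept in one place. Here is a minimal Python sketch, assuming the app definitions are generated or patched by a script before being posted to Marathon. The helper name and its parameters are made up for illustration. It builds the same two labels as in the example above and lets json.dumps handle the JSON-level escaping.

```python
import json


def http11_healthcheck_labels(host, port_index=0):
    """Build the two labels for an HTTP/1.1 health check (hypothetical helper).

    In the label value, the CRLF between the request line and the Host
    header is the literal text backslash-r backslash-n, and the space after
    "Host:" is backslash-escaped. The value ends with a real CR LF, as in
    the example above.
    """
    check = (
        "  option httpchk GET {healthCheckPath} "  # two leading spaces
        "HTTP/1.1\\r\\nHost:\\ " + host + "\r\n"
    )
    prefix = "HAPROXY_{}_BACKEND_HTTP".format(port_index)
    return {
        prefix + "_HEALTHCHECK_OPTIONS": check,
        prefix + "_OPTIONS": "",  # empty on purpose, see the notes above
    }


if __name__ == "__main__":
    print(json.dumps({"labels": http11_healthcheck_labels("www")}, indent=2))
```

For host "www" and port index 0, json.dumps yields the same escaped string as the hand-written label, so the doubled backslashes only need to be reasoned about once.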

robsonpeixoto commented 7 years ago

Thanks @JayH5

jkoelker commented 5 years ago

As the method of launching and restarting changed in v1.12, it should be better about random pauses and slow config updates. However, a client with a long-running connection open will still block it from reloading (I've seen some instances where it takes up to an hour to drain all the connections from the old processes).

mesosphere / marathon-lb

[question] How to improve the stability? #441