Closed ejether closed 8 years ago
Sounds a lot like issues we've had in the past. HAProxy, when doing a soft reload, was spawning a new process with the new config and keeping the old process around to finish requests already in progress. But the old process was also still accepting new requests, so both the old and the new config stayed live. Updating HAProxy to the most recent version fixed that issue for us.
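For context, a soft reload works roughly like this (a sketch; the config path and pidfile location are illustrative, and the command is echoed rather than executed since haproxy may not be installed where you run this):

```shell
# A soft reload starts a NEW haproxy process against the new config and
# passes -sf with the old PIDs, asking them to stop accepting connections,
# finish in-flight requests, and exit. The bug described above is an old
# process continuing to accept new connections instead of winding down.
PIDFILE=/tmp/haproxy.pid
OLD_PIDS=$(cat "$PIDFILE" 2>/dev/null || true)
echo "haproxy -f /marathon-lb/haproxy.cfg -p $PIDFILE -D -sf $OLD_PIDS"
```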
Along with that, at least on systemd, haproxy seems to have more issues reloading properly:
We are using the marathon-lb docker container which is running HAProxy 1.5.8. What version did you find fixed your problem @flosell?
We are running HAProxy 1.5.14 right now.
When this occurs, can you check how many instances of HAProxy are running? As in, do a `ps` or `pidof haproxy`. Can you also check the output of `docker ps`? I think this may be similar to #72.
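One quick way to check (assuming `pgrep` is available; the process name `haproxy` is the one shown in the listings below):

```shell
# Count haproxy processes. Long after the last reload you would expect
# only one, or one plus a few still draining; a steadily growing count
# over days indicates stale instances that never exited.
count=$(pgrep -c haproxy || true)
echo "haproxy instances: ${count:-0}"
```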
I will do that @brndnmtthws: I don't know for sure if the problem is occurring on this node (it's hard to find until something in particular starts throwing errors) but this is a typical output.
ejetherington@mesos-master-5vzd:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
65c0b9df1fb6 mesosphere/marathon-lb:v1.0.1 "/marathon-lb/run sse" 7 days ago Up 7 days marathon-lb
ejetherington@mesos-master-5vzd:~$ docker exec marathon-lb ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 20052 2988 ? Ss Feb10 0:00 /bin/bash /marathon-lb/run sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 12 0.0 0.0 4092 700 ? S Feb10 0:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 13 0.7 0.1 65372 21924 ? S Feb10 86:42 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 14 0.0 0.0 21084 3936 ? S Feb10 4:17 /bin/bash ./run
root 27 0.0 0.0 26756 6452 ? Ss Feb10 7:06 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf
root 5782 0.1 0.0 26000 5708 ? Ss Feb12 11:53 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 5770
root 13859 0.0 0.0 26000 5760 ? Ss Feb14 1:42 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 13365
root 17276 0.0 0.0 26080 5840 ? Ss Feb11 0:58 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 17264
root 17915 0.0 0.0 26080 4572 ? Ss 15:02 0:01 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 17905
root 20407 0.0 0.0 4228 680 ? S 15:43 0:00 sleep 1
root 20408 0.0 0.0 17492 2120 ? Rs 15:43 0:00 ps aux
root 22709 0.0 0.0 26348 5992 ? Ss Feb17 0:42 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 9208
ejetherington@mesos-master-5vzd:~$ ps aux | grep haproxy
root 1105 0.0 0.0 4092 700 ? S Feb10 0:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 1106 0.7 0.1 65372 21924 ? S Feb10 86:42 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 1122 0.0 0.0 26756 6452 ? Ss Feb10 7:06 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf
root 8271 0.0 0.0 26080 5840 ? Ss Feb11 0:58 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 17264
root 15134 0.0 0.0 26000 5760 ? Ss Feb14 1:42 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 13365
root 18360 0.0 0.0 26348 5992 ? Ss Feb17 0:42 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 9208
root 21907 0.0 0.0 26080 4572 ? Ss 15:02 0:01 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 17905
ejether+ 26252 0.0 0.0 10472 2116 pts/7 S+ 15:43 0:00 grep --color=auto haproxy
root 32450 0.1 0.0 26000 5708 ? Ss Feb12 11:53 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 5770
ejetherington@mesos-master-5vzd:~$
Interesting. You definitely have some extra haproxy instances there. I wonder if the use of `flock` isn't working as one might expect. Do you have any long-lived TCP connections going through haproxy?
Any chance you can test with the current master code? I made a few changes to how the reloads are handled, and I haven't seen the same behaviour in my recent testing.
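The `flock` serialization in question can be sketched with a dummy command in place of the real haproxy restart (the lock path here is illustrative, not the one the script actually uses):

```shell
# Serialize reloads: only one holder of the lock may read the pidfile
# and start the replacement haproxy at a time. A second reload arriving
# mid-flight blocks on the lock instead of racing on the pidfile.
LOCKFILE=$(mktemp)
(
  flock -x 9
  echo "reload running under lock"   # the real script would start haproxy here
) 9>"$LOCKFILE"
```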
I think it will help, and it's on my roadmap, but currently we don't have a nice way of getting the information required for the $PORTS variable in an automated fashion, so it will take a bit of time. I'll try to get it wrapped up next week and report back. Thanks for looking into it.
It would be sufficient to specify some subset of ports (or even just one port, for that matter) to test. At the very least, it wouldn't be any worse than what you're using now.
OK, I was under the impression (only because I didn't investigate too deeply) that it wouldn't work if I didn't have all the $PORTS configured. That makes it a lot easier to test. Thanks.
The only limitation is that reloads will not be completely 'zero-downtime' unless you supply the ports ahead of time.
I have upgraded the marathon-lb Docker container to the 1.1.1 tag in our development environment. I haven't noticed the problem so far, but as it was fairly unpredictable to begin with, it may not come up for some time. I'll roll with this for a while and hope it keeps working. Here is the same output as before, for comparison:
ejetherington@sandbox-mesos-slave-4h1m:~$ docker ps | grep marathon-lb
81edbad16176 mesosphere/marathon-lb:v1.1.1 "/marathon-lb/run sse" 15 hours ago Up 15 hours marathon-lb
ejetherington@sandbox-mesos-slave-4h1m:~$ docker exec marathon-lb ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 20064 2956 ? Ss 00:34 0:00 /bin/bash /marathon-lb/run sse --marathon http://sandbox-mesos-master-1:8080 --dont-bind-http-https --group *
root 19 0.0 0.0 4100 648 ? S 00:34 0:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 20 1.5 4.0 120023316 1259348 ? Sl 00:34 14:22 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://sandbox-mesos-master-1:8080 --dont-bind-http-https --group *
root 21 0.1 0.0 20916 3756 ? S 00:34 1:22 /bin/bash ./run
root 54 0.0 0.0 28972 5500 ? Ss 00:34 0:37 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf
root 999 0.0 0.0 29124 5636 ? Ss 00:40 0:17 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 918
root 1075 0.0 0.0 29644 6164 ? Ss 00:40 0:20 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 1036
root 3540 0.1 0.0 29248 5616 ? Ss 00:50 1:02 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 3507
root 17900 0.0 0.0 4236 664 ? S 16:04 0:00 sleep 0.5
root 17902 0.0 0.0 17500 1976 ? Rs 16:04 0:00 ps aux
ejetherington@sandbox-mesos-slave-4h1m:~$ ps -aux | grep haproxy
root 12957 0.0 0.0 20864 752 ? Ss Feb18 1:20 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 34
ejether+ 21382 0.0 0.0 10472 2116 pts/0 S+ 16:04 0:00 grep --color=auto haproxy
root 22282 0.0 0.0 4100 648 ? S 00:34 0:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 22283 1.5 4.0 120023316 1259348 ? Sl 00:34 14:22 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://sandbox-mesos-master-1:8080 --dont-bind-http-https --group *
root 22320 0.0 0.0 28972 5500 ? Ss 00:34 0:37 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf
root 24338 0.0 0.0 29124 5636 ? Ss 00:40 0:17 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 918
root 24418 0.0 0.0 29644 6164 ? Ss 00:40 0:20 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 1036
root 28406 0.0 0.0 19656 752 ? Ss Feb19 1:09 haproxy -f /run/haproxy.cfg -p /run/haproxy.pid -D
root 29659 0.1 0.0 29248 5616 ? Ss 00:50 1:02 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 3507
ejetherington@sandbox-mesos-slave-4h1m:~$
Cool, that looks better. I'm going to close the issue for now, but please reopen it if it comes up again.
We had an issue with this in production last night. I wasn't the one troubleshooting it, so my information is limited, but there were many haproxy instances running.
On the host:
suhas@mesos-slave-ch9u:~$ ps -ef | grep haproxy | grep marathon-lb
root 5358 12951 1 01:31 ? 00:02:01 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 8664
root 9005 12951 1 03:56 ? 00:00:44 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 17417
root 12973 12951 0 Feb23 ? 00:00:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 12974 12951 2 Feb23 ? 00:07:53 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 12977 12951 0 Feb23 ? 00:00:49 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg
root 14274 12951 0 Feb23 ? 00:00:23 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 21
root 16243 12951 1 03:04 ? 00:01:03 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 8681
root 21166 12951 0 Feb23 ? 00:00:06 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 240
root 23608 12951 0 Feb23 ? 00:03:00 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 1286
root 28843 12951 1 04:34 ? 00:00:04 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 19749
Never mind, I hadn't upgraded our production environment yet. My mistake.
Well, we are still having the issue even after upgrading in production. Some output from `ps -ef` in the Docker containers:
mesos-slave-jtr6:
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 16:59 ? 00:00:00 /bin/bash /marathon-lb/run sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 18 1 0 16:59 ? 00:00:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 19 1 7 16:59 ? 00:12:05 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 20 18 0 16:59 ? 00:00:10 /bin/bash ./run
root 35 1 0 16:59 ? 00:00:28 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 15526
root 17444 1 0 18:43 ? 00:00:09 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 17406
root 19002 1 0 18:52 ? 00:00:01 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 18970
root 19959 1 1 18:56 ? 00:00:39 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 19911
root 22529 1 0 19:13 ? 00:00:00 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 19959
root 22579 1 0 19:13 ? 00:00:01 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22529
root 22696 1 0 19:13 ? 00:00:01 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22634
root 23173 1 3 19:16 ? 00:00:39 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22964
root 25663 0 7 19:32 ? 00:00:00 ps -ef
mesos-slave-ch9u:
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 16:59 ? 00:00:00 /bin/bash /marathon-lb/run sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 18 1 0 16:59 ? 00:00:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 19 1 8 16:59 ? 00:12:27 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 20 18 0 16:59 ? 00:00:09 /bin/bash ./run
root 35 1 1 16:59 ? 00:01:37 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 15766
root 1828 1 0 17:08 ? 00:00:03 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 1272
root 2134 1 0 17:09 ? 00:00:37 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 2098
root 5367 1 0 17:28 ? 00:00:06 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 5314
root 11608 1 1 18:06 ? 00:01:11 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 9963
root 22644 1 0 19:13 ? 00:00:01 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22591
root 22706 1 0 19:13 ? 00:00:01 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22644
root 22808 1 0 19:13 ? 00:00:00 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22742
root 22974 1 0 19:14 ? 00:00:03 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22910
root 23179 1 3 19:15 ? 00:00:34 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22974
root 25675 20 0 19:32 ? 00:00:00 sleep 0.5
root 25676 0 0 19:32 ? 00:00:00 ps -ef
mesos-slave-j5q6:
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 16:59 ? 00:00:00 /bin/bash /marathon-lb/run sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 18 1 0 16:59 ? 00:00:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 19 1 9 16:59 ? 00:14:11 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 20 18 0 16:59 ? 00:00:10 /bin/bash ./run
root 11597 1 1 18:06 ? 00:01:43 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 9957
root 19976 1 1 18:56 ? 00:00:35 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 19929
root 23236 1 3 19:16 ? 00:00:37 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22993
root 25688 20 0 19:32 ? 00:00:00 sleep 0.5
root 25689 0 0 19:32 ? 00:00:00 ps -ef
mesos-slave-12r1:
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 19:31 ? 00:00:00 /bin/bash /marathon-lb/run sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 18 1 0 19:31 ? 00:00:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 19 1 5 19:31 ? 00:00:05 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 20 18 0 19:31 ? 00:00:00 /bin/bash ./run
root 51 1 4 19:31 ? 00:00:04 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf
root 336 20 0 19:32 ? 00:00:00 sleep 0.5
root 337 0 0 19:32 ? 00:00:00 ps -ef
mesos-slave-bsww:
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 16:59 ? 00:00:00 /bin/bash /marathon-lb/run sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 18 1 0 16:59 ? 00:00:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 19 1 9 16:59 ? 00:14:28 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 20 18 0 16:59 ? 00:00:11 /bin/bash ./run
root 35 1 1 16:59 ? 00:02:11 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 15713
root 17424 1 2 18:43 ? 00:01:03 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 17384
root 18855 1 0 18:52 ? 00:00:02 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 18805
root 22750 1 0 19:13 ? 00:00:02 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22709
root 22816 1 0 19:13 ? 00:00:00 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22750
root 22980 1 0 19:14 ? 00:00:05 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22914
root 23233 1 4 19:16 ? 00:00:49 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22980
root 25673 20 0 19:32 ? 00:00:00 sleep 0.5
root 25674 0 0 19:32 ? 00:00:00 ps -ef
mesos-slave-5o6w:
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 16:59 ? 00:00:00 /bin/bash /marathon-lb/run sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 18 1 0 16:59 ? 00:00:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 19 1 8 16:59 ? 00:13:44 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 20 18 0 16:59 ? 00:00:11 /bin/bash ./run
root 11583 1 2 18:06 ? 00:01:45 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 9948
root 17282 1 0 18:43 ? 00:00:08 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 11583
root 17315 1 0 18:43 ? 00:00:04 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 17282
root 17344 1 0 18:43 ? 00:00:04 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 17315
root 22826 1 0 19:14 ? 00:00:01 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22788
root 22886 1 0 19:14 ? 00:00:02 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22826
root 22954 1 0 19:14 ? 00:00:04 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22886
root 23185 1 4 19:16 ? 00:00:40 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22954
root 25649 0 8 19:32 ? 00:00:00 ps -ef
root 25659 20 0 19:32 ? 00:00:00 sleep 0.5
mesos-slave-pj18:
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 16:59 ? 00:00:00 /bin/bash /marathon-lb/run sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 18 1 0 16:59 ? 00:00:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 19 1 8 16:59 ? 00:13:12 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 20 18 0 16:59 ? 00:00:12 /bin/bash ./run
root 34 1 0 16:59 ? 00:00:36 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 15462
root 17430 1 0 18:43 ? 00:00:11 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 17395
root 23275 1 3 19:16 ? 00:00:39 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 23023
root 25712 20 0 19:32 ? 00:00:00 sleep 0.5
root 25713 0 0 19:32 ? 00:00:00 ps -ef
mesos-slave-4t3b:
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 16:59 ? 00:00:00 /bin/bash /marathon-lb/run sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 18 1 0 16:59 ? 00:00:00 /usr/bin/runsv /marathon-lb/service/haproxy
root 19 1 9 16:59 ? 00:15:06 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg -c sv reload /marathon-lb/service/haproxy --sse --marathon http://mesos-master-hzfk:8080 --marathon http://mesos-master-5vzd:8080 --marathon http://mesos-master-4fb0:8080 --dont-bind-http-https --group *
root 20 18 0 16:59 ? 00:00:13 /bin/bash ./run
root 34 1 0 16:59 ? 00:00:29 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 15336
root 17357 1 0 18:43 ? 00:00:11 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 17330
root 19965 1 1 18:56 ? 00:00:41 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 19915
root 23209 1 4 19:16 ? 00:00:41 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22954
root 25642 20 0 19:32 ? 00:00:00 sleep 0.5
root 25643 0 0 19:32 ? 00:00:00 ps -ef
Strange. Are you sure there are no long-lived TCP connections? Are you using anything like websockets?
I know that there are long-running connections and yes, that is probably the real source of our problem. I'll be doing work today to stop those long-running connections from being proxied through marathon-lb. In general, do you have any suggestions for proxying long-running connections with Mesos/Marathon?
I am seeing similar issues at https://github.com/QubitProducts/bamboo/issues/200
Have you found a solution to the stale processes? There should not be any long-running connections in our setup.
We upgraded Marathon and marathon-lb and removed long-running connections from them. Since then, the number of problems related to stale connections has dropped to almost zero. Once in a while we still get some 502 errors from haproxy; usually this coincides with a flapping service in Marathon.
I think if you can slow down the rate of reconfigurations, you are likely to avoid issues. Good luck!
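Slowing the reconfiguration rate can be approximated by debouncing reload requests. A generic sketch (this is not a marathon-lb feature; the function name and interval are made up for illustration):

```shell
# Coalesce bursts of reload requests: perform at most one real reload per
# MIN_INTERVAL seconds. Requests arriving sooner are skipped, which is safe
# because the next allowed reload picks up the latest config anyway.
MIN_INTERVAL=5
last=0
request_reload() {
  now=$(date +%s)
  if [ $((now - last)) -ge "$MIN_INTERVAL" ]; then
    last=$now
    echo "reloading haproxy"    # the real reload would run here
  else
    echo "coalesced"
  fi
}
request_reload   # first request: reloads immediately
request_reload   # arrives within the interval: coalesced
```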
We are seeing an interesting issue with v1.0.1 of marathon-lb running in SSE mode. On a reconfiguration, occasionally the older process remains listening, causing stale configuration and causing the newer process to fail.
I have not been able to reliably reproduce it, but it has caused several issues for us recently. It seems to occur when a service is flapping, or when many deployments are occurring at once. I have a theory that there is a race condition between the new process successfully starting and the next starting process reading the pidfile. (https://github.com/mesosphere/marathon-lb/blob/master/service/haproxy/run#L30)
I'm interested in any ideas anyone might have on how to reliably reproduce this, or how to guarantee it doesn't bite us anymore. We are not ready to upgrade to v1.1.0 because we don't currently have an up-front list of ports.
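The suspected race can be simulated with plain files, no haproxy required: if the next reload reads the pidfile before the previous reload's replacement process has rewritten it, the `-sf` list misses a live PID and that process lingers with the old config. Waiting for the pidfile content to change is one hypothetical mitigation:

```shell
# Simulate: the old pid (1111) is in the pidfile; the "new haproxy" only
# writes its pid (2222) after a delay. A naive second reload that reads
# the pidfile immediately would still see 1111 and miss the new process.
# The loop below waits for the pidfile to actually change first.
PIDFILE=$(mktemp)
echo "1111" > "$PIDFILE"
old=$(cat "$PIDFILE")
( sleep 0.2; echo "2222" > "$PIDFILE" ) &    # stand-in for the new process
for _ in $(seq 1 50); do
  [ "$(cat "$PIDFILE")" != "$old" ] && break
  sleep 0.1
done
wait
echo "pidfile now holds: $(cat "$PIDFILE")"
```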