mesosphere / marathon-lb

Marathon-lb is a service discovery & load balancing tool for DC/OS
Apache License 2.0

marathon-lb won't start #508

Closed: vitosans closed this issue 5 years ago

vitosans commented 6 years ago

DC/OS v1.10, Marathon-LB v1.11.1

My external marathon-lb on my public agent seems to be suffering from a related bug. This is on DC/OS 1.10: when marathon-lb restarts, it never recovers. At first I thought the issue occurred when an application was deployed with:

"labels": {
    "HAPROXY_GROUP": "external"
     },

and marathon-lb restarted due to this issue. Yesterday I was able to recover by terminating all apps that have external ports mapped and then running:

dcos package install marathon-lb

That got marathon-lb back, but today it started happening again and I am not able to recover no matter what I try. My log file looks like this:

[/marathon-lb /marathon-lb/run] 80,443,9090,9091,10000,10001,10002,10003,10004,10005,10006,10007,10008,10009,10010,10011,10012,10013,10014,10015,10016,10017,10018,10019,10020,10021,10022,10023,10024,10025,10026,10027,10028,10029,10030,10031,10032,10033,10034,10035,10036,10037,10038,10039,10040,10041,10042,10043,10044,10045,10046,10047,10048,10049,10050,10051,10052,10053,10054,10055,10056,10057,10058,10059,10060,10061,10062,10063,10064,10065,10066,10067,10068,10069,10070,10071,10072,10073,10074,10075,10076,10077,10078,10079,10080,10081,10082,10083,10084,10085,10086,10087,10088,10089,10090,10091,10092,10093,10094,10095,10096,10097,10098,10099,10100 > /marathon-lb/service/haproxy/env/PORTS
[/marathon-lb /marathon-lb/run] setting sysctl params to: net.ipv4.tcp_tw_reuse=1 net.ipv4.tcp_fin_timeout=30 net.ipv4.tcp_max_syn_backlog=10240 net.ipv4.tcp_max_tw_buckets=400000 net.ipv4.tcp_max_orphans=60000 net.core.somaxconn=10000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_max_orphans = 60000
net.core.somaxconn = 10000
[/marathon-lb /marathon-lb/run] Created /marathon-lb/service/lb/run with contents:
[/marathon-lb /marathon-lb/run] #!/bin/sh
exec 2>&1
cd /marathon-lb
exec /marathon-lb/marathon_lb.py     --syslog-socket /dev/null     --haproxy-config /marathon-lb/haproxy.cfg     --ssl-certs "/etc/ssl/cert.pem"     --command "sv reload /marathon-lb/service/haproxy"     --sse -m http://marathon.mesos:8080 --health-check --haproxy-map --max-reload-retries 10 --reload-interval 10 --group external
[/marathon-lb/service/haproxy ./run] Reloading haproxy
[/marathon-lb/service/haproxy ./run] Dropping SYN packets with addFirewallRules
2017-10-20 22:00:08,755 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_ROUTING_ONLY_WITH_PATH
2017-10-20 22:00:08,755 marathon_lb: setting default value for HAPROXY_HTTPS_FRONTEND_ROUTING_ONLY_WITH_PATH_AND_AUTH
2017-10-20 22:00:08,755 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_ACL_ONLY_WITH_PATH_AND_AUTH
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_HEAD
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_BACKEND_HEAD
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_MAP_HTTP_FRONTEND_ACL
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_ACL_WITH_AUTH_AND_PATH
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_MAP_HTTP_FRONTEND_ACL_ONLY
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_HTTP_BACKEND_NETWORK_ALLOWED_ACL
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_MAP_HTTPS_FRONTEND_ACL
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_HTTPS_FRONTEND_ACL_WITH_AUTH_AND_PATH
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_BACKEND_REDIRECT_HTTP_TO_HTTPS_WITH_PATH
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_ROUTING_ONLY
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_HTTP_BACKEND_ACL_ALLOW_DENY
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_BACKEND_TCP_HEALTHCHECK_OPTIONS
2017-10-20 22:00:08,756 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_APPID_HEAD
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_HTTPS_FRONTEND_AUTH_ACL_ONLY
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_BACKEND_HSTS_OPTIONS
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_MAP_HTTP_FRONTEND_APPID_ACL
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_BACKEND_HTTP_OPTIONS
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_BACKEND_STICKY_OPTIONS
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_HTTPS_FRONTEND_ACL
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_USERLIST_HEAD
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_BACKEND_SERVER_HTTP_HEALTHCHECK_OPTIONS
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_TCP_BACKEND_ACL_ALLOW_DENY
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_BACKEND_HTTP_HEALTHCHECK_OPTIONS
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_ROUTING_ONLY_WITH_AUTH
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_ACL
2017-10-20 22:00:08,757 marathon_lb: setting default value for HAPROXY_HTTP_BACKEND_PROXYPASS_GLUE
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_ROUTING_ONLY_WITH_PATH_AND_AUTH
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_BACKEND_SERVER_TCP_HEALTHCHECK_OPTIONS
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_BACKEND_SERVER_OPTIONS
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_BACKEND_REDIRECT_HTTP_TO_HTTPS
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_APPID_ACL
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_HTTPS_FRONTEND_AUTH_REQUEST_ONLY
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_HTTP_BACKEND_REVPROXY_GLUE
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_HTTPS_FRONTEND_ACL_WITH_PATH
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_HEAD
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_HTTPS_FRONTEND_HEAD
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_HTTPS_FRONTEND_ACL_ONLY_WITH_PATH
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_FRONTEND_HEAD
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_FRONTEND_BACKEND_GLUE
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_ACL_WITH_AUTH
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_TCP_BACKEND_NETWORK_ALLOWED_ACL
2017-10-20 22:00:08,758 marathon_lb: setting default value for HAPROXY_HTTPS_FRONTEND_ACL_WITH_AUTH
2017-10-20 22:00:08,759 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_ACL_WITH_PATH
2017-10-20 22:00:08,759 marathon_lb: setting default value for HAPROXY_HTTP_BACKEND_REDIR
2017-10-20 22:00:08,759 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_ACL_ONLY_WITH_PATH
2017-10-20 22:00:08,759 marathon_lb: setting default value for HAPROXY_HTTP_FRONTEND_ACL_ONLY
2017-10-20 22:00:08,759 marathon_lb: starting event processor thread
2017-10-20 22:00:08,759 marathon_lb: SSE Active, trying fetch events from http://marathon.mesos:8080/v2/events
2017-10-20 22:00:08,759 marathon_lb: fetching apps
2017-10-20 22:00:08,767 marathon_lb: received event of type event_stream_attached
2017-10-20 22:00:08,771 marathon_lb: GET http://marathon.mesos:8080/v2/apps?embed=apps.tasks
2017-10-20 22:00:08,773 marathon_lb: got apps ['/marathon-lb', '/marathon-lb-internal', '/portworx']
2017-10-20 22:00:08,778 marathon_lb: generating config
2017-10-20 22:00:08,779 marathon_lb: HAProxy dir is /marathon-lb
2017-10-20 22:00:08,779 marathon_lb: reading running config from /marathon-lb/haproxy.cfg
2017-10-20 22:00:08,779 marathon_lb: couldn't open config file for reading
2017-10-20 22:00:08,779 marathon_lb: running config/map is different from generated config - reloading
2017-10-20 22:00:08,780 marathon_lb: writing temp file /tmp/tmp76wjr2tw that will replace /marathon-lb/domain2backend.map
2017-10-20 22:00:08,780 marathon_lb: writing temp file /tmp/tmpbjq78wa5 that will replace /marathon-lb/app2backend.map
2017-10-20 22:00:08,780 marathon_lb: writing temp file /tmp/tmphf6uabnm that will replace /marathon-lb/haproxy.cfg
2017-10-20 22:00:08,780 marathon_lb: checking config with command: ['haproxy', '-f', '/tmp/tmphf6uabnm', '-c']
[WARNING] 292/220008 (152) : Can't open server state file '/var/state/haproxy/global': No such file or directory
Configuration file is valid
2017-10-20 22:00:08,786 marathon_lb: moving temp file /tmp/tmp76wjr2tw to /marathon-lb/domain2backend.map
2017-10-20 22:00:08,786 marathon_lb: moving temp file /tmp/tmpbjq78wa5 to /marathon-lb/app2backend.map
2017-10-20 22:00:08,786 marathon_lb: moving temp file /tmp/tmphf6uabnm to /marathon-lb/haproxy.cfg
2017-10-20 22:00:08,787 marathon_lb: reloading using sv reload /marathon-lb/service/haproxy
2017-10-20 22:00:08,789 marathon_lb: Unable to get haproxy pids: Command 'pidof haproxy' returned non-zero exit status 1
ok: run: /marathon-lb/service/haproxy: (pid 33) 0s
2017-10-20 22:00:08,792 marathon_lb: Unable to get haproxy pids: Command 'pidof haproxy' returned non-zero exit status 1
2017-10-20 22:00:08,792 marathon_lb: Waiting for new haproxy pid (old pids: [set()], new_pids: [set()])...
[/marathon-lb/service/haproxy ./run] addFirewallRules done
[/marathon-lb/service/haproxy ./run] Saving the current HAProxy state
[/marathon-lb/service/haproxy ./run] Done saving the current HAProxy state
cat: /tmp/haproxy.pid: No such file or directory
[/marathon-lb/service/haproxy ./run] LATEST_HAPROXY_PID: []
[/marathon-lb/service/haproxy ./run] /marathon-lb/haproxy_wrapper.py /usr/local/sbin/haproxy -D -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf  200>&-
2017-10-20 22:00:08,865 haproxy_wrapper: create_haproxy_pipe called
2017-10-20 22:00:08,865 haproxy_wrapper: create_haproxy_pipe done
2017-10-20 22:00:08,865 haproxy_wrapper: wait_on_haproxy_pipe called
[WARNING] 292/220008 (162) : Can't read first line of the server state file '/var/state/haproxy/global'
[ALERT] 292/220008 (162) : sendmsg logger #1 failed: No such file or directory (errno=2)
[ALERT] 292/220008 (162) : sendmsg logger #2 failed: No such file or directory (errno=2)
[ALERT] 292/220008 (162) : sendmsg logger #1 failed: No such file or directory (errno=2)
[ALERT] 292/220008 (162) : sendmsg logger #2 failed: No such file or directory (errno=2)
[ALERT] 292/220008 (162) : sendmsg logger #1 failed: No such file or directory (errno=2)
[ALERT] 292/220008 (162) : sendmsg logger #2 failed: No such file or directory (errno=2)
[ALERT] 292/220008 (162) : sendmsg logger #1 failed: No such file or directory (errno=2)
[ALERT] 292/220008 (162) : sendmsg logger #2 failed: No such file or directory (errno=2)
2017-10-20 22:00:08,870 haproxy_wrapper: close_and_swallow called
2017-10-20 22:00:08,871 haproxy_wrapper: close_and_swallow successful
2017-10-20 22:00:08,871 haproxy_wrapper: close_and_swallow called
2017-10-20 22:00:08,871 haproxy_wrapper: close_and_swallow swallow OSError: [Errno 9] Bad file descriptor
2017-10-20 22:00:08,871 haproxy_wrapper: wait_on_haproxy_pipe done (False)
[/marathon-lb/service/haproxy ./run] exit code: 0
[/marathon-lb/service/haproxy ./run] Removing firewall rules with removeFirewallRules
2017-10-20 22:00:08,895 marathon_lb: new pids: [{163}]
2017-10-20 22:00:08,895 marathon_lb: reload finished, took 0.10831427574157715 seconds
2017-10-20 22:00:08,895 marathon_lb: updating tasks finished, took 0.13614583015441895 seconds
[/marathon-lb/service/haproxy ./run] removeFirewallRules done
[/marathon-lb/service/haproxy ./run] Reloading haproxy
[/marathon-lb/service/haproxy ./run] Reload finished
[/marathon-lb/service/haproxy ./run] Dropping SYN packets with addFirewallRules
[/marathon-lb/service/haproxy ./run] addFirewallRules done
[/marathon-lb/service/haproxy ./run] Saving the current HAProxy state
[/marathon-lb/service/haproxy ./run] Done saving the current HAProxy state
[/marathon-lb/service/haproxy ./run] LATEST_HAPROXY_PID: [163]
[/marathon-lb/service/haproxy ./run] /marathon-lb/haproxy_wrapper.py /usr/local/sbin/haproxy -D -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -sf 163 200>&-
2017-10-20 22:00:10,550 haproxy_wrapper: create_haproxy_pipe called
2017-10-20 22:00:10,550 haproxy_wrapper: create_haproxy_pipe done
2017-10-20 22:00:10,551 haproxy_wrapper: wait_on_haproxy_pipe called
[ALERT] 292/220010 (493) : sendmsg logger #1 failed: No such file or directory (errno=2)
[ALERT] 292/220010 (493) : sendmsg logger #2 failed: No such file or directory (errno=2)
[ALERT] 292/220010 (493) : sendmsg logger #1 failed: No such file or directory (errno=2)
[ALERT] 292/220010 (493) : sendmsg logger #2 failed: No such file or directory (errno=2)
[ALERT] 292/220010 (493) : sendmsg logger #1 failed: No such file or directory (errno=2)
[ALERT] 292/220010 (493) : sendmsg logger #2 failed: No such file or directory (errno=2)
[ALERT] 292/220010 (493) : sendmsg logger #1 failed: No such file or directory (errno=2)
[ALERT] 292/220010 (493) : sendmsg logger #2 failed: No such file or directory (errno=2)
2017-10-20 22:00:10,556 haproxy_wrapper: close_and_swallow called
2017-10-20 22:00:10,556 haproxy_wrapper: close_and_swallow successful
2017-10-20 22:00:10,556 haproxy_wrapper: close_and_swallow called
2017-10-20 22:00:10,556 haproxy_wrapper: close_and_swallow swallow OSError: [Errno 9] Bad file descriptor
2017-10-20 22:00:10,556 haproxy_wrapper: wait_on_haproxy_pipe done (False)
[/marathon-lb/service/haproxy ./run] exit code: 0
[/marathon-lb/service/haproxy ./run] Removing firewall rules with removeFirewallRules
[/marathon-lb/service/haproxy ./run] removeFirewallRules done
[/marathon-lb/service/haproxy ./run] Reload finished
2017-10-20 22:00:13,263 marathon_lb: received event of type event_stream_detached
2017-10-20 22:01:11,965 marathon_lb: received event of type failed_health_check_event
2017-10-20 22:01:16,986 marathon_lb: received event of type failed_health_check_event
2017-10-20 22:01:22,006 marathon_lb: received event of type failed_health_check_event
2017-10-20 22:01:22,007 marathon_lb: received event of type unhealthy_instance_kill_event
2017-10-20 22:01:22,021 marathon_lb: received event of type status_update_event
2017-10-20 22:01:22,021 marathon_lb: fetching apps
2017-10-20 22:01:22,025 marathon_lb: received event of type instance_changed_event

So the interesting things that stand out are:

[WARNING] 292/220008 (152) : Can't open server state file '/var/state/haproxy/global': No such file or directory

and

2017-10-20 22:00:08,787 marathon_lb: reloading using sv reload /marathon-lb/service/haproxy
2017-10-20 22:00:08,789 marathon_lb: Unable to get haproxy pids: Command 'pidof haproxy' returned non-zero exit status 1
ok: run: /marathon-lb/service/haproxy: (pid 33) 0s
2017-10-20 22:00:08,792 marathon_lb: Unable to get haproxy pids: Command 'pidof haproxy' returned non-zero exit status 1
2017-10-20 22:00:08,792 marathon_lb: Waiting for new haproxy pid (old pids: [set()], new_pids: [set()])...

and

2017-10-20 22:00:08,870 haproxy_wrapper: close_and_swallow called
2017-10-20 22:00:08,871 haproxy_wrapper: close_and_swallow successful
2017-10-20 22:00:08,871 haproxy_wrapper: close_and_swallow called
2017-10-20 22:00:08,871 haproxy_wrapper: close_and_swallow swallow OSError: [Errno 9] Bad file descriptor

This just started happening to me today; yesterday the symptoms were much different than today.
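
For anyone else debugging this, a few quick checks from the public agent can narrow it down; the 9090 health endpoint below is the marathon-lb default as far as I know:

pidof haproxy                                  # is any haproxy process alive at all?
netstat -tlnp | grep -E ':(80|443|9090) '      # is anything bound to the LB ports?
curl -s http://localhost:9090/_haproxy_health  # marathon-lb's own health endpoint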

deric commented 6 years ago

@vitosans The logs look quite normal. Are you sure you're talking to the correct Marathon instance? http://marathon.mesos:8080 is the DC/OS system Marathon, which means the app with the external label must be installed in "Services", not in the marathon-user instance. Also consider adding a HAPROXY_0_PORT or HAPROXY_0_VHOST label.
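
For example, a minimal app that marathon-lb should pick up would look something like this (the image, vhost, and service port are just placeholders):

cat > nginx-external.json <<'EOF'
{
  "id": "/nginx-external",
  "cpus": 0.1,
  "mem": 64,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "nginx:alpine",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 80, "hostPort": 0, "servicePort": 10000 }
      ]
    }
  },
  "labels": {
    "HAPROXY_GROUP": "external",
    "HAPROXY_0_VHOST": "nginx.example.com"
  }
}
EOF
dcos marathon app add nginx-external.json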

vitosans commented 6 years ago

Hi @deric

I was just following the DC/OS tutorial here:

https://dcos.io/docs/1.10/networking/marathon-lb/marathon-lb-advanced-tutorial/

and running:

dcos package install marathon-lb

to set up both internal and external marathon-lb. In the past, on 1.9, this was a non-issue, but right now I can reproduce it 100% of the time: the external instance refuses to start. I have even reinstalled the public slave, thinking I might have goofed on something. The internal instance has no issue at all.
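
For comparison, the internal instance was installed per that tutorial, roughly like this (the option values are from memory of the tutorial, so treat them as approximate):

cat > marathon-lb-internal-options.json <<'EOF'
{
  "marathon-lb": {
    "name": "marathon-lb-internal",
    "haproxy-group": "internal",
    "bind-http-https": false,
    "role": ""
  }
}
EOF
dcos package install --options=marathon-lb-internal-options.json marathon-lb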

I see this in the logs:

2017-10-23 18:08:17,075 marathon_lb: received event of type failed_health_check_event
2017-10-23 18:08:22,093 marathon_lb: received event of type failed_health_check_event
2017-10-23 18:08:27,115 marathon_lb: received event of type failed_health_check_event
2017-10-23 18:08:27,116 marathon_lb: received event of type unhealthy_instance_kill_event
2017-10-23 18:08:27,129 marathon_lb: received event of type status_update_event
2017-10-23 18:08:27,129 marathon_lb: received event of type instance_changed_event
2017-10-23 18:08:27,129 marathon_lb: fetching apps

but I don't see what the failed event is. I have tried getting a shell on the container running the external instance to reproduce the failed health check. What concerns me is that this is a fresh install, nothing changed, everything done by the book, except that I have added portworx to test. I will try your suggestions and then read marathon_lb.py to see if I can recreate the health check it's failing.

kennethjiang commented 6 years ago

I have the same issue here. The weird thing is that one marathon-lb instance always has a green health check, but when I tried to scale it to 2 instances, the new one always failed with exactly the same error messages described in this issue.

When I was on DC/OS version 1.9 I never had this issue.

My DC/OS cluster is serving production traffic. It's nerve-wracking to see the production environment running on a single haproxy, which may run into the same issue any minute!

wobes commented 6 years ago

I was having this same issue after our cluster suffered a network outage over the weekend. We have two public slaves, and on one of them marathon-lb would not start. We saw the following error in the task log by running:

dcos marathon task list --json /marathon-lb

message": "Task was killed since health check failed. Reason: BufferOverflowException: Exceeded configured max-open-requests value of [32]. This means that the request queue of this pool (HostConnectionPoolSetup(10.9.x.x,9090,ConnectionPoolSetup(ConnectionPool Settings(4,0,5,32,1,30 seconds,ClientConnectionSettings(Some(User-Agent: akka-http/10.0.6),10 seconds,1 minute,512,None,,List(),ParserSettings(2048,16,64,64,8192,64,268435456,256,1048576,Strict,RFC6265,true,Full,Error,Map(If-Range -> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, Content-MD5 -> 0, Date -> 0, If-Match -> 0, If-None-Match -> 0, User-Agent -> 32),false,,,)),TCPTransport(None,ClientConnectionSettings(Some(User-Agent: akka-http/10.0.6),10 seconds,1 minute ,512,None,,List(),ParserSettings(2048,16,64,64,8192,64,268435456,256,1048576,Strict,RFC6265,true,Full,Error,Map(If-Range -> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, Content-MD5 -> 0, Date -> 0, If-Match -> 0, If-None-Match -> 0, U ser-Agent -> 32),false,,,)))),akka.http.scaladsl.HttpConnectionContext$@549e393c,akka.event.MarkerLoggingAdapter@3a0c7578))) has completely filled up because the pool currently does not process requests fast enough to handle the incomi ng request load. Please retry the request later. See http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.",

I was able to fix it by forcing a leader election, by running the following command on the leader node:

systemctl restart dcos-mesos-master

After a new leader was elected, marathon-lb started up and the health check passed successfully.
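
In case it saves someone a step: leader.mesos is served by Mesos-DNS, so you can find which master is currently leading before restarting it (the ssh line is just a sketch of however you normally reach your masters):

dig +short leader.mesos                                   # IP of the current leading master
ssh <leader-ip> sudo systemctl restart dcos-mesos-master  # force a new election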

pramodhkp commented 6 years ago

Hi, I'm facing a similar issue. It's healthy on one node but unhealthy on the other. The above-mentioned fix didn't work for me. Is there any other workaround?

mimmus commented 6 years ago

Similar issues here during an upgrade from 1.9.4 to 1.10.4. Apparently solved only by uninstalling and reinstalling marathon-lb, with much downtime.

justinrlee commented 6 years ago

@wobes what version of DC/OS are you using? It sounds like you had an issue with Mesos overall, but I can't tell from your logs.

@pramodhkp what errors are you seeing in the Marathon-LB log?

@mimmus How many masters do you have, and how many public agents? What version of Marathon-LB are/were you on? Did you first upgrade to 1.9.7 per the DC/OS upgrade instructions? If you have 3+ masters, you should be able to upgrade in place, and I wouldn't expect Marathon-LB to require a reinstall (at most it should require a restart, but it shouldn't even require that).

mimmus commented 6 years ago

@justinrlee

justinrlee commented 6 years ago

@mimmus What errors and symptoms did you see when your Marathon-LB stopped working? Do you happen to have any logs?

Actually, if you were seeing Marathon-LB continually restarting, that's potentially caused by this: https://jira.mesosphere.com/browse/MARATHON-7572

The 'fix' is to switch from HTTP to MESOS_HTTP healthchecks. I have a hunch that this is what you saw.
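
If it helps, the change lives in the app's healthChecks array. Here is a sketch of what the marathon-lb check might look like after the switch (the path and portIndex are the package defaults as far as I recall; copy the rest from your existing app definition):

cat > healthcheck-patch.json <<'EOF'
{
  "healthChecks": [
    {
      "protocol": "MESOS_HTTP",
      "path": "/_haproxy_health",
      "portIndex": 2,
      "gracePeriodSeconds": 60,
      "intervalSeconds": 5,
      "timeoutSeconds": 5,
      "maxConsecutiveFailures": 3
    }
  ]
}
EOF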

mimmus commented 6 years ago

@justinrlee I'm sorry, I saved no logs :( Marathon-lb stopped working just after the cluster upgrade completed. I had two kinds of problems:

pramodhkp commented 6 years ago

@justinrlee Mine was a false alarm. We had one more service running on one of the service ports that was causing an issue.

wobes commented 6 years ago

@justinrlee The version at the time we had this initial issue was 1.10.1; however, we have also seen it with 1.10.3. The issue seems to be with the health check using connection pooling (akka-http). If you have access to the Mesosphere support tickets, see Request #8942.

Here is the response on that request:

"Update on the issue Marathon team acknowledged that connection pooling of HTTP healtchecks (that is causing this issue) is undesirable for the pattern of requests we are making. https://jira.mesosphere.com/browse/MARATHON-7940 has been filed to help address the connection pooling issue that happens when the socket fails to close.

We do not know which release of DC/OS that change will be implemented in, but for the time being we would recommend using MESOS_HTTP healthchecks instead of the now-deprecated Marathon HTTP healthchecks."

So the workaround is to modify the health check that the masters use to check the health of marathon-lb. Also, as I mentioned before, forcing a leader election cleared this issue for us.
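
One way to apply that change from the CLI (assuming a patch file like the healthChecks sketch in the earlier comment; dcos reads the JSON update from stdin):

dcos marathon app update /marathon-lb < healthcheck-patch.json

Marathon then redeploys marathon-lb with the MESOS_HTTP check in place.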

mimmus commented 6 years ago

This morning, serious problems: marathon-lb is continuously reloading and I'm unable to recover. The suggested workarounds don't work.

vespian commented 6 years ago

I believe that this issue and https://github.com/mesosphere/marathon-lb/issues/504 refer to the same problem.

Please check my reply to https://github.com/mesosphere/marathon-lb/issues/504 [1]. I am going to close this issue. Let me know if there is anything else I can help with. Thanks!

[1] https://github.com/mesosphere/marathon-lb/issues/504#issuecomment-398883193