mesosphere / marathon

Deploy and manage containers (including Docker) on top of Apache Mesos at scale.
https://mesosphere.github.io/marathon/
Apache License 2.0

marathon health check problem with tcp protocol #3517

Closed piagodai closed 8 years ago

piagodai commented 8 years ago

Greetings,

I'm new to Mesos and Marathon, and I'm trying to deploy my dockerized app via Marathon. I'm using Marathon 0.15.3 and Mesos 0.27.2. My app exposes port 30011, which I can telnet to in order to confirm that the app is working. So I added a health check in Marathon that uses the TCP protocol to check port 30011. The configuration is as follows:

Health Checks
[
  {
    "protocol": "TCP",
    "gracePeriodSeconds": 300,
    "intervalSeconds": 5,
    "timeoutSeconds": 20,
    "maxConsecutiveFailures": 0,
    "ignoreHttp1xx": false,
    "port": 30011
  }
]

After my app started, the health check result first showed "Healthy", but it turned "Unhealthy" within several seconds and has stayed that way ever since. I checked the Marathon logs but could not find much more detail. The logs are below:

Mar 17 15:18:54 mesos-mid-01 marathon[23934]: [2016-03-17 15:18:54,298] INFO Received health result for app [/cashier] version [2016-03-17T07:16:18.637Z]: [Unhealthy(cashier.24dec879-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:16:18.637Z,UnknownHostException: mesos-mid-04,2016-03-17T07:18:54.298Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4337)
Mar 17 15:18:59 mesos-mid-01 marathon[23934]: [2016-03-17 15:18:59,322] INFO Received health result for app [/cashier] version [2016-03-17T07:16:18.637Z]: [Unhealthy(cashier.24dec879-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:16:18.637Z,UnknownHostException: mesos-mid-04,2016-03-17T07:18:59.321Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4396)
Mar 17 15:19:04 mesos-mid-01 marathon[23934]: [2016-03-17 15:19:04,339] INFO Received health result for app [/cashier] version [2016-03-17T07:16:18.637Z]: [Unhealthy(cashier.24dec879-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:16:18.637Z,UnknownHostException: mesos-mid-04,2016-03-17T07:19:04.339Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4368)
Mar 17 15:19:19 mesos-mid-01 marathon[23934]: [2016-03-17 15:19:19,018] INFO cashier.8a5a3eaa-ec10-11e5-b02d-52540082a2fc is now healthy (mesosphere.marathon.upgrade.TaskStartActor:marathon-akka.actor.default-dispatcher-4398)
Mar 17 15:19:44 mesos-mid-01 marathon[23934]: [2016-03-17 15:19:44,009] INFO Received health result for app [/cashier] version [2016-03-17T07:19:08.913Z]: [Unhealthy(cashier.8a5a3eaa-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:19:08.913Z,SocketTimeoutException: connect timed out,2016-03-17T07:19:44.008Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4396)
Mar 17 15:19:49 mesos-mid-01 marathon[23934]: [2016-03-17 15:19:49,029] INFO Received health result for app [/cashier] version [2016-03-17T07:19:08.913Z]: [Unhealthy(cashier.8a5a3eaa-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:19:08.913Z,SocketTimeoutException: connect timed out,2016-03-17T07:19:49.029Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4376)
Mar 17 15:19:49 mesos-mid-01 marathon[23934]: [2016-03-17 15:19:49,100] INFO Received health result for app [/cashier] version [2016-03-17T07:19:08.913Z]: [Unhealthy(cashier.8a5a3eaa-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:19:08.913Z,NoRouteToHostException: No route to host,2016-03-17T07:19:49.100Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4389)
Mar 17 15:19:54 mesos-mid-01 marathon[23934]: [2016-03-17 15:19:54,049] INFO Received health result for app [/cashier] version [2016-03-17T07:19:08.913Z]: [Unhealthy(cashier.8a5a3eaa-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:19:08.913Z,SocketTimeoutException: connect timed out,2016-03-17T07:19:54.049Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4396)
Mar 17 15:19:59 mesos-mid-01 marathon[23934]: [2016-03-17 15:19:59,058] INFO Received health result for app [/cashier] version [2016-03-17T07:19:08.913Z]: [Unhealthy(cashier.8a5a3eaa-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:19:08.913Z,SocketTimeoutException: connect timed out,2016-03-17T07:19:59.058Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4400)
Mar 17 15:20:04 mesos-mid-01 marathon[23934]: [2016-03-17 15:20:04,090] INFO Received health result for app [/cashier] version [2016-03-17T07:19:08.913Z]: [Unhealthy(cashier.8a5a3eaa-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:19:08.913Z,SocketTimeoutException: connect timed out,2016-03-17T07:20:04.089Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4377)
Mar 17 15:20:14 mesos-mid-01 marathon[23934]: [2016-03-17 15:20:14,109] INFO Received health result for app [/cashier] version [2016-03-17T07:19:08.913Z]: [Unhealthy(cashier.8a5a3eaa-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:19:08.913Z,SocketTimeoutException: connect timed out,2016-03-17T07:20:14.109Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4376)
Mar 17 15:20:19 mesos-mid-01 marathon[23934]: [2016-03-17 15:20:19,151] INFO Received health result for app [/cashier] version [2016-03-17T07:19:08.913Z]: [Unhealthy(cashier.8a5a3eaa-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:19:08.913Z,SocketTimeoutException: connect timed out,2016-03-17T07:20:19.149Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4403)
Mar 17 15:20:24 mesos-mid-01 marathon[23934]: [2016-03-17 15:20:24,169] INFO Received health result for app [/cashier] version [2016-03-17T07:19:08.913Z]: [Unhealthy(cashier.8a5a3eaa-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:19:08.913Z,SocketTimeoutException: connect timed out,2016-03-17T07:20:24.169Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4389)
Mar 17 15:20:29 mesos-mid-01 marathon[23934]: [2016-03-17 15:20:29,181] INFO Received health result for app [/cashier] version [2016-03-17T07:19:08.913Z]: [Unhealthy(cashier.8a5a3eaa-ec10-11e5-b02d-52540082a2fc,2016-03-17T07:19:08.913Z,SocketTimeoutException: connect timed out,2016-03-17T07:20:29.181Z)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-4337)

I have already set the port resources on the Mesos slaves to include this TCP port, using "--resources=ports:[30000-32000, 10000-14000]".

Can anyone help me with this? Even telling me how to get more detailed information for debugging would be helpful. Thank you very much!

piagodai commented 8 years ago

I believe this is caused by a connection problem between the Mesos master and slaves and is not related to Marathon, so I'm closing this issue.
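
For anyone who lands here with the same log lines: the three failures above (UnknownHostException, SocketTimeoutException, NoRouteToHostException) were all reported by the Marathon process on mesos-mid-01 while it tried to reach the task's host. A rough way to reproduce that from the Marathon host is a small probe like the sketch below. The function name probe_tcp is made up for illustration and is not part of Marathon or Mesos; the 20 second default only mirrors timeoutSeconds from the config above.

```python
import socket


def probe_tcp(host, port, timeout=20.0):
    """Attempt a single TCP connect and return a short status string.

    Hypothetical helper for manual debugging, not a Marathon API.
    """
    try:
        # create_connection resolves the host name and then connects.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror as exc:
        # Name resolution failed (compare UnknownHostException in the logs).
        return f"dns failure: {exc}"
    except socket.timeout:
        # No answer within the timeout (compare SocketTimeoutException).
        return "connect timed out"
    except OSError as exc:
        # Refused, no route to host, and similar errors.
        return f"connect failed: {exc}"
```

Running print(probe_tcp("mesos-mid-04", 30011)) on the Marathon host should show whether the name resolves and whether the port is reachable from there. A "dns failure" result would point at /etc/hosts or DNS on that machine, while a timeout or "connect failed" result would point at routing or a firewall between the hosts.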