mesosphere / marathon

Deploy and manage containers (including Docker) on top of Apache Mesos at scale.
https://mesosphere.github.io/marathon/
Apache License 2.0

Tasks shown as running and started in Marathon, but are not actually running after reboot #1761

Closed · F21 closed this issue 7 years ago

F21 commented 9 years ago

This is a problem I have noticed for a while.

I currently have a cluster with one combined master/slave and two slaves running in Vagrant. I often find that if I reboot the cluster, a lot of tasks are shown as running and started in the Marathon web UI, but these tasks are not actually running, and checking the Mesos web UI confirms that.

The only way to fix the problem is to go into the Marathon web UI and manually restart all the apps.

In rare cases, the tasks do start and run after the reboot.

Here's a mesos-dns task that exhibits this behavior:

{
  "id": "dns",
  "cmd": "/root/go/src/github.com/mesosphere/mesos-dns/mesos-dns -v=1 -config=/root/mesos-dns-config.json",
  "cpus": 0.5,
  "mem": 20.0,
  "instances": 1,
  "ports": [4000],
  "constraints": [
    ["hostname", "CLUSTER", "mesos-master-01"]
  ],
  "healthChecks": [
    {
      "protocol": "COMMAND",
      "command": { "value": "nslookup master.mesos" },
      "maxConsecutiveFailures": 3
    }
  ]
}

I would love to provide more details so this can be debugged. Where does Marathon store its logs and diagnostics?

Marathon 0.8.2 was installed via Mesosphere's repos and is running on Ubuntu 14.10 64-bit.
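
In case it helps others reproduce the mismatch, here is a rough way to compare the two views over their HTTP APIs. This is only a minimal sketch: it assumes the default ports (Marathon on 8080, the Mesos master on 5050), no authentication, and a Mesos version that still serves /master/state.json (newer versions expose /master/state); the mesos-master-01 hostname is just this cluster's naming.

# Compare what Marathon thinks is running with what Mesos actually reports.
# Assumptions (adjust for your cluster): Marathon on mesos-master-01:8080,
# Mesos master on mesos-master-01:5050, no auth, /master/state.json available.
import json
import urllib.request

MARATHON = "http://mesos-master-01:8080"
MESOS_MASTER = "http://mesos-master-01:5050"

def get_json(url):
    # The Accept header matters for older Marathon versions.
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Task IDs Marathon believes exist.
marathon_tasks = {t["id"] for t in get_json(MARATHON + "/v2/tasks")["tasks"]}

# Task IDs Mesos actually reports as running, across all frameworks.
state = get_json(MESOS_MASTER + "/master/state.json")
mesos_running = {
    t["id"]
    for fw in state.get("frameworks", [])
    for t in fw.get("tasks", [])
    if t.get("state") == "TASK_RUNNING"
}

print("In Marathon but not running in Mesos:", sorted(marathon_tasks - mesos_running))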

F21 commented 9 years ago

Still seeing this with Marathon 0.9.0 on Mesos 0.22.1.

felixb commented 9 years ago

I'm seeing similar behavior:

When rebooting a slave, Marathon and Mesos get out of sync. Marathon shows stale tasks in the UI which cannot be deleted (other than by removing the state in ZooKeeper and switching the leader).

Mesos is showing the following logs when trying to kill the task:

I0721 07:21:42.867261 13770 master.cpp:2623] Asked to kill task develop-ci_task_1.f466bf1e-2ee9-11e5-b2f3-005056972890 of framework 20150719-163009-346470410-17100-5768-0000
W0721 07:21:42.867399 13770 master.cpp:2661] Cannot kill task develop-ci_task_1.f466bf1e-2ee9-11e5-b2f3-005056972890 of framework 20150719-163009-346470410-17100-5768-0000 (marathon) at scheduler-7ee448b9-93c5-472f-bcef-e9639fedc442@10.*.*.*:53254 because it is unknown; performing reconciliation
I0721 07:21:42.867458 13770 master.cpp:3590] Dropping reconciliation of task develop-ci_task_1.f466bf1e-2ee9-11e5-b2f3-005056972890 for framework 20150719-163009-346470410-17100-5768-0000 (marathon) at scheduler-7ee448b9-93c5-472f-bcef-e9639fedc442@10.*.*.*:53254 because there are transitional slaves

I'm running Marathon 0.9.0 on Mesos 0.22.1 as well.

zbzoch commented 9 years ago

It should be possible to delete the app in 0.10.0 - see issue https://github.com/mesosphere/marathon/issues/1853

sudokai commented 8 years ago

I'm still seeing this with Marathon 0.13.0 and Mesos 0.26.0.

akunaatrium commented 8 years ago

Yep, quite annoying. As far as I understand, Marathon should make sure everything is running and automatically restart applications/tasks in case of failures. This should be a top-priority defect to fix.

sudokai commented 8 years ago

Yeah, I even added a health check hoping that it would force Marathon to restart the task, but it's futile. I also found out that even if I delete and recreate the task, it still doesn't start. I have to actually delete /tmp/mesos for it to work.

suizman commented 8 years ago

Same here with Mesos 0.27.0 and Marathon 0.15.1.

PiotrTrzpil commented 8 years ago

Similar here, but I noticed that the tasks are eventually started again ~10 minutes after starting Mesos and Marathon.

Logs for one of the tasks:

Feb 20 11:33:17 mesos marathon[1033]: ... INFO Send kill request for task <task-info>
Feb 20 11:33:17 mesos mesos-master[1258] ... master.cpp:3445] Cannot kill task <task-info> because it is unknown; performing reconciliation
Feb 20 11:33:17 mesos mesos-master[1258] ... master.cpp:4713] Performing explicit task state reconciliation for 1 tasks of framework <marathon-info>
Feb 20 11:33:17 mesos mesos-master[1258] ... master.cpp:4786] Dropping reconciliation of task <task-info> because there are transitional slaves

Eventually reconciliation seems to succeed and Marathon starts the task (it also starts tasks with no health checks).
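
For what it's worth, the ~10 minutes line up with two defaults I'm aware of: Marathon's --reconciliation_interval flag (600000 ms, if I read the flag docs correctly) and Mesos's agent re-registration timeout (--slave_reregister_timeout, also 10 minutes by default, I believe), so this is presumably just reconciliation kicking in. To watch when Marathon's view catches up after a reboot, something like the following sketch could be used; it assumes Marathon at localhost:8080 with no authentication and simply polls /v2/apps until every app reports as many running tasks as requested.

# Poll Marathon until every app reports as many running tasks as requested.
# Sketch only: assumes Marathon at localhost:8080 with no authentication.
import json
import time
import urllib.request

MARATHON = "http://localhost:8080"

def apps():
    req = urllib.request.Request(MARATHON + "/v2/apps",
                                 headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["apps"]

while True:
    # An app has "caught up" once tasksRunning matches the requested instances.
    lagging = [a for a in apps() if a["tasksRunning"] < a["instances"]]
    if not lagging:
        print("all apps report the requested number of running tasks")
        break
    for a in lagging:
        print("%s: %d/%d tasks running" % (a["id"], a["tasksRunning"], a["instances"]))
    time.sleep(30)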

malterb commented 8 years ago

Still seeing this with Marathon v1.1.1 and Mesos v0.28.2.

Edit: updated to Mesos v1.0.0 and the same issue is happening.

Brycelol commented 8 years ago

Has anyone come across a feasible production workaround for this yet?

ghost commented 8 years ago

Same issue with Marathon v1.1.1 and Mesos v0.28.2.

At the very least, Marathon should honor the health checks, but those are ignored after a cluster reboot.

zaynetro commented 8 years ago

Can confirm the same issue with Mesos v1.0.1 and Marathon v1.3.2.

After a restart, tasks are sometimes shown as running and sometimes as deploying. Sometimes restarting the task helps; sometimes it doesn't affect the task at all. The only thing that worked for me was to scale the app to 0 and wait 10+ minutes. After that you can scale the app back to the previous value and it all works afterwards.
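
If anyone wants to script that workaround instead of clicking through the UI, roughly this is what I mean. A sketch only: it assumes Marathon at localhost:8080 with no authentication, the /nginx app id and instance count are placeholders, and temporarily suspending the app has to be acceptable for your workload.

# Script the scale-to-zero-and-back workaround described above.
# Sketch only: Marathon at localhost:8080, no auth; /nginx is a placeholder.
import json
import time
import urllib.request

MARATHON = "http://localhost:8080"
APP_ID = "/nginx"          # hypothetical app id, replace with your own
ORIGINAL_INSTANCES = 1     # the instance count to restore afterwards

def scale(app_id, instances):
    body = json.dumps({"instances": instances}).encode()
    req = urllib.request.Request(
        MARATHON + "/v2/apps" + app_id + "?force=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print("scale to %d -> HTTP %d" % (instances, resp.status))

scale(APP_ID, 0)
time.sleep(600)            # wait out the ~10 minute window mentioned above
scale(APP_ID, ORIGINAL_INSTANCES)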

While a task is stuck in the running or deploying state (when it is actually not running at all), the Mesos master logs are full of these messages:

W1009 16:18:47.433548  2752 master.cpp:4117] Cannot kill task nginx.31d11a85-8df5-11e6-99d1-ea4a079326ae of framework 7e94f16a-a6f9-4aff-9917-7c184c9e7ebf-0000 (marathon) at scheduler-371ee16a-ab66-4679-956f-c6eb5121b169@127.0.1.1:44279 because it is unknown; performing reconciliation
I1009 16:18:47.433697  2752 master.cpp:5463] Performing explicit task state reconciliation for 1 tasks of framework 7e94f16a-a6f9-4aff-9917-7c184c9e7ebf-0000 (marathon) at scheduler-371ee16a-ab66-4679-956f-c6eb5121b169@127.0.1.1:44279
W1009 16:18:47.437860  2750 master.cpp:4117] Cannot kill task mesos-dns.2bd80534-8df5-11e6-99d1-ea4a079326ae of framework 7e94f16a-a6f9-4aff-9917-7c184c9e7ebf-0000 (marathon) at scheduler-371ee16a-ab66-4679-956f-c6eb5121b169@127.0.1.1:44279 because it is unknown; performing reconciliation
I1009 16:18:47.438011  2750 master.cpp:5463] Performing explicit task state reconciliation for 1 tasks of framework 7e94f16a-a6f9-4aff-9917-7c184c9e7ebf-0000 (marathon) at scheduler-371ee16a-ab66-4679-956f-c6eb5121b169@127.0.1.1:44279

I would appreciate it if someone could explain what these messages mean.
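
Edit: my (unconfirmed) reading of these messages is that the restarted master has recovered the agent list from its registry but the agents have not re-registered yet, so the old tasks are "unknown" to it and it keeps dropping Marathon's explicit reconciliation until those "transitional" agents come back (or are removed after the re-registration timeout). A quick way to check whether all agents are back, assuming the leading master's metrics endpoint is reachable at localhost:5050 and you know how many agents to expect:

# Check whether all Mesos agents have re-registered after a master restart.
# Sketch only: leading master's metrics endpoint assumed at localhost:5050;
# EXPECTED_AGENTS is a placeholder for the number of agents in your cluster.
import json
import urllib.request

MESOS_MASTER = "http://localhost:5050"
EXPECTED_AGENTS = 3

with urllib.request.urlopen(MESOS_MASTER + "/metrics/snapshot") as resp:
    metrics = json.load(resp)

active = metrics.get("master/slaves_active", 0)
connected = metrics.get("master/slaves_connected", 0)
print("active agents: %s, connected agents: %s, expected: %d"
      % (active, connected, EXPECTED_AGENTS))
if connected < EXPECTED_AGENTS:
    print("some agents have not re-registered yet; the master will likely keep"
          " dropping reconciliation until they do (or are removed on timeout)")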

elianka commented 7 years ago

Any progress?

hampsterx commented 7 years ago

I quite often get the feeling that the Marathon UI is completely out of sync with what Mesos is reporting :(

Related: #616 Mesos and Marathon out of sync: orphaned tasks and "ghost" tasks

ParimiDev commented 7 years ago

This issue is super annoying.

meichstedt commented 7 years ago

Note: This issue has been migrated to https://jira.mesosphere.com/browse/MARATHON-2694. For more information see https://groups.google.com/forum/#!topic/marathon-framework/khtvf-ifnp8.