Closed: F21 closed this issue 7 years ago.
Still seeing this with Marathon 0.9.0 on Mesos 0.22.1.
I'm seeing similar behavior:
When rebooting a slave, Marathon and Mesos get out of sync. Marathon shows stale tasks in the UI which cannot be deleted (other than removing the state in ZooKeeper and switching the leader).
Mesos is showing the following logs when trying to kill the task:
I0721 07:21:42.867261 13770 master.cpp:2623] Asked to kill task develop-ci_task_1.f466bf1e-2ee9-11e5-b2f3-005056972890 of framework 20150719-163009-346470410-17100-5768-0000
W0721 07:21:42.867399 13770 master.cpp:2661] Cannot kill task develop-ci_task_1.f466bf1e-2ee9-11e5-b2f3-005056972890 of framework 20150719-163009-346470410-17100-5768-0000 (marathon) at scheduler-7ee448b9-93c5-472f-bcef-e9639fedc442@10.*.*.*:53254 because it is unknown; performing reconciliation
I0721 07:21:42.867458 13770 master.cpp:3590] Dropping reconciliation of task develop-ci_task_1.f466bf1e-2ee9-11e5-b2f3-005056972890 for framework 20150719-163009-346470410-17100-5768-0000 (marathon) at scheduler-7ee448b9-93c5-472f-bcef-e9639fedc442@10.*.*.*:53254 because there are transitional slaves
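The ZooKeeper workaround mentioned above (removing Marathon's state and switching the leader) can be sketched roughly as follows. This is a hypothetical, destructive sketch: it assumes Marathon's default ZK node `/marathon` and a local ZooKeeper, so adjust for your `--zk` setting, and only run the resulting command with Marathon stopped, since it discards all of Marathon's persisted state.

```python
# Hypothetical sketch of the "wipe Marathon state in ZooKeeper" workaround.
# Assumes the default ZK node /marathon; 'rmr' recursively deletes a znode
# in the ZooKeeper CLI. Destructive: run only with Marathon stopped.

def zk_delete_command(zk_host="localhost:2181", marathon_node="/marathon"):
    """Build the zkCli.sh invocation that recursively removes Marathon state."""
    return ["zkCli.sh", "-server", zk_host, "rmr", marathon_node]

if __name__ == "__main__":
    # Print the command instead of executing it, so nothing is deleted by accident.
    print(" ".join(zk_delete_command()))
```

After wiping the state, forcing a leader re-election (e.g. restarting the leading Marathon) makes the new leader start from the cleaned state.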
I'm running Marathon 0.9.0 on Mesos 0.22.1 as well.
It should be possible to delete the app in 0.10.0 - see issue https://github.com/mesosphere/marathon/issues/1853
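For reference, deleting an app goes through Marathon's REST API (`DELETE /v2/apps/{appId}`). Below is a minimal sketch; the host `http://localhost:8080` and app id `myapp` are placeholders, and depending on your Marathon version a `?force=true` query parameter may be needed to override a stuck deployment.

```python
# Sketch of deleting a stuck app via Marathon's REST API.
# Host and app id are placeholder assumptions.
import urllib.request

def delete_app_request(base_url, app_id):
    """Build (but do not send) a DELETE /v2/apps/{appId} request."""
    url = f"{base_url}/v2/apps/{app_id}"
    return urllib.request.Request(url, method="DELETE")

req = delete_app_request("http://localhost:8080", "myapp")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```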
I'm still seeing this with Marathon 0.13.0 and Mesos 0.26.0.
Yep, quite annoying. As far as I have understood, Marathon should make sure stuff is running and automatically start the applications/tasks in case of failures. This should be like a top priority defect to fix.
Yeah, I even added a health check hoping that it would force Marathon to restart the task, but it's futile. I also found that even if I delete and recreate the task, it still doesn't start. I have to actually delete /tmp/mesos for it to work.
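For context on the /tmp/mesos step: that is the Mesos agent's default `work_dir`, so deleting it throws away all checkpointed agent state (task checkpoints, executor sandboxes). The sketch below is a hedged illustration of that cleanup, not a recommended procedure; the path is an assumption (check your agent's `--work_dir` flag) and it defaults to a dry run.

```python
# Hedged sketch of the /tmp/mesos cleanup described above. /tmp/mesos is the
# Mesos agent's default work_dir; wiping it discards all checkpointed agent
# state, so only do this with the agent stopped.
import shutil
from pathlib import Path

def wipe_agent_state(work_dir="/tmp/mesos", dry_run=True):
    path = Path(work_dir)
    if dry_run or not path.exists():
        return f"would remove {path}"
    shutil.rmtree(path)  # destructive: removes sandboxes and checkpoints
    return f"removed {path}"

print(wipe_agent_state())  # dry run by default; pass dry_run=False to delete
```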
Same here with Mesos 0.27.0 & Marathon 0.15.1.
Same here, but I noticed that the tasks are eventually started again ~10 minutes after starting Mesos & Marathon.
Logs for one of the tasks:
Feb 20 11:33:17 mesos marathon[1033]: ... INFO Send kill request for task <task-info>
Feb 20 11:33:17 mesos mesos-master[1258] ... master.cpp:3445] Cannot kill task <task-info> because it is unknown; performing reconciliation
Feb 20 11:33:17 mesos mesos-master[1258] ... master.cpp:4713] Performing explicit task state reconciliation for 1 tasks of framework <marathon-info>
Feb 20 11:33:17 mesos mesos-master[1258] ... master.cpp:4786] Dropping reconciliation of task <task-info> because there are transitional slaves
Eventually reconciliation seems to succeed and Marathon starts the task (it also starts tasks with no health checks).
Still seeing this with Marathon v1.1.1 and Mesos v0.28.2.
Edit: updated to Mesos v1.0.0 and the same issue is happening.
Has anyone come across a feasible production workaround for this yet?
Same issue. Marathon v1.1.1 & Mesos v0.28.2
At the very least, Marathon should honor the health checks, but those are ignored after a cluster reboot.
Can confirm the same issue with Mesos v1.0.1 and Marathon v1.3.2.
After a restart, tasks are sometimes shown as running and sometimes as deploying. Sometimes restarting the task helps; sometimes it doesn't affect the task at all. The only thing that worked for me was to scale the app to 0 and wait 10+ minutes. After that you can scale the app back to the previous value and it all works.
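The scale-to-0-and-back workaround can be scripted against Marathon's REST API, which accepts `PUT /v2/apps/{appId}` with an `"instances"` field. A minimal sketch, assuming Marathon at `http://localhost:8080` and a placeholder app id:

```python
# Sketch of the scale-down/scale-up workaround via Marathon's REST API.
# Host and app id are placeholder assumptions.
import json
import urllib.request

def scale_request(base_url, app_id, instances):
    """Build (but do not send) a PUT /v2/apps/{appId} scaling request."""
    body = json.dumps({"instances": instances}).encode()
    return urllib.request.Request(
        f"{base_url}/v2/apps/{app_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

down = scale_request("http://localhost:8080", "nginx", 0)  # scale to 0
up = scale_request("http://localhost:8080", "nginx", 2)    # later, scale back
print(down.get_method(), down.full_url)
# urllib.request.urlopen(down) would actually send the request
```

Between the two requests you would wait for reconciliation to settle (10+ minutes in the report above) before scaling back up.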
While a task is stuck in the running or deploying state (when the task is not actually running at all), the Mesos master logs are full of these messages:
W1009 16:18:47.433548 2752 master.cpp:4117] Cannot kill task nginx.31d11a85-8df5-11e6-99d1-ea4a079326ae of framework 7e94f16a-a6f9-4aff-9917-7c184c9e7ebf-0000 (marathon) at scheduler-371ee16a-ab66-4679-956f-c6eb5121b169@127.0.1.1:44279 because it is unknown; performing reconciliation
I1009 16:18:47.433697 2752 master.cpp:5463] Performing explicit task state reconciliation for 1 tasks of framework 7e94f16a-a6f9-4aff-9917-7c184c9e7ebf-0000 (marathon) at scheduler-371ee16a-ab66-4679-956f-c6eb5121b169@127.0.1.1:44279
W1009 16:18:47.437860 2750 master.cpp:4117] Cannot kill task mesos-dns.2bd80534-8df5-11e6-99d1-ea4a079326ae of framework 7e94f16a-a6f9-4aff-9917-7c184c9e7ebf-0000 (marathon) at scheduler-371ee16a-ab66-4679-956f-c6eb5121b169@127.0.1.1:44279 because it is unknown; performing reconciliation
I1009 16:18:47.438011 2750 master.cpp:5463] Performing explicit task state reconciliation for 1 tasks of framework 7e94f16a-a6f9-4aff-9917-7c184c9e7ebf-0000 (marathon) at scheduler-371ee16a-ab66-4679-956f-c6eb5121b169@127.0.1.1:44279
I would appreciate it if someone could explain what those messages mean.
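Some context on these messages: "transitional slaves" refers to agents that have not yet re-registered after a master (re)start, and the master drops reconciliation answers until the agent re-registration timeout (the `--slave_reregister_timeout` master flag, 10 minutes by default) expires, which likely explains the ~10-minute delay reported earlier in this thread. The "performing reconciliation" lines refer to explicit task-state reconciliation, which a scheduler requests with a message like the sketch below (shown as the Mesos v1 scheduler HTTP API `RECONCILE` call; the framework and task ids are copied from the logs above purely as placeholders).

```python
# Sketch of an explicit-reconciliation message in the Mesos v1 scheduler
# HTTP API. The framework/task ids are placeholders taken from the logs.
import json

def reconcile_call(framework_id, task_ids):
    """Build the RECONCILE call body asking the master for task statuses."""
    return {
        "framework_id": {"value": framework_id},
        "type": "RECONCILE",
        "reconcile": {
            "tasks": [{"task_id": {"value": t}} for t in task_ids],
        },
    }

call = reconcile_call(
    "7e94f16a-a6f9-4aff-9917-7c184c9e7ebf-0000",
    ["nginx.31d11a85-8df5-11e6-99d1-ea4a079326ae"],
)
print(json.dumps(call, indent=2))
```

The master replies with a status update per listed task; while agents are still transitional, those replies are simply dropped, as the "Dropping reconciliation" log lines show.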
any progress?
I quite often get the feeling the Marathon UI is completely out of sync with what Mesos is reporting :(
Related: #616 Mesos and Marathon out of sync: orphaned tasks and "ghost" tasks
this issue is super annoying
Note: This issue has been migrated to https://jira.mesosphere.com/browse/MARATHON-2694. For more information see https://groups.google.com/forum/#!topic/marathon-framework/khtvf-ifnp8.
This is a problem I have noticed for a while.
I currently have a cluster with 1 combined master/slave and 2 slaves running in Vagrant. I often find that if I reboot the cluster, a lot of tasks are shown as running and started in the Marathon web UI, but these tasks are not actually running, and checking the Mesos web UI confirms that.
The only way to fix the problem is to go into the Marathon web UI and manually restart all the apps.
Sometimes (in rare cases) the tasks will start and run after the restart.
Here's a mesos-dns task that exhibits this behavior:
I would love to provide more details to see if this can be debugged. Where does Marathon store its logs and diagnostics to debug this problem?
Marathon is 0.8.2, installed via Mesosphere's repos, running on Ubuntu 14.10 64-bit.