mesosphere / kubernetes-mesos

A Kubernetes Framework for Apache Mesos

Why are Mesos slaves in NotReady state in Kubernetes? #732

Open sanjana-bhat opened 8 years ago

sanjana-bhat commented 8 years ago

@jdef I'm running Kubernetes version 1.1.0. The executors on some of the Mesos slaves receive the signal to launch tasks but don't launch them, because k.isDone() is true and the function returns early: https://github.com/kubernetes/kubernetes/blob/v1.1.0/contrib/mesos/pkg/executor/executor.go#L294. The same holds for kill tasks. All the tasks show up as STAGING, and after a Mesos slave restart, Kubernetes shows these nodes as NotReady. The logs don't give any indication of why the terminate channel is receiving values. Do you know what could have caused this?
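
For reference, the check I'm talking about is roughly the pattern below (a minimal sketch in Go, not the actual executor source; the type and method names are just illustrative):

package main

import "fmt"

// executor is a stripped-down stand-in for the kubelet-executor; the real
// one carries much more state. Only the terminate channel matters here.
type executor struct {
	terminate chan struct{} // closed when the executor considers itself done
}

// isDone reports whether the terminate channel has been closed.
func (e *executor) isDone() bool {
	select {
	case <-e.terminate:
		return true
	default:
		return false
	}
}

// launchTask mirrors the early return: once isDone() is true, launch (and
// kill) requests are silently dropped, so tasks never leave STAGING.
func (e *executor) launchTask(taskID string) {
	if e.isDone() {
		fmt.Printf("ignoring launch of %s: executor is done\n", taskID)
		return
	}
	fmt.Printf("launching %s\n", taskID)
}

func main() {
	e := &executor{terminate: make(chan struct{})}
	e.launchTask("pod.aaaa") // launches normally

	close(e.terminate)       // whatever closes terminate puts the executor in this state
	e.launchTask("pod.bbbb") // dropped; the task would stay in STAGING
}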

Thanks much!

jdef commented 8 years ago

We're not yet cutting kubernetes-mesos releases from kubernetes/kubernetes. Please try building from our latest stable release https://github.com/mesosphere/kubernetes/tree/v0.7.1-v1.1.3.

sanjana-bhat commented 8 years ago

Have these issues been fixed in the above release? I notice that the executor receives the launch task signal but the tasks don't get launched. The tasks show as staging. When does this scenario occur?

jdef commented 8 years ago

they likely have. the current stable release in mesosphere/kubernetes is both heavily patched and tested. k8sm built from release-1.1 on kubernetes/kubernetes is neither. please test the stable release branch to see if it resolves your problem.

sanjana-bhat commented 8 years ago

@jdef, thanks! I will test the stable release branch. The problem is that I couldn't reproduce the issue I reported on a fresh setup with the same 1.1 release. Once I killed the executors on these slaves, the tasks launched fine. Was this a known issue where the executor went bad? Even restarting the Mesos slave wouldn't fix it; I had to explicitly kill the executor.

jdef commented 8 years ago

I don't remember seeing that particular issue before. That said, there are quite a few code changes between the release 1.1 branch and our stable branch; some of them come to mind, but they may not be related.

do you have log files from the buggy scenario that you're describing?

sanjana-bhat commented 8 years ago

There are no errors in the log files. This is what the executor log shows when a pod is scheduled:

I0112 17:54:01.549210   17434 executor.go:307] Executor driver runTask
I0112 17:54:01.549251   17434 executor.go:321] Executor asked to run task '&TaskID{Value:*pod.7654c050-b955-11e5-b4d0-fa163edb6133,XXX_unrecognized:[],}'
W0112 17:59:47.388312   17434 reflector.go:224] pkg/kubelet/kubelet.go:205: watch of *api.Service ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [11660943/11654930]) [11661942]
W0112 18:46:53.964001   17434 reflector.go:224] pkg/kubelet/kubelet.go:205: watch of *api.Service ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [11667034/11661946]) [11668033]

It doesn't seem to have actually launched the task and the pod is stuck in STAGING state.
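
For anyone else hitting this, one way to list the stuck tasks is to query the Mesos master's state endpoint, roughly like the sketch below (the /master/state.json URL and the field names are assumptions about the Mesos version in use, not something taken from the k8sm code):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// masterState is the small slice of the master's state output we need.
// Field names assume the state.json layout of the Mesos release we run.
type masterState struct {
	Frameworks []struct {
		Name  string `json:"name"`
		Tasks []struct {
			ID      string `json:"id"`
			Name    string `json:"name"`
			State   string `json:"state"`
			SlaveID string `json:"slave_id"`
		} `json:"tasks"`
	} `json:"frameworks"`
}

func main() {
	// mesos-master:5050 is a placeholder for the real master address.
	resp, err := http.Get("http://mesos-master:5050/master/state.json")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var state masterState
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		log.Fatal(err)
	}

	// Print every task still sitting in TASK_STAGING, per framework.
	for _, fw := range state.Frameworks {
		for _, t := range fw.Tasks {
			if t.State == "TASK_STAGING" {
				fmt.Printf("%s: task %s (%s) staging on slave %s\n",
					fw.Name, t.ID, t.Name, t.SlaveID)
			}
		}
	}
}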

jdef commented 8 years ago

any luck with testing on the stable branch?


jdef commented 8 years ago

FWIW there's a new v0.7.2-v1.1.5 tag on the release-v0.7-v1.1 branch. if this isn't working for you yet, you might try testing again with the updated code. the mechanism by which the kubelet-executor processes pod updates has been completely overhauled and should be less buggy.

sanjana-bhat commented 8 years ago

@jdef I will test with this new tag. Thanks!