target / pod-reaper

Rule-based pod-killing Kubernetes controller

Pod status rule is misleading #65

Open jdharmon opened 3 years ago

jdharmon commented 3 years ago

kubectl shows the pod phase as the status, e.g. Running, Succeeded, Failed. The pod status rule, however, checks the optional reason field, e.g. Evicted. This is confusing, and it also means you cannot create a POD_STATUS=Running rule. Can we change the pod status rule to use phase? Should the current reason-based rule be renamed and/or use a different env variable, e.g. POD_STATUS_REASON?

Relevant documentation:

    // The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle.
    // The conditions array, the reason and message fields, and the individual container status
    // arrays contain more detail about the pod's status.
    // There are five possible phase values:
    //
    // Pending: The pod has been accepted by the Kubernetes system, but one or more of the
    // container images has not been created. This includes time before being scheduled as
    // well as time spent downloading images over the network, which could take a while.
    // Running: The pod has been bound to a node, and all of the containers have been created.
    // At least one container is still running, or is in the process of starting or restarting.
    // Succeeded: All containers in the pod have terminated in success, and will not be restarted.
    // Failed: All containers in the pod have terminated, and at least one container has
    // terminated in failure. The container either exited with non-zero status or was terminated
    // by the system.
    // Unknown: For some reason the state of the pod could not be obtained, typically due to an
    // error in communicating with the host of the pod.
    //
    // More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-phase
    // +optional
    Phase PodPhase `json:"phase,omitempty" protobuf:"bytes,1,opt,name=phase,casttype=PodPhase"`
    // A brief CamelCase message indicating details about why the pod is in this state.
    // e.g. 'Evicted'
    // +optional
    Reason string `json:"reason,omitempty" protobuf:"bytes,4,opt,name=reason"`
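
To make the distinction concrete, here is a minimal sketch (not pod-reaper's actual code) that lists pods and prints both fields side by side, assuming client-go with an in-cluster config. For an evicted pod you would typically see phase=Failed alongside reason=Evicted, while kubectl's STATUS column shows Evicted.

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        // In-cluster config; swap for a kubeconfig-based config if running outside the cluster.
        config, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        clientset, err := kubernetes.NewForConfig(config)
        if err != nil {
            panic(err)
        }

        // List pods in all namespaces and print the two fields discussed above.
        pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
        if err != nil {
            panic(err)
        }
        for _, pod := range pods.Items {
            // Phase is the high-level lifecycle value (Pending/Running/Succeeded/Failed/Unknown);
            // Reason is the optional CamelCase detail, e.g. "Evicted", that the current rule matches.
            fmt.Printf("%s\tphase=%s\treason=%s\n", pod.Name, pod.Status.Phase, pod.Status.Reason)
        }
    }
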
brianberzins commented 3 years ago

I was wondering if/when this would ever come back to bite me... https://github.com/target/pod-reaper/issues/37 Long story short: the "status" listed by kubectl isn't either the pod phase or the reason... it's actually something different, something not addressed directly in the API from what I saw.

I suspect that Running isn't actually a "pod status" but a "container status". I 100% agree, it's confusing as hell!

// ContainerState holds a possible state of container.
// Only one of its members may be specified.
// If none of them is specified, the default one is ContainerStateWaiting.
type ContainerState struct {
    // Details about a waiting container
    // +optional
    Waiting *ContainerStateWaiting `json:"waiting,omitempty" protobuf:"bytes,1,opt,name=waiting"`
    // Details about a running container
    // +optional
    Running *ContainerStateRunning `json:"running,omitempty" protobuf:"bytes,2,opt,name=running"`
    // Details about a terminated container
    // +optional
    Terminated *ContainerStateTerminated `json:"terminated,omitempty" protobuf:"bytes,3,opt,name=terminated"`
}

There isn't anything that detects the "Running" container state right now, and I think changing the existing "container status" rule would probably result in unintended behavior: it would flag the pod for reaping if ANY of the containers in it were "Running", just like it currently matches if ANY container in the pod is "terminated" or "waiting"... ugh.
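
For illustration only (the function name here is hypothetical, not one of pod-reaper's actual identifiers), the ANY-match semantics described above look roughly like this:

    package main

    import corev1 "k8s.io/api/core/v1"

    // anyContainerInState reports whether ANY container in the pod is currently in the
    // named state -- so a pod with one running and one terminated container would match
    // both "Running" and "Terminated".
    func anyContainerInState(pod corev1.Pod, state string) bool {
        for _, cs := range pod.Status.ContainerStatuses {
            switch state {
            case "Running":
                if cs.State.Running != nil {
                    return true
                }
            case "Waiting":
                if cs.State.Waiting != nil {
                    return true
                }
            case "Terminated":
                if cs.State.Terminated != nil {
                    return true
                }
            }
        }
        return false
    }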

jdharmon commented 3 years ago

Perhaps the way forward is (see the sketch after this list):

  1. Rename current pod_status rule to pod_status_reason, and change the environment variable to POD_STATUS_REASON
  2. Add pod_status_phase rule with environment variable POD_STATUS_PHASE
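
A rough sketch of how that split could look, assuming the environment variable names proposed above (illustrative only, not existing pod-reaper code):

    package rules

    import (
        "os"
        "strings"

        corev1 "k8s.io/api/core/v1"
    )

    // podStatusReason matches against pod.Status.Reason (e.g. "Evicted"),
    // i.e. what the current rule actually checks.
    func podStatusReason(pod corev1.Pod) bool {
        for _, r := range strings.Split(os.Getenv("POD_STATUS_REASON"), ",") {
            if r != "" && pod.Status.Reason == r {
                return true
            }
        }
        return false
    }

    // podStatusPhase matches against pod.Status.Phase (e.g. "Running", "Succeeded"),
    // i.e. what kubectl-style expectations suggest.
    func podStatusPhase(pod corev1.Pod) bool {
        for _, p := range strings.Split(os.Getenv("POD_STATUS_PHASE"), ",") {
            if p != "" && string(pod.Status.Phase) == p {
                return true
            }
        }
        return false
    }
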
gburiola commented 3 months ago

+1 to the suggestion from @jdharmon above.

We spent quite some time trying to understand why pod-reaper was not working with POD_STATUSES=Succeeded and POD_STATUSES=Completed, only to realize it was matching pod.reason and not pod.phase.