volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0
3.91k stars · 901 forks

RetryCount in pod's containers ENV vars and pod's Labels #3528

Status: Open · belo4ya opened this issue 3 weeks ago

belo4ya commented 3 weeks ago

What would you like to be added:

Expose the Job's current RetryCount value as an environment variable in the Pod's containers and as a label/annotation on the Pod.

Why is this needed:

Log collection. Log collectors such as Vector, FluentBit, Fluentd, etc., can enrich log entries with additional data, particularly Pod labels and annotations.

Having the current RetryCount value in the Pod's labels would be convenient, as the log collector can extract and store it. This makes it possible to construct a unique identifier for each individual run and to filter logs by it: run_id = (job_id, retry_count).
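To illustrate, a log pipeline could combine the two label values into a single run identifier. This is a sketch with a hypothetical helper: volcano.sh/job-name is a label Volcano already sets on Job pods, while volcano.sh/retry-count is the label proposed in this issue.

```go
package main

import "fmt"

// runID builds a unique identifier for one Job run from the values a log
// collector would read off the Pod's labels. The retry-count label is the
// one proposed in this issue and does not exist in Volcano yet.
func runID(labels map[string]string) string {
	return fmt.Sprintf("%s#%s", labels["volcano.sh/job-name"], labels["volcano.sh/retry-count"])
}

func main() {
	labels := map[string]string{
		"volcano.sh/job-name":    "train-job",
		"volcano.sh/retry-count": "2",
	}
	fmt.Println(runID(labels)) // train-job#2
}
```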


An environment variable with the RetryCount value can simplify writing idempotent code:

import os

# Retry count 0 means this is the first attempt: run one-time initialization.
if int(os.environ["VC_RETRY_COUNT"]) == 0:
    init_once()
else:
    ...

It can also be used in custom logs for subsequent log filtering and analysis of differences between retries.


Implementation:

This could be an extension of the existing controllers/job/plugins/env plugin, a new plugin controllers/job/plugins/retrycount, or implemented directly in the controller when creating Pods. Which option would be preferable?

An example implementation of the proposed retryCount plugin:

package retrycount

import (
    "strconv"

    v1 "k8s.io/api/core/v1"
    batch "volcano.sh/apis/pkg/apis/batch/v1alpha1"
    pluginsinterface "volcano.sh/volcano/pkg/controllers/job/plugins/interface"
)

const (
    PluginName = "retryCount"

    EnvRetryCount   = "VC_RETRY_COUNT"
    LabelRetryCount = "volcano.sh/retry-count"
)

func New(client pluginsinterface.PluginClientset, arguments []string) pluginsinterface.PluginInterface {
    return &retryCountPlugin{
        pluginArguments: arguments,
        Clientset:       client,
    }
}

type retryCountPlugin struct {
    pluginArguments []string
    Clientset       pluginsinterface.PluginClientset
}

func (p *retryCountPlugin) Name() string {
    return PluginName
}

func (p *retryCountPlugin) OnPodCreate(pod *v1.Pod, job *batch.Job) error {
    retryCount := strconv.Itoa(int(job.Status.RetryCount))

    if pod.Labels == nil {
        pod.Labels = map[string]string{}
    }
    pod.Labels[LabelRetryCount] = retryCount

    if pod.Annotations == nil {
        pod.Annotations = map[string]string{}
    }
    pod.Annotations[LabelRetryCount] = retryCount

    envRetryCount := v1.EnvVar{Name: EnvRetryCount, Value: retryCount}
    for i := range pod.Spec.InitContainers {
        pod.Spec.InitContainers[i].Env = append(pod.Spec.InitContainers[i].Env, envRetryCount)
    }
    for i := range pod.Spec.Containers {
        pod.Spec.Containers[i].Env = append(pod.Spec.Containers[i].Env, envRetryCount)
    }

    return nil
}

func (p *retryCountPlugin) OnJobAdd(job *batch.Job) error {
    job.Status.ControlledResources["plugin-"+p.Name()] = p.Name()
    return nil
}

func (p *retryCountPlugin) OnJobDelete(job *batch.Job) error {
    delete(job.Status.ControlledResources, "plugin-"+p.Name())
    return nil
}

func (p *retryCountPlugin) OnJobUpdate(_ *batch.Job) error {
    return nil
}
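If implemented as a plugin, it would presumably be enabled per Job through the existing spec.plugins mechanism, like the env and svc plugins are today. A sketch (the retryCount plugin name is the one proposed above and does not exist yet):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: example-job
spec:
  plugins:
    # Hypothetical: the plugin proposed in this issue; takes no arguments.
    retryCount: []
```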
belo4ya commented 3 weeks ago

Placing the RetryCount value in Labels might be a priority, since some log collectors may not support extracting data from Pod Annotations. For example, the fluent-plugin-kubernetes_metadata_filter does not seem to support Pod annotations.

googs1025 commented 3 weeks ago

Hello, I'm curious about one thing. What is the purpose of adding a retry count label within the container itself? Typically, containerized applications are stateless, and there should be no distinction between the initial startup and a restart after an unexpected crash. For example, in the case of a job task for training a dataset, it wouldn't usually matter how many retries have occurred after an unexpected restart. Could you provide more context or clarify the specific use case where having a retry count label inside the container would be beneficial?

Monokaix commented 2 weeks ago

Hi, do you mean you want to collect logs by pod's restart index to distinguish different pods or containers' logs?

belo4ya commented 2 weeks ago

@googs1025, indeed, I cannot think of a compelling example where an environment variable with the restart number is strictly necessary (the client code can always save its state in an external system or persistent storage) 😅. Perhaps it is only a convenience for the user (a trade-off between convenience and antipattern).

belo4ya commented 2 weeks ago

> do you mean you want to collect logs by pod's restart index

@Monokaix, yes, in order to distinguish the logs of different Job launches (restarts) under the maxRetry functionality.

Monokaix commented 2 weeks ago

> do you mean you want to collect logs by pod's restart index
>
> @Monokaix, yes, in order to distinguish the logs of different Job launches (restarts) under the maxRetry functionality.

As far as I know, when a container restarts, a new indexed log file will be generated by kubelet. Can this meet your requirement?

belo4ya commented 2 weeks ago

Yes, I can obtain the Pod ID or Container ID from /var/log/containers/<pod-name>_<namespace>_<container-name>-<container-id>.log or /var/log/pods/<namespace>_<pod-name>_<pod-id>/<container-name>/.

However, I still cannot correlate the running Pod ID with the RetryCount. For that, I need the relationship: jobs(id) -> pods(id, job_id, retry_count).

Can I obtain this relationship (in the log collector context) if the Pod carries no information about which restart it belongs to (e.g., in the form of a label on the Pod)?

Monokaix commented 2 weeks ago

> Yes, I can obtain the Pod ID or Container ID from /var/log/containers/<pod-name>_<namespace>_<container-name>-<container-id>.log or /var/log/pods/<namespace>_<pod-name>_<pod-id>/<container-name>/.
>
> However, I still cannot correlate the running Pod ID with the RetryCount. For that, I need the relationship: jobs(id) -> pods(id, job_id, retry_count).
>
> Can I obtain this relationship (in the log collector context) if the Pod carries no information about which restart it belongs to (e.g., in the form of a label on the Pod)?

When a container dies, kubelet will restart it if the pod's restartPolicy is Always, and will generate a new log file with a suffix index, like /var/log/pods/volcano-system_volcano-scheduler-754d7c6b88-mpqdl_d4ca8ea8-372e-4c68-b826-dc418c73b232/volcano-scheduler/1.log. You can inspect the container and find the logPath.

belo4ya commented 2 weeks ago

Yes, but this is not about the usual pod restarts with restartPolicy: Always. It is about Job restarts via the maxRetry + policy: RestartJob mechanism. When pods are recreated by the Volcano controller, the Job's RetryCount increases. Moreover, the pod ID alone does not contain information about the restart sequence number, so it is difficult to determine (without retrospective data) which pod ID corresponds to which restart number (RetryCount).

Monokaix commented 2 weeks ago

> Yes, but this is not about the usual pod restarts with restartPolicy: Always. It is about Job restarts via the maxRetry + policy: RestartJob mechanism. When pods are recreated by the Volcano controller, the Job's RetryCount increases. Moreover, the pod ID alone does not contain information about the restart sequence number, so it is difficult to determine (without retrospective data) which pod ID corresponds to which restart number (RetryCount).

It seems that you want a pod-level record of a job. If that's true, how about using the downward API or a pod label/annotation? The plugin way seems a little heavy here :)

belo4ya commented 2 weeks ago

Thank you, the downward API is exactly what I needed!

      env:
        - name: VC_JOB_RETRY_COUNT
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['volcano.sh/retry-count']

The only thing left is to add an annotation volcano.sh/retry-count to the pod during its creation. Could this be added to the Volcano code when creating pods (for example, here pkg/controllers/job/job_controller_util.go::createJobPod)?
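A minimal sketch of what that change could look like. The setRetryCountAnnotation helper and the plain-map stand-in for the Pod's annotations are illustrative, not Volcano's actual signatures; in createJobPod this would operate on the *v1.Pod being built, using job.Status.RetryCount.

```go
package main

import (
	"fmt"
	"strconv"
)

// The annotation key proposed in this issue.
const AnnotationRetryCount = "volcano.sh/retry-count"

// setRetryCountAnnotation records the Job's current RetryCount on the Pod's
// annotations so the downward API can expose it to containers.
func setRetryCountAnnotation(annotations map[string]string, retryCount int32) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[AnnotationRetryCount] = strconv.Itoa(int(retryCount))
	return annotations
}

func main() {
	podAnnotations := setRetryCountAnnotation(nil, 2)
	fmt.Println(podAnnotations[AnnotationRetryCount]) // 2
}
```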

Monokaix commented 1 week ago

> Thank you, the downward API is exactly what I needed!
>
>       env:
>         - name: VC_JOB_RETRY_COUNT
>           valueFrom:
>             fieldRef:
>               fieldPath: metadata.annotations['volcano.sh/retry-count']
>
> The only thing left is to add an annotation volcano.sh/retry-count to the pod during its creation. Could this be added to the Volcano code when creating pods (for example, here pkg/controllers/job/job_controller_util.go::createJobPod)?

Yeah, I think it's reasonable :)

belo4ya commented 1 week ago

/assign