Open belo4ya opened 3 weeks ago
Placing the RetryCount
value in Labels might be a priority since some log collectors may not support extracting data from Pod Annotations. For example, the fluent-plugin-kubernetes_metadata_filter does not seem support Pod annotations.
Hello, I'm curious about one thing. What is the purpose of adding a retry count label within the container itself? Typically, containerized applications are stateless, and there should be no distinction between the initial startup and a restart after an unexpected crash. For example, in the case of a job task for training a dataset, it wouldn't usually matter how many retries have occurred after an unexpected restart. Could you provide more context or clarify the specific use case where having a retry count label inside the container would be beneficial?
Hi, do you mean you want to collect logs by pod's restart index to distinguish different pods or containers' logs?
@googs1025, indeed, I cannot think of a weighty example where it is impossible to do without an environment variable with a restart number (the client code can always save the state in an external system or persistent storage) 😅. Perhaps only some kind of convenience for the user (compromise - convenience - antipattern)
do you mean you want to collect logs by pod's restart index
@Monokaix, yes, in order to distinguish the logs of different Job launches (restarts) (according to the maxRetry functionality)
do you mean you want to collect logs by pod's restart index
@Monokaix, yes, in order to distinguish the logs of different Job launches (restarts) (according to the maxRetry functionality)
As far as I know,when a container restarts,a new indexed log file will be generated by kubelet,can this meet your requirement?
Yes, I can obtain the Pod ID or Container ID from /var/log/containers/<pod-name>_<namespace>_<container-name-container-id>.log
or /var/log/pods/<namespace>_<pod_name>_<pod_id>/<container_name>/
However, I still cannot correlate the running Pod ID with the RetryCount
. For this, I need to have a relationship: jobs(id) -> pods(id, job_id, retry_count)
Can I obtain this relationship without information about which restart the pod belongs to (e.g., in the form of a label on the Pod)? (in the log collector context)
Yes, I can obtain the Pod ID or Container ID from
/var/log/containers/<pod-name>_<namespace>_<container-name-container-id>.log
or/var/log/pods/<namespace>_<pod_name>_<pod_id>/<container_name>/
However, I still cannot correlate the running Pod ID with the
RetryCount
. For this, I need to have a relationship:jobs(id) -> pods(id, job_id, retry_count)
Can I obtain this relationship without information about which restart the pod belongs to (e.g., in the form of a label on the Pod)? (in the log collector context)
When a container died, kubelet will restart it if pod's restartPolicy is always, and will generate a new log file with suffix index, like
/var/log/pods/volcano-system_volcano-scheduler-754d7c6b88-mpqdl_d4ca8ea8-372e-4c68-b826-dc418c73b232/volcano-scheduler/1.log
, you can inspect the container and find the logPath.
Yes, but this is not about the usual pod restarts with restartPolicy: Always
. It is about the restarts of Job using the maxRetry + policy: RestartJob
mechanism. When pods are recreated by the Volcano controller, the Job's RetryCount
increases. Moreover, the pod id alone does not contain information about the restart sequence number. It is difficult to determine exactly (without retrospective data) which pod id corresponds to which restart number (RetryCount
).
Yes, but this is not about the usual pod restarts with
restartPolicy: Always
. It is about the restarts of Job using themaxRetry + policy: RestartJob
mechanism. When pods are recreated by the Volcano controller, the Job'sRetryCount
increases. Moreover, the pod id alone does not contain information about the restart sequence number. It is difficult to determine exactly (without retrospective data) which pod id corresponds to which restart number (RetryCount
).
Seems that you wanna a pod level record of a job, if that's true, how about using downward API or pod label/annotation?plugin way seems a little heavy here: )
Thank you, the downward API is exactly what I needed!
env:
- name: VC_JOB_RETRY_COUNT
valueFrom:
fieldRef:
fieldPath: metadata.annotations['volcano.sh/retry-count']
The only thing left is to add an annotation volcano.sh/retry-count
to the pod during its creation. Could this be added to the Volcano code when creating pods (for example, here pkg/controllers/job/job_controller_util.go::createJobPod)?
Thank you, the downward API is exactly what I needed!
env: - name: VC_JOB_RETRY_COUNT valueFrom: fieldRef: fieldPath: metadata.annotations['volcano.sh/retry-count']
The only thing left is to add an annotation
volcano.sh/retry-count
to the pod during its creation. Could this be added to the Volcano code when creating pods (for example, here pkg/controllers/job/job_controller_util.go::createJobPod)?
Yeah I think it's reasonable: )
/assign
What would you like to be added:
Job.Status.RetryCount
in Pod's container environment variables (e.g.,VC_RETRY_COUNT: 2
)Job.Status.RetryCount
in Pod's annotations (and labels) (e.g.,volcano.sh/retry-count=2
)Why is this needed:
Log collection. Log collectors such as Vector, FluentBit, Fluentd, etc., can enrich log entries with additional data, particularly Pod labels and annotations.
Having the current
RetryCount
value in the Pod's labels would be convenient, as it can be extracted and stored by the log collector. This allows for the construction of a unique identifier for each individual run to filter logs:run_id = (job_id, retry_count)
.An environment variable with the
RetryCount
value can simplify writing idempotent code:It can also be used in custom logs for subsequent log filtering and analysis of differences between retries.
Implementation:
This could be an extension of the existing
controllers/job/plugins/env
plugin, a new plugincontrollers/job/plugins/retrycount
, or directly implemented when creating Pods in the controller. Which option would be preferable?Example of the implementation of the new
retryCount
plugin: