Attach K8s Workload Events to App Trace

thisthat commented 4 months ago

Goal

K8s-generated events about the Application deployment should be attached to the trace generated by Keptn.

Details

Keptn provides a unified trace that describes what happens in your K8s cluster when users deploy applications on it. If something goes wrong, Keptn provides little information beyond the trace terminating in an error state. It would be better to also attach the K8s Event information to the failed trace, so that users can debug and find the root cause in a single source of truth. OTel already supports Events, which makes it a good fit for us.

Keptn starts the KeptnAppVersion span before any K8s controller picks up a Workload CR and ends it after the K8s controllers have finished handling that CR, so this Span covers the whole time window and is the natural place to include all events.

Acceptance Criteria

K8s Events of workloads that are part of a KeptnApp delivery are attached to the KeptnAppVersion Span.
The OTel Event timestamp matches the K8s Event timestamp (firstTimestamp field).
The OTel Event follows the K8s SemConv to describe the workload info extracted from the K8s Event (involvedObject and metadata fields).
The K8s Event message, reason, and type are added as attributes of the OTel Event.
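
To make the criteria above concrete, here is a minimal sketch of the mapping from a K8s Event (as a plain dict, the way it looks in JSON) to the fields of an OTel span event. It is an illustration only, not Keptn's implementation. The function name `to_span_event`, the use of `reason` as the event name, and the `k8s.event.*` attribute keys are assumptions made for this example. The `k8s.<kind>.name` and `k8s.namespace.name` keys are modelled on the OTel K8s semantic conventions.

```python
from datetime import datetime

# Attribute key for the workload name, per involvedObject kind.
# Modelled on the OTel K8s semantic conventions (k8s.<kind>.name).
_NAME_KEYS = {
    "Pod": "k8s.pod.name",
    "Deployment": "k8s.deployment.name",
    "ReplicaSet": "k8s.replicaset.name",
    "StatefulSet": "k8s.statefulset.name",
    "DaemonSet": "k8s.daemonset.name",
}


def to_span_event(k8s_event: dict) -> dict:
    """Translate a K8s Event dict into name/timestamp/attributes for a span event."""
    obj = k8s_event["involvedObject"]

    # firstTimestamp is RFC 3339 with second precision, e.g. "2023-01-01T00:00:00Z".
    first_seen = datetime.fromisoformat(
        k8s_event["firstTimestamp"].replace("Z", "+00:00")
    )
    timestamp_ns = int(first_seen.timestamp()) * 1_000_000_000

    attributes = {
        "k8s.namespace.name": obj.get("namespace", ""),
        # Hypothetical keys for this sketch; the issue only asks that
        # message, reason, and type are present as attributes.
        "k8s.event.reason": k8s_event["reason"],
        "k8s.event.message": k8s_event["message"],
        "k8s.event.type": k8s_event["type"],
    }
    name_key = _NAME_KEYS.get(obj["kind"])
    if name_key is not None:
        attributes[name_key] = obj["name"]

    # Using the reason as the event name is an assumption of this sketch.
    return {
        "name": k8s_event["reason"],
        "timestamp": timestamp_ns,
        "attributes": attributes,
    }
```

The returned dict lines up with the arguments of a span's `add_event(name, attributes=..., timestamp=...)` call in the OTel SDKs, where the timestamp is given in nanoseconds since the Unix epoch.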

DoD

K8s-generated events for the workload being deployed by Keptn are attached to Keptn Spans.

### Tasks
- [ ] Research: Define list of events that Keptn should observe

odubajDT commented 4 months ago

INITIAL THOUGHTS:

Two questions come to mind when looking deeper into this ticket.

Do we want to attach K8s Events to the traces only when the deployment fails, that is, only when the WorkloadDeploy phase fails?

Does it make sense to attach the K8s Events to the app trace? I would suggest adding the information to the span that represents the failed phase, in the same way that information is added to the pre/post-deployment phases when those phases fail. Concretely, I would attach the K8s Events to the span representing the WorkloadDeploy phase.

odubajDT commented 4 months ago

List of Events that are available during deployment of workloads (Pod, Deployment, ReplicaSet, StatefulSet, DaemonSet):

Deployment:

Normal events:

ScalingReplicaSet: Indicates scaling of a ReplicaSet due to the deployment.
SuccessfulCreate: Indicates successful creation of a new deployment.
SuccessfulDelete: Indicates successful deletion of a deployment.

Warning events:

FailedCreate: Indicates a failure to create a new deployment.
FailedDelete: Indicates a failure to delete a deployment.

ReplicaSet:

Normal events:

SuccessfulCreate: Indicates successful creation of a ReplicaSet.
SuccessfulDelete: Indicates successful deletion of a ReplicaSet.

Warning events:

FailedCreate: Indicates a failure to create a ReplicaSet.
FailedDelete: Indicates a failure to delete a ReplicaSet.

StatefulSet:

Normal events:

SuccessfulCreate: Indicates successful creation of a StatefulSet.
SuccessfulDelete: Indicates successful deletion of a StatefulSet.

Warning events:

FailedCreate: Indicates a failure to create a StatefulSet.
FailedDelete: Indicates a failure to delete a StatefulSet.

DaemonSet:

Normal events:

SuccessfulCreate: Indicates successful creation of a DaemonSet.
SuccessfulDelete: Indicates successful deletion of a DaemonSet.

Warning events:

FailedCreate: Indicates a failure to create a DaemonSet.
FailedDelete: Indicates a failure to delete a DaemonSet.

Pod:

Normal events:

Scheduled: Indicates successful scheduling of a pod onto a node.
Pulled: Indicates successful pulling of the pod's container image.
Created: Indicates successful creation of a pod.
Started: Indicates successful start of the pod's containers.
Killing: Indicates termination of the pod due to user request or scaling.
SuccessfulMountVolume: Indicates successful mounting of a volume for the pod.
BackOff: Indicates that a container in the pod is repeatedly crashing.

Warning events:

FailedScheduling: Indicates a failure to schedule the pod onto any node.
FailedCreate: Indicates a failure to create the pod.
FailedAttachVolume: Indicates a failure to attach a volume to the pod.
FailedMount: Indicates a failure to mount a volume for the pod.
FailedSync: Indicates a failure to synchronize the pod's status.
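
As a starting point for the research task, the list above can be written down as a lookup table that an event watcher could consult. This is a sketch only: the names `OBSERVED_EVENTS` and `classify` are made up for this example, and the table simply mirrors the grouping in this comment rather than any type reported by the K8s API.

```python
# Reasons shared by the Deployment, ReplicaSet, StatefulSet, and DaemonSet lists.
_CONTROLLER_NORMAL = {"SuccessfulCreate", "SuccessfulDelete"}
_CONTROLLER_WARNING = {"FailedCreate", "FailedDelete"}

# Kind -> event type -> set of reasons, mirroring the list in this comment.
OBSERVED_EVENTS = {
    kind: {"Normal": set(_CONTROLLER_NORMAL), "Warning": set(_CONTROLLER_WARNING)}
    for kind in ("Deployment", "ReplicaSet", "StatefulSet", "DaemonSet")
}
OBSERVED_EVENTS["Deployment"]["Normal"].add("ScalingReplicaSet")
OBSERVED_EVENTS["Pod"] = {
    "Normal": {
        "Scheduled", "Pulled", "Created", "Started",
        "Killing", "SuccessfulMountVolume", "BackOff",
    },
    "Warning": {
        "FailedScheduling", "FailedCreate", "FailedAttachVolume",
        "FailedMount", "FailedSync",
    },
}


def classify(kind: str, reason: str):
    """Return "Normal" or "Warning" if the reason is in the list, else None."""
    for event_type, reasons in OBSERVED_EVENTS.get(kind, {}).items():
        if reason in reasons:
            return event_type
    return None
```

A watcher that only cares about failures could then keep events for which `classify(kind, reason) == "Warning"` and drop the rest.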

keptn / lifecycle-toolkit