project-codeflare / multi-cluster-app-dispatcher

Holistic job manager on Kubernetes
Apache License 2.0

MCAD controller logs #690

Open ordavidov opened 1 year ago

ordavidov commented 1 year ago

Describe the Bug

Steps to Reproduce the Bug

The MCAD log stats come from the log file year=2023/month=11/day=13/97741eb604.2023-11-13.2300.json.gz in dipc-prod-logs, which covers the one-hour period from 2023-11-13 23:00:00 to 2023-11-13 23:59:59. The stats are summarized below.

Here is the stats summary by log event type:

MCAD Log Event Type | # Log Events
deleteJob | 58423
processCleanupJob | 318
Unknown | 293
UpdatePod | 251
AddPod | 67

Here are the top 5 results of repeated job logs on the same job ID:

Job ID | # Log Events
66d95bbd-e9ca-40ed-966e-863a5f60a8d1 | 2807
1a839594-a273-46ff-b83c-824e11645ba0 | 2740
a03b1fbb-0116-42d6-a822-1f09bd2b0238 | 2160
e73488e7-8a41-49f3-94a3-5a4f51d03f93 | 2160
4e6dc4ba-41b6-49b9-bd50-a6fc3e818349 | 2160
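
For anyone who wants to reproduce counts like these, here is a minimal Python sketch. It assumes the file is gzip-compressed JSON Lines (one JSON object per line) and that each record has `event_type` and `job_id` fields. Both field names and the layout are hypothetical, since the actual log schema is not shown in this issue; adjust them to match the real records.

```python
import gzip
import json
from collections import Counter


def summarize_mcad_log(path, top_n=5):
    """Count log events per type and per job ID in a gzipped JSON Lines file.

    Assumptions (not confirmed by this issue): one JSON object per line,
    with hypothetical fields "event_type" and "job_id".
    """
    by_type = Counter()
    by_job = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Records without an event type are bucketed as "Unknown".
            by_type[record.get("event_type", "Unknown")] += 1
            job_id = record.get("job_id")
            if job_id is not None:
                by_job[job_id] += 1
    # Event types sorted by count, plus the top_n most-logged job IDs.
    return by_type.most_common(), by_job.most_common(top_n)
```

Calling `summarize_mcad_log("97741eb604.2023-11-13.2300.json.gz")` on a local copy of the file would return the two summaries as lists of `(key, count)` pairs, provided the assumed schema holds.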

What Have You Already Tried to Debug the Issue?

My understanding is that MCAD reports repeated attempts to delete a job, even though it has already been deleted.
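
To put the numbers above in proportion, here is a small arithmetic sketch using only the counts reported in this issue. The per-second rate assumes the events are spread over the full one-hour window (3600 seconds), which the summary alone does not confirm.

```python
# Event counts copied from the summary table in this issue.
event_counts = {
    "deleteJob": 58423,
    "processCleanupJob": 318,
    "Unknown": 293,
    "UpdatePod": 251,
    "AddPod": 67,
}

total = sum(event_counts.values())
delete_share = event_counts["deleteJob"] / total

# Highest per-job event count from the top 5 table in this issue.
busiest_job_events = 2807
window_seconds = 3600  # the log file covers one hour

print(f"total events: {total}")                # total events: 59352
print(f"deleteJob share: {delete_share:.1%}")  # deleteJob share: 98.4%
print(f"busiest job: {busiest_job_events / window_seconds:.2f} events/second")
```

So deleteJob accounts for roughly 98% of all log events in that hour, which is consistent with the same jobs being logged over and over.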

Expected Behavior

MCAD controller logs should accurately reflect job handling on the Vela cluster.

Additional Context
