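Describe the Bug
MCAD logs multiple delete attempts on the same job ID.
The log is dominated by deleteJob events: 58423 of the 59352 events summarized below (about 98%), with very few events of any other type.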
Steps to Reproduce the Bug
The MCAD log stats come from the log file year=2023/month=11/day=13/97741eb604.2023-11-13.2300.json.gz in dipc-prod-logs, which covers the one-hour period from 2023-11-13 23:00:00 to 2023-11-13 23:59:59. The stats are summarized below.
Here is the stats summary by log event type:
MCAD Log Event Type | # Log Events
deleteJob | 58423
processCleanupJob | 318
Unknown | 293
UpdatePod | 251
AddPod | 67
Here are the top 5 job IDs by number of log events recorded for the same job ID:
Job ID | # Log Events
66d95bbd-e9ca-40ed-966e-863a5f60a8d1 | 2807
1a839594-a273-46ff-b83c-824e11645ba0 | 2740
a03b1fbb-0116-42d6-a822-1f09bd2b0238 | 2160
e73488e7-8a41-49f3-94a3-5a4f51d03f93 | 2160
4e6dc4ba-41b6-49b9-bd50-a6fc3e818349 | 2160
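For anyone who wants to recompute the two summaries above, here is a minimal sketch. It assumes the log file holds one JSON object per line, and the field names `event` and `job_id` are hypothetical placeholders; the real MCAD log schema may differ and the keys would need adjusting.

```python
import gzip
import json
from collections import Counter


def summarize_log(path, event_key="event", job_key="job_id", top_n=5):
    """Count log events by type and list the most frequently logged job IDs.

    Assumption: one JSON object per line. event_key and job_key are
    hypothetical field names, not taken from the actual MCAD log format.
    """
    by_event = Counter()
    by_job = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                by_event["Unparseable"] += 1
                continue
            if not isinstance(record, dict):
                by_event["Unparseable"] += 1
                continue
            # Records without an event type are bucketed as "Unknown".
            by_event[record.get(event_key, "Unknown")] += 1
            job_id = record.get(job_key)
            if job_id is not None:
                by_job[job_id] += 1
    return by_event, by_job.most_common(top_n)
```

Calling `summarize_log("97741eb604.2023-11-13.2300.json.gz")` on a local copy of the file would return a counter of events per type and a list of `(job_id, count)` pairs for the most repeated job IDs.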
What Have You Already Tried to Debug the Issue?
My understanding is that MCAD reports repeated attempts to delete a job, even though it has already been deleted.
Expected Behavior
MCAD controller logs accurately reflect job handling on the Vela cluster.
Additional Context
Add as applicable and when known:
Cloud: IBM COS dipc-prod-logs. See here for access.