rfeng2023 / mmcloud


Jobs stuck at Executing status #48

Closed: gaow closed this issue 6 months ago

gaow commented 6 months ago

@Ashley-Tung could you look into these jobs in our new opcenter:

[screenshot: jobs stuck at Executing in the opcenter]

their job events are like:

2024-02-23T05:04:54.24: Determined instance params: Zone:us-east-1b,InstType:r5.xlarge,CPU:4,Memory:32,OnDemand:false
2024-02-23T05:04:54.24: Job status changed: [Executing -> Floating]
2024-02-23T05:04:54.24: Wave-riding from CPU 2 -> 4, mem 16 -> 32.
2024-02-23T05:06:55.645: Wave-riding failed as checkpoint failed, error: Cloud VM has been terminated (code: 8150).
2024-02-23T05:06:55.645: Job status changed: [Floating -> Executing]
2024-02-23T05:07:14.664: workload on 172.31.36.226 are checkpointed at 2024-02-23T05:07:14.655Z on i-09b0b8b02752f20b5

They tried to float due to lack of memory, but got stuck on a checkpoint without moving forward. Fortunately the cost does not seem to go up even while the status is Executing. Still, this is confusing.
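
For anyone who wants to flag jobs in this state automatically, here is a minimal sketch that scans event lines like the ones above. The event text and timestamp format are taken from the excerpt; the 12-hour threshold and the helper names are purely illustrative, not part of MMCloud.

```python
from datetime import datetime, timezone

# Event lines copied from the job events above; in practice they would be
# read from the opcenter's job event listing.
EVENTS = [
    "2024-02-23T05:04:54.24: Job status changed: [Executing -> Floating]",
    "2024-02-23T05:06:55.645: Wave-riding failed as checkpoint failed, error: Cloud VM has been terminated (code: 8150).",
    "2024-02-23T05:06:55.645: Job status changed: [Floating -> Executing]",
    "2024-02-23T05:07:14.664: workload on 172.31.36.226 are checkpointed at 2024-02-23T05:07:14.655Z on i-09b0b8b02752f20b5",
]

def event_time(line: str) -> datetime:
    # The leading "YYYY-MM-DDTHH:MM:SS.fff" stamp ends at the first ": ".
    stamp = line.split(": ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f").replace(tzinfo=timezone.utc)

def looks_stuck(events: list[str], max_idle_hours: float = 12.0) -> bool:
    """Flag a job whose wave-riding checkpoint failed and whose event log
    has then been silent for longer than max_idle_hours."""
    if not any("Wave-riding failed as checkpoint failed" in e for e in events):
        return False
    idle = datetime.now(timezone.utc) - event_time(events[-1])
    return idle.total_seconds() > max_idle_hours * 3600

print(looks_stuck(EVENTS))  # True once the job has sat idle long enough
```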

gaow commented 6 months ago

Currently it seems only 2 of them are left:

[screenshot: the two remaining stuck jobs]

So it still works; it just takes quite a long time to get back on track. For example, these two have been in limbo for 13 hours.

Ashley-Tung commented 6 months ago

[Pasted from Slack]: It looks like only three of your jobs are having this issue. There are other Executing jobs, but they seem to be actually executing. The three jobs I see are s6er67nh3xtwr5f0s53sk, 6ma9xj7jyllm5itghiayp, and vw6a4h05titnsrh2hgaar. Your opcenter.log explains the error a little more:

time="2024-02-23T05:06:55.645" level=error msg="failed to describe AWS instance" error="InvalidInstanceID.NotFound: The instance ID 'i-09b0b8b02752f20b5' does not exist\n\tstatus code: 400, request id: e811efd3-492d-4b58-bf91-ae2fb1c8b1d5"
time="2024-02-23T05:06:55.645" level=info msg="Wave-riding failed as checkpoint failed, error: Cloud VM has been terminated (code: 8150)." instance=i-09b0b8b02752f20b5 job=s6er67nh3xtwr5f0s53sk
...
time="2024-02-23T05:07:14.664" level=info msg="Ready to handle host event" spec="&{workSuspendedForOOM workload on 172.31.36.226 are checkpointed at 2024-02-23T05:07:14.655Z i-09b0b8b02752f20b5 0  map[] 0s}"

It seems the opcenter cannot find the instance ID. This is strange to me, so I submitted a ticket for our engineering team to take a look. I suspect the cost is not going up because the status is "workload_suspended," even though MMC shows it as "Executing".
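
For reference, the failing lookup can be reproduced outside the opcenter with a small boto3 sketch. The instance ID and the us-east-1 region come from the logs above; the helper itself is only illustrative and assumes AWS credentials are already configured.

```python
import boto3
from botocore.exceptions import ClientError

def instance_exists(instance_id: str, region: str = "us-east-1") -> bool:
    """Return True if EC2 still knows about instance_id, False if AWS
    reports InvalidInstanceID.NotFound (the error seen in opcenter.log)."""
    ec2 = boto3.client("ec2", region_name=region)
    try:
        ec2.describe_instances(InstanceIds=[instance_id])
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "InvalidInstanceID.NotFound":
            return False
        raise

print(instance_exists("i-09b0b8b02752f20b5"))
```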

After some time, it seems the spot instance it was on got interrupted, which got the pipeline running again. This is for s6er67nh3xtwr5f0s53sk, the job you initially showed me. It seems to be fine right now:

2024-02-23T05:06:55.645: Wave-riding failed as checkpoint failed, error: Cloud VM has been terminated (code: 8150).
2024-02-23T05:06:55.645: Job status changed: [Floating -> Executing]
2024-02-23T05:07:14.664: workload on 172.31.36.226 are checkpointed at 2024-02-23T05:07:14.655Z on i-09b0b8b02752f20b5
2024-02-23T21:25:44.138: Got spot interruption notice on i-09b0b8b02752f20b5
2024-02-23T21:25:44.138: i-09b0b8b02752f20b5 is interrrupted, will recover
2024-02-23T21:25:44.14: Job status changed: [Executing -> Floating]

6ma9xj7jyllm5itghiayp, another job with the same issue, has completed. vw6a4h05titnsrh2hgaar is still stuck; it looks like it has been stuck for around 12 hours. I would suggest waiting it out for that last job. It does not seem to be charging you either. I'll be sure to bring this up in the ticket I submitted to the engineering team.

Ashley-Tung commented 6 months ago

Update from Slack: the latest hotfix will address this.