Closed: gaow closed this issue 5 months ago
This is documentation from job.events:
We float from an r5.xlarge to an r5.2xlarge due to high memory usage:
2024-03-11T06:22:20.186: Job status changed: [Executing -> Floating]
2024-03-11T06:22:20.186: Wave-riding from CPU 4 -> 8, mem 32 -> 64.
2024-03-11T06:25:29.841: workload on 172.31.42.83 are checkpointed at 2024-03-11T06:25:29.834Z on i-0602715cc0d4142b3
2024-03-11T06:25:29.975: Job is being migrating/optimizing now, check whether event can be handled
2024-03-11T06:25:29.975: No need to handle interruption as job is floating to new host
2024-03-11T06:25:29.975: Mark host status directly to skip checkpoint
2024-03-11T06:25:37.896: Created instance i-0a5c7e8363558ac6c(r5.2xlarge-Spot) at us-east-1b, waiting for it to initialize
Workload suspended - TODO: confirm with engineering on significance
Skip checkpointing as host status is workload_suspended, vm-state is 1
Failed to mount data volume
2024-03-11T06:33:42.881: Failed to attach vol-09298f6013808e71f (/mnt/float-data) on host i-0a5c7e8363558ac6c at (us-east-1b), error: exit status 1
2024-03-11T06:33:42.881: Failed to migrate to i-0a5c7e8363558ac6c. Error: exit status 1. Reclaim new host.
2024-03-11T06:34:16.46: Failed to migrate: exit status 1. Try to resume workload on original instance i-0602715cc0d4142b3.
2024-03-11T06:34:20.183: Attached vol-019fabe2beb977a87 (/mnt/float-image) on host i-0602715cc0d4142b3 at us-east-1b
2024-03-11T06:34:23.346: Attached vol-067bb9d55ad86d9cb (/home/bst2126/input) on host i-0602715cc0d4142b3 at us-east-1b
2024-03-11T06:40:52.067: Failed to attach vol-09298f6013808e71f (/mnt/float-data) on host i-0602715cc0d4142b3 at (us-east-1b), error: exit status 1
2024-03-11T06:40:52.067: Job status changed: [Floating -> FailToComplete]. Mark job done.
2024-03-11T06:40:52.068: Failed to resume workload on original instance i-0602715cc0d4142b3, error: exit status 1. Job failed
2024-03-11T06:40:52.158: Wave-riding failed, error: exit status 1.
2024-03-11T06:40:54.411: Ready to reclaim host i-0602715cc0d4142b3
2024-03-11T06:40:58.656: Detached volume vol-09298f6013808e71f from i-0602715cc0d4142b3
2024-03-11T06:41:28.387: Ready to reclaim volume vol-09298f6013808e71f
2024-03-11T06:41:28.535: Ready to reclaim volume vol-019fabe2beb977a87
2024-03-11T06:41:28.648: Ready to reclaim volume vol-067bb9d55ad86d9cb
2024-03-11T06:42:06.139: Detached volume vol-019fabe2beb977a87 from i-0602715cc0d4142b3
2024-03-11T06:42:06.445: Detached volume vol-067bb9d55ad86d9cb from i-0602715cc0d4142b3
2024-03-11T06:43:06.45: No container found on host i-0602715cc0d4142b3, error Get "https://172.31.42.83:443/api/v1/containers": context deadline exceeded (Client.Timeout exceeded while awaiting headers), rerun the job
2024-03-11T06:45:18.564: Failed to resume workload on original instance i-0602715cc0d4142b3, error: Post "https://172.31.42.83:443/api/v1/jobs": dial tcp 172.31.42.83:443: connect: connection timed out. Job failed
Recent failures all have to do with floating, even though we previously configured our jobs to not be stingy on the specified memory precisely to avoid floats. I now see that when many jobs try to float at the same time, issues seem more likely. It looks like float needs more stress testing under this scenario.
In terms of float testing, we have previously stress-tested with 512 float-heavy workloads at once, which should cover the intensity of the runs here. I have already submitted a ticket for Engineering to take a look at this.
[UPDATE] A PR addressing this has been merged. Originally, the data volume attempted to attach to a new host that became inaccessible, and then, while still attaching to the new host, it attempted to fall back to the original host. This was a timing issue that has now been addressed. Waiting on confirmation from Engineering on whether this will be in 2.5.1.
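The timing issue described above can be illustrated with a minimal sketch: a fallback path re-attaches a volume to the original host while the migration path still holds it on the failed new host, so the two paths must be serialized and the stale attachment released first. All names here (`Volume`, `fall_back`) are illustrative stand-ins, not MMC internals.

```python
import threading

class Volume:
    """Toy model of a data volume that can be attached to one host at a time."""

    def __init__(self, vol_id):
        self.vol_id = vol_id
        self.attached_to = None
        self._lock = threading.Lock()

    def attach(self, host):
        # Serialize attach/detach on a per-volume lock so the fallback path
        # cannot race the migration path, and always release a stale
        # attachment (e.g. to an unreachable new host) before re-attaching.
        with self._lock:
            if self.attached_to not in (None, host):
                self._detach_locked()
            self.attached_to = host

    def _detach_locked(self):
        self.attached_to = None

def fall_back(volume, original_host):
    # Fallback proceeds only once the volume is free of the failed new host.
    volume.attach(original_host)
```

The key ordering fix is that detach-from-new-host strictly precedes attach-to-original-host, under one lock, rather than the two happening concurrently.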
[UPDATE] The PR is now expected to be a part of 2.5.2
Hi @gaow , we have a fix for this in 2.5.2. The root cause of this issue was that when we sent a request to the AWS API, we received a timeout error from their side. This is not necessarily an issue with MMC, but we have added more retries should this case happen again.
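The "more retries" mitigation mentioned above typically looks like retrying the flaky cloud call with exponential backoff. This is a generic sketch under that assumption; `call` stands in for the real AWS request (e.g. a volume attach), and the details are not taken from MMC's actual code.

```python
import time

def with_retries(call, attempts=5, base_delay=1.0):
    """Invoke call(); on a timeout, retry with exponential backoff.

    Re-raises the last timeout once all attempts are exhausted.
    """
    for i in range(attempts):
        try:
            return call()
        except TimeoutError:
            if i == attempts - 1:
                raise
            # Back off 1s, 2s, 4s, ... before the next attempt.
            time.sleep(base_delay * (2 ** i))
```

In practice one would also cap the backoff and add jitter so that many jobs floating at once do not retry in lockstep against the same API.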
As of now, both opcenters are 2.5.1. CC'ing @rfeng2023 about this as well
@gaow can you close this? This was fixed in 2.5.2
Hi @Ashley-Tung I think this error shows up again in v2.5.6
job w0o1bmyqarwg0e2tjnsyb on opcenter 44.222.241.133:
https://44.222.241.133/#/opcenter/jobs/w0o1bmyqarwg0e2tjnsyb
2024-07-08T19:11:55.63: Failed to poll job status from i-0729733e0191760a0, error Get "https://10.1.22.122:443/api/v1/jobs": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Hi @rfeng2023 ,
The original cause of the issue was the AWS API sending us a timeout error from their side. This is not necessarily an issue with MMC, and more retries have already been put in place. Since the job only ran for 30 minutes, could you resubmit it and we'll see whether the error persists? cc @yiweizh-memverge
@Ashley-Tung in our latest run:
https://23.22.157.8/#/log?jobId=99g0zwlb2638n1knjmxpc&logName=job.events