rfeng2023 / mmcloud


Failed to restore error leads to failed jobs #51

Closed gaow closed 8 months ago

gaow commented 8 months ago

There are many FailToComplete jobs if you search by that status in our new opcenter. Here is an example:

https://23.22.157.8/#/log?jobId=uudnsay4x9amv240jna6y&logName=job.events

it says:

2024-02-22T13:11:27.087: Ready to recover {ID:c644c5892127,Checkpointed:true,Running:false} on host i-05b107ddb653beed4
2024-02-22T13:11:27.092: Job is submitted to instance i-05b107ddb653beed4 successfully
2024-02-22T13:11:27.096: Job status changed: [Floating -> Executing]
2024-02-22T13:11:27.096: Job floated to instance i-05b107ddb653beed4 (16 CPU/128 GB) (Spot)
2024-02-22T13:11:27.128: Requeue job 5nxod4bny0b3xptjkrau3
2024-02-22T13:11:27.158: Found previous container c644c5892127 with checkpoint, ready to resume
2024-02-22T13:11:32.169: Requeue job 5nxod4bny0b3xptjkrau3
2024-02-22T13:11:32.199: Found previous container c644c5892127 with checkpoint, ready to resume
2024-02-22T13:11:32.594: Requeue job 5nxod4bny0b3xptjkrau3
2024-02-22T13:11:32.626: Found previous container c644c5892127 with checkpoint, ready to resume
2024-02-22T13:11:37.243: Failed to poll job status from i-05b107ddb653beed4, error Failed to restore: exit status 125 (code: 9011)
2024-02-22T13:11:37.304: Job status changed: [Executing -> FailToComplete]. Mark job done.
2024-02-22T13:11:37.304: Job failed. Failed to restore: exit status 125 (code: 9011)
2024-02-22T13:11:37.409: Instance specification changed during the job, do not generate insights
2024-02-22T13:11:37.719: Ready to reclaim host i-05b107ddb653beed4
2024-02-22T13:12:10.684: Ready to reclaim volume vol-0236217089a3d3431
2024-02-22T13:12:10.798: Ready to reclaim volume vol-05e3c3cd226de495b
2024-02-22T13:12:10.91: Ready to reclaim volume vol-0dc20861bb5ee0f2b

Again, for jobs that seem to fail because of float and not because of the user, it would be great if float could keep retrying a bit longer.

Ashley-Tung commented 8 months ago

Hi @gaow , I see the 62 jobs that failed are from 2024-02-20 18:00:27 to 2024-02-20 18:30:50.

After going through all 62 jobs, I can break down the errors for you.

Error 1 - Volume Quota

Over half of your errors come from the AWS error message "You have exceeded your maximum gp3 storage limit of 50 TiB in this region." Some jobs failed because a volume could not be mounted, again due to the gp3 quota. Such jobs include, but are not limited to:
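Separately, if you want to confirm the current gp3 ceiling on your side, the quota can be inspected with the AWS Service Quotas CLI. This is only a generic sketch, not something specific to your opcenter setup; it assumes the Service Quotas API is reachable from your account, and the exact quota name wording may differ:

# list EBS quotas whose name mentions gp3 (generic sketch; requires the Service Quotas API)
aws service-quotas list-service-quotas --service-code ebs \
  --query "Quotas[?contains(QuotaName, 'gp3')].[QuotaName,Value]" \
  --output table

A quota increase for the gp3 storage limit would then be requested through the Service Quotas console or AWS support.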

Error 2 - Code 9011 - Feedback from Engineering

The second most common error was Failed to restore: exit status 125 (code: 9011). The error appears after this message:

Failed to get job status from i-0fd796901afaa6eab at #0 attempt, error: Get "https://172.31.15.227:443/api/v1/jobs": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

This message comes from AWS's side: our request timed out there, and usually there is no issue in retrying. But it looks like right afterwards the instance gets preempted, and no snapshot is taken, likely because of the timeout. As a result there are no snapshots for the job to recover from, so when float attempts to resume the job, the restore fails.

2024-02-21T10:39:06.101: Failed to get job status from i-0fd796901afaa6eab at #0 attempt, error: Get "https://172.31.15.227:443/api/v1/jobs": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-02-21T10:39:06.305: Failed to poll job status from i-0fd796901afaa6eab, error Get "https://172.31.15.227:443/api/v1/jobs": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-02-21T10:39:06.383: Instance i-0fd796901afaa6eab is preempted, try to recover host
...
2024-02-21T10:40:31.568: Job status changed: [Executing -> Floating]
2024-02-21T10:40:31.568: Ready to recover workload of i-0fd796901afaa6eab
2024-02-21T10:40:31.63: Instance i-0fd796901afaa6eab is down
2024-02-21T10:40:31.709: Attempt to find snapshot for job to recover
2024-02-21T10:40:31.709: No snapshots found for job to recover
...
2024-02-21T10:48:34.732: Requeue job 1gpcltvaamyseeqf15on1
2024-02-21T10:48:34.762: Found previous container 16106374e4a1 with checkpoint, ready to resume
2024-02-21T10:49:55.579: Failed to poll job status from i-0ea24631643a43335, error Failed to restore: exit status 125 (code: 9011)
2024-02-21T10:49:55.705: Job status changed: [Executing -> FailToComplete]. Mark job done.
2024-02-21T10:49:55.705: Job failed. Failed to restore: exit status 125 (code: 9011)

Such jobs include, but are not limited to:

What is happening is that the workload floated but got interrupted in the middle of restoration, which led to floating to the next instance. Since a restoration had already been attempted, the checkpoint was likely corrupted, and the second float ended in failure. This was an unfortunate timing issue.

Error 3 - Code 8133 - Can Ignore

The error message is Failed to create spot instance request (code: 8133). The trend for this error is that the job wave-rides to an instance with larger memory but fails to create a spot instance request. It then fails to attach the volume to the instance, yet if we scroll up in job.events, the volume was already mounted to that very instance and was never unmounted.

2024-02-21T04:16:48.63: Mounted vol-05d01b851d513220a:/home/bst2126/input to i-054e5906ccc2f47e2
...
2024-02-21T05:53:49.439: Worker node is using swap file, need more memory on i-054e5906ccc2f47e2
2024-02-21T05:53:49.439: Try to determine wave-riding spec for CPU 4, mem 32, direction: 16, instType r5.xlarge, cpu: GenuineIntel, allowList: [], denyList: []
...
2024-02-21T05:55:01.843: Job status changed: [Executing -> Floating]
2024-02-21T05:55:01.843: Wave-riding from CPU 4 -> 8, mem 32 -> 64.
2024-02-21T05:55:46.412: workload on 172.31.14.199 are checkpointed at 2024-02-21T05:55:46.406Z on i-054e5906ccc2f47e2
2024-02-21T05:55:57.877: Failed to create instance to migrate. Error: Failed to create spot instance request (code: 8133)
...
2024-02-21T06:02:25.33: Failed to attach vol-05d01b851d513220a (/home/bst2126/input) on host i-054e5906ccc2f47e2 at (us-east-1c), error: exit status 1
2024-02-21T06:02:25.33: Job status changed: [Floating -> FailToComplete]. Mark job done.
2024-02-21T06:02:25.33: Failed to resume workload on original instance i-054e5906ccc2f47e2, error: exit status 1. Job failed

You have actually hit a very similar situation before (but the error code used to be Unsupported instance type (code: 8149)). I have confirmed with Engineering that this scenario will be patched in 2.5.

All the jobs that have this error are:

Error 4 - Code 8001 - Feedback from Engineering

The trend for this issue is that there is a rebalance recommendation (which comes from AWS's side) and, at the same time, the instance is interrupted. The checkpoint completes without error and float waits for a new instance to initialize, but then the host is not found:

2024-02-24T00:48:04.545: Got rebalance recommendation notice on i-019a5f70b34156219
2024-02-24T00:48:04.545: i-019a5f70b34156219 is interrrupted, will recover
...
2024-02-24T00:52:09.973: Created instance i-0f0e98d2399e6908c(r5.4xlarge-Spot) at us-east-1c, waiting for it to initialize
2024-02-24T00:57:39.197: Job failed because Host not found (code: 8001)

Looking at opcenter.log, it seems that the instance was registered but then terminated immediately

time="2024-02-24T00:57:41.522" level=info msg="Registered one new host" addr="172.31.0.107:443" cpu=16 host=ip-172-31-0-107.ec2.internal inst=i-0f0e98d2399e6908c instType=r5.4xlarge mem=128
time="2024-02-24T00:57:43.088" level=warning msg="Failed to wait the instance to be terminated" error="Timed out (code: 1023)" inst=i-0395a6e8db6f2fa17
time="2024-02-24T00:57:43.863" level=info msg="Ready to reclaim host" inst=i-0f0e98d2399e6908c

The only job that has this error:

The timestamp of this error coincides closely with the last opcenter restart, which would lead to a communication issue and a host not being found.

[admin]$ float status
id: 7cfc0037-3ac4-41fe-999d-da8d278c3108
Server Status: normal
API Request Status: normal
License Status: valid
Init Time: "2024-01-16T19:41:42Z"
Up Time: "2024-02-24T00:57:10Z"
Current Time: "2024-02-27T00:47:14Z"

The timestamp of 2024-02-24T00:57... coincides with the host not being found.

Error 5 - Code 8005 - Can Ignore

The error message is Wave-riding failed, error: Host is incompatible (code: 8005). This was an error you hit before; it says the r7i instance is incompatible. These jobs were run on 2/20, which was before we got approval from AWS to support r7* instances. You can check the timestamp of my Slack message supporting this, as these instances should be supported now.

2024-02-21T04:55:50.322: Failed to create instance to migrate. Error: Unsupported instance type (code: 8149)
2024-02-21T04:55:50.322: Failed to create host with r7i.xlarge in us-east-1b, error Unsupported instance type (code: 8149). Try other intsance types
...
2024-02-21T04:55:50.787: Found candidate instance type r7iz.xlarge, price 0.14740000665187836
2024-02-21T04:55:50.813: Determined instance params: Zone:us-east-1b,InstType:r7i.xlarge,CPU:4,Memory:32,OnDemand:false
2024-02-21T04:55:50.82: Wave-riding failed, error: Host is incompatible (code: 8005).
2024-02-21T05:00:02.907: Failed to attach vol-002cf477bf507c770 (/mnt/float-data) on host i-041bc9a3c72db0e2b at (us-east-1b), error: exit status 1
2024-02-21T05:00:02.907: Job status changed: [Floating -> FailToComplete]. Mark job done.

Such jobs include, but are not limited to:

gaow commented 8 months ago

Thanks @Ashley-Tung this is impressively thorough and well organized!

It feels like there might be an engineering solution to Error 2?

For Error 4, it would help if we can figure out why the opcenter was rebooted. I worry that it was rebooted because we submitted too many jobs? If that is the case, we would worry that submitting too many jobs could "crash" the opcenter.

Ashley-Tung commented 8 months ago

For Error 2, Engineering suggested raising cloud.handleRebalanceMemThreshold in your config from 64 to 128, which I have done for you. This controls how much memory a job must be using before float acts on a rebalance recommendation notice from AWS; the larger the value, the longer we delay that handling. When AWS is worried its resources are under pressure, you will see more rebalance recommendations, and since we can already handle floating in case of memory issues, those notices are often false alarms. This should make Error 2 appear less frequently, but the rebalance handling will still occur if a job is using 128 GB of memory.
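For reference, this is the kind of change I applied on your behalf; it uses the same float config syntax as the provider.denyList example further down in this thread and assumes admin access to the opcenter, so you can adjust it yourself later if needed:

float config set cloud.handleRebalanceMemThreshold 128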

For Error 4, I am currently working with Engineering to figure out the reason for the reboot.

Ashley-Tung commented 8 months ago

Update from Slack: the latest hotfix will address Error 4.

gaow commented 8 months ago

@Ashley-Tung we are rerunning some of the previously failed jobs, and have these errors so far:

[screenshots of the error messages]

Maybe these all belong to the issues you summarized above that will be addressed in 2.5, but just in case some of them are new, could you take another look?

Ashley-Tung commented 8 months ago

[Pasted from Slack]

Hi @gaow , I looked at the latest FailToComplete and NoAvailableHost jobs and both have the same underlying issue: not able to support the r7iz.2xlarge instance

There are about 50 NoAvailableHost issues. One example is https://23.22.157.8/#/opcenter/jobs/kq7yem28zt2z3rz6pmet4. The trend is: Spot interruption -> new instance -> finds r7iz.2xlarge -> fail. There is no information in the job.events log, but I found this in the opcenter.log:

time="2024-02-28T19:13:30.824" level=warning msg="Failed to create instance" error="UnsupportedOperation: The instance configuration for this AWS Marketplace product is not supported. Please see the AWS Marketplace site for more information about supported instance types, regions, and operating systems.\n\tstatus code: 400, request id: f45f5482-3bbe-4067-91af-be6403262d3e"
time="2024-02-28T19:13:30.824" level=error msg="Failed to create cloud instance" error="Unsupported instance type (code: 8149)"
time="2024-02-28T19:13:30.824" level=info msg="Failed to create new instance because: Unsupported instance type (code: 8149)" job=kq7yem28zt2z3rz6pmet4
time="2024-02-28T19:13:30.824" level=info msg="Job completed, ready to reclaim job resource" exitCode=-1 id=b450993232e4d9162feac9e77694cc5783ba11529476f88c99fede3d9ffff69e job=kq7yem28zt2z3rz6pmet4 status=NoAvailableHost
time="2024-02-28T19:13:30.84" level=info msg="Instance is terminated" inst=i-09e4e25a97c21b6ee
...
time="2024-02-28T19:13:31.058" level=error msg="Failed to replace host" error="Unsupported instance type (code: 8149)" spec=i-00640eb1b299324d5

There are 14 FailToComplete jobs. The error comes from floating to the r7iz types. An example is here: https://23.22.157.8/#/opcenter/jobs/bx01uuq1kpg9iv6ussl92

2024-02-28T02:28:22.029: Determined instance params: Zone:us-east-1a,InstType:r7iz.2xlarge,CPU:8,Memory:64,OnDemand:false
2024-02-28T02:28:22.034: Wave-riding failed, error: Host is incompatible (code: 8005).
2024-02-28T02:32:42.761: Failed to attach vol-00e96d2f9561581c5 (/mnt/float-data) on host i-0435f2add7d8e8044 at (us-east-1a), error: exit status 1
2024-02-28T02:32:42.761: Job status changed: [Floating -> FailToComplete]. Mark job done.
2024-02-28T02:32:42.761: Failed to resume workload on original instance i-0435f2add7d8e8044, error: exit status 1. Job failed

While both scenarios resulted in different errors, the underlying cause is the same. I have reached out to Engineering about this issue and they informed me that while AWS has whitelisted the r7* types for us, r7iz is a fairly new instance type and we actually cannot checkpoint/restore on it. Our engineers are already aware of this issue and will upgrade our checkpointing/restoring mechanism in 2.5.1.

For now, I recommend adding the r7iz types to the deny list. The option does not appear in the GUI (I can ask for a feature request on this), but it exists under the variable provider.denyList, so it can be used as:

float config set provider.denyList "r7iz*"
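With that pattern in the deny list, my expectation is that the scheduler will simply skip r7iz candidates when searching for a wave-riding or replacement instance and fall back to the next candidate type, much as it already does when an instance type cannot be created (per the Error 5 log above). I have not verified this on your opcenter yet, so please let me know if you still see r7iz being selected after the change.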