Hi @gaow, I see that the 62 failed jobs ran from 2024-02-20 18:00:27 to 2024-02-20 18:30:50.
After going through all 62 jobs, I can break down the errors for you.
Over half of your errors come from the AWS error message `You have exceeded your maximum gp3 storage limit of 50 TiB in this region`.
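In case it helps with triaging, here is a rough sketch of how the gp3 storage quota can be inspected and an increase requested with the AWS CLI. The quota code differs per account/region and has to be looked up first, and the 100 TiB target below is only an example value, so please treat this as a sketch rather than a recommendation:

```
# Find the gp3 storage quota for the region and note its QuotaCode.
aws service-quotas list-service-quotas --service-code ebs --region us-east-1 \
    --query "Quotas[?contains(QuotaName, 'gp3')].[QuotaName,QuotaCode,Value]" --output table

# Request an increase; replace <QUOTA_CODE> with the code from the output above.
# The desired value of 100 (TiB) is only an example.
aws service-quotas request-service-quota-increase --service-code ebs \
    --quota-code <QUOTA_CODE> --desired-value 100 --region us-east-1
```

Cleaning up gp3 volumes that are no longer needed would, of course, also bring you back under the current 50 TiB limit.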
Some jobs failed because a volume could not be mounted, which also traces back to the gp3 quota. Such jobs include, but are not limited to:
The second most common error was `Failed to restore: exit status 125 (code: 9011)`. The error appears after this message:
Failed to get job status from i-0fd796901afaa6eab at #0 attempt, error: Get "https://172.31.15.227:443/api/v1/jobs": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
This message comes from AWS's side: our request timed out there, and usually retrying is not a problem. But it looks like right afterwards the instance gets preempted, and no snapshot is taken, likely because of the timeout. As a result there are no snapshots for the job to recover from, and when it attempts to resume the job, the restore fails.
2024-02-21T10:39:06.101: Failed to get job status from i-0fd796901afaa6eab at #0 attempt, error: Get "https://172.31.15.227:443/api/v1/jobs": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-02-21T10:39:06.305: Failed to poll job status from i-0fd796901afaa6eab, error Get "https://172.31.15.227:443/api/v1/jobs": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-02-21T10:39:06.383: Instance i-0fd796901afaa6eab is preempted, try to recover host
...
2024-02-21T10:40:31.568: Job status changed: [Executing -> Floating]
2024-02-21T10:40:31.568: Ready to recover workload of i-0fd796901afaa6eab
2024-02-21T10:40:31.63: Instance i-0fd796901afaa6eab is down
2024-02-21T10:40:31.709: Attempt to find snapshot for job to recover
2024-02-21T10:40:31.709: No snapshots found for job to recover
...
2024-02-21T10:48:34.732: Requeue job 1gpcltvaamyseeqf15on1
2024-02-21T10:48:34.762: Found previous container 16106374e4a1 with checkpoint, ready to resume
2024-02-21T10:49:55.579: Failed to poll job status from i-0ea24631643a43335, error Failed to restore: exit status 125 (code: 9011)
2024-02-21T10:49:55.705: Job status changed: [Executing -> FailToComplete]. Mark job done.
2024-02-21T10:49:55.705: Job failed. Failed to restore: exit status 125 (code: 9011)
Such jobs include, but are not limited to:
What is happening is that the workload floated but was interrupted in the middle of restoration, which led to floating to the next instance. However, since restoration had already partially taken place, the state was likely corrupted, and the second float ended in failure. This is an unfortunate timing issue.
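If you would like to enumerate every job that hit this pattern yourselves, here is a minimal sketch, assuming the job.events files have been exported locally into a placeholder layout of ./job-logs/<job-id>/job.events (adjust the path to however you export logs):

```
# Placeholder layout: ./job-logs/<job-id>/job.events
# Prints the IDs of jobs whose events contain the failed-restore error described above.
grep -rl "Failed to restore: exit status 125" ./job-logs \
    | xargs -r dirname | xargs -r -n1 basename | sort -u
```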
The error message is `Failed to create spot instance request (code: 8133)`. The trend for this error is that the job wave-rides to an instance with larger memory but fails to create a spot instance request. It then fails to attach the volume to the instance, yet if we scroll up in the `job.events` log, the volume was already mounted to that very instance and was never unmounted.
2024-02-21T04:16:48.63: Mounted vol-05d01b851d513220a:/home/bst2126/input to i-054e5906ccc2f47e2
...
2024-02-21T05:53:49.439: Worker node is using swap file, need more memory on i-054e5906ccc2f47e2
2024-02-21T05:53:49.439: Try to determine wave-riding spec for CPU 4, mem 32, direction: 16, instType r5.xlarge, cpu: GenuineIntel, allowList: [], denyList: []
...
2024-02-21T05:55:01.843: Job status changed: [Executing -> Floating]
2024-02-21T05:55:01.843: Wave-riding from CPU 4 -> 8, mem 32 -> 64.
2024-02-21T05:55:46.412: workload on 172.31.14.199 are checkpointed at 2024-02-21T05:55:46.406Z on i-054e5906ccc2f47e2
2024-02-21T05:55:57.877: Failed to create instance to migrate. Error: Failed to create spot instance request (code: 8133)
...
2024-02-21T06:02:25.33: Failed to attach vol-05d01b851d513220a (/home/bst2126/input) on host i-054e5906ccc2f47e2 at (us-east-1c), error: exit status 1
2024-02-21T06:02:25.33: Job status changed: [Floating -> FailToComplete]. Mark job done.
2024-02-21T06:02:25.33: Failed to resume workload on original instance i-054e5906ccc2f47e2, error: exit status 1. Job failed
You have actually hit a very similar situation before (though the error code used to be `Unsupported instance type (code: 8149)`). I have confirmed with Engineering that this scenario will be patched in 2.5. All the jobs that have this error are:
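As a side note on this failure mode: if you want to confirm from the AWS side whether that volume is currently attached, and to which instance, something like the following should show it (the volume ID is the one from the log above; this only reflects the current state, not the history of the event):

```
# Show the current attachment(s) of the volume from the job above:
# instance ID, device, and attachment state.
aws ec2 describe-volumes --volume-ids vol-05d01b851d513220a --region us-east-1 \
    --query "Volumes[].Attachments[].[InstanceId,Device,State]" --output table
```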
The trend for this issue is that there is a rebalance recommendation (which comes from AWS's side) and, at the same time, the instance is interrupted. The checkpoint completes without error and the job waits for a new instance to initialize, but then the host is not found:
2024-02-24T00:48:04.545: Got rebalance recommendation notice on i-019a5f70b34156219
2024-02-24T00:48:04.545: i-019a5f70b34156219 is interrrupted, will recover
...
2024-02-24T00:52:09.973: Created instance i-0f0e98d2399e6908c(r5.4xlarge-Spot) at us-east-1c, waiting for it to initialize
2024-02-24T00:57:39.197: Job failed because Host not found (code: 8001)
Looking at the `opcenter.log`, it seems that the instance was registered but then terminated immediately:
time="2024-02-24T00:57:41.522" level=info msg="Registered one new host" addr="172.31.0.107:443" cpu=16 host=ip-172-31-0-107.ec2.internal inst=i-0f0e98d2399e6908c instType=r5.4xlarge mem=128
time="2024-02-24T00:57:43.088" level=warning msg="Failed to wait the instance to be terminated" error="Timed out (code: 1023)" inst=i-0395a6e8db6f2fa17
time="2024-02-24T00:57:43.863" level=info msg="Ready to reclaim host" inst=i-0f0e98d2399e6908c
The only job that has this error is:
The timestamp of this error coincides closely with the last Opcenter restart, which would lead to a communication issue and to the host not being found:
[admin]$ float status
id: 7cfc0037-3ac4-41fe-999d-da8d278c3108
Server Status: normal
API Request Status: normal
License Status: valid
Init Time: "2024-01-16T19:41:42Z"
Up Time: "2024-02-24T00:57:10Z"
Current Time: "2024-02-27T00:47:14Z"
The Up Time of 2024-02-24T00:57:10Z is less than half a minute before the Host not found failure at 00:57:39, so the restart coincides with the host not being found.
The error message is `Wave-riding failed, error: Host is incompatible (code: 8005)`. This is an error you have hit before: it says the `r7i` instance type is incompatible. These jobs were run on 2/20, which was before we got approval from AWS to support `r7*` instances. You can check the timestamp of my Slack message confirming this; these instances should be supported now.
2024-02-21T04:55:50.322: Failed to create instance to migrate. Error: Unsupported instance type (code: 8149)
2024-02-21T04:55:50.322: Failed to create host with r7i.xlarge in us-east-1b, error Unsupported instance type (code: 8149). Try other intsance types
...
2024-02-21T04:55:50.787: Found candidate instance type r7iz.xlarge, price 0.14740000665187836
2024-02-21T04:55:50.813: Determined instance params: Zone:us-east-1b,InstType:r7i.xlarge,CPU:4,Memory:32,OnDemand:false
2024-02-21T04:55:50.82: Wave-riding failed, error: Host is incompatible (code: 8005).
2024-02-21T05:00:02.907: Failed to attach vol-002cf477bf507c770 (/mnt/float-data) on host i-041bc9a3c72db0e2b at (us-east-1b), error: exit status 1
2024-02-21T05:00:02.907: Job status changed: [Floating -> FailToComplete]. Mark job done.
Such jobs include, but are not limited to:
Thanks @Ashley-Tung this is impressively thorough and well organized!
It feels like there might be an engineering solution to Error 2?
For Error 4, it would help if we could figure out why the opcenter was rebooted. I worry that it was rebooted because we submitted too many jobs? If that is the case, it would make us worry that submitting too many jobs will "crash" the opcenter.
For error 2, Engineering suggested raising `cloud.handleRebalanceMemThreshold` in your config from 64 to 128, which I have done for you. This setting controls how much memory a job must be using before a rebalance recommendation notice from AWS is acted on; the larger the value, the longer we delay handling the notice. When AWS is worried its resources are under pressure, you will see more rebalance recommendations, and since we can already handle floating in case of memory issues, those notices are often false alarms. This should make Error 2 appear less frequently, but the rebalance handling will still kick in if a job is using 128 GB of memory.
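For the record, the change corresponds to the following setting, using the usual `float config set <key> <value>` syntax (no action is needed on your side, since I have already applied it on your opcenter):

```
# Already applied on your opcenter; shown here only for reference.
# Raise the memory threshold for acting on AWS rebalance recommendations from 64 to 128 (GB).
float config set cloud.handleRebalanceMemThreshold 128
```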
For error 4, I am currently working with Engineering to figure out the reason for the reboot.
Update from Slack: the latest hotfix will address Error 4.
@Ashley-Tung we are rerunning some of the previously failed jobs, and have these errors so far:
Maybe these all belong to the issues that you summarized above that will be addressed in 2.5. But just in case some of those are new, could you take another look at it?
[Pasted from Slack]
Hi @gaow, I looked at the latest FailToComplete and NoAvailableHost jobs, and both have the same underlying issue: not being able to support the `r7iz.2xlarge` instance type.
There are about 50 NoAvailableHost jobs. One example is: https://23.22.157.8/#/opcenter/jobs/kq7yem28zt2z3rz6pmet4
The trend is: spot interruption -> new instance -> finds r7iz.2xlarge -> fails.
There is no information in the `job.events` log, but I found the following in the `opcenter.log`:
time="2024-02-28T19:13:30.824" level=warning msg="Failed to create instance" error="UnsupportedOperation: The instance configuration for this AWS Marketplace product is not supported. Please see the AWS Marketplace site for more information about supported instance types, regions, and operating systems.\n\tstatus code: 400, request id: f45f5482-3bbe-4067-91af-be6403262d3e"
time="2024-02-28T19:13:30.824" level=error msg="Failed to create cloud instance" error="Unsupported instance type (code: 8149)"
time="2024-02-28T19:13:30.824" level=info msg="Failed to create new instance because: Unsupported instance type (code: 8149)" job=kq7yem28zt2z3rz6pmet4
time="2024-02-28T19:13:30.824" level=info msg="Job completed, ready to reclaim job resource" exitCode=-1 id=b450993232e4d9162feac9e77694cc5783ba11529476f88c99fede3d9ffff69e job=kq7yem28zt2z3rz6pmet4 status=NoAvailableHost
time="2024-02-28T19:13:30.84" level=info msg="Instance is terminated" inst=i-09e4e25a97c21b6ee
...
time="2024-02-28T19:13:31.058" level=error msg="Failed to replace host" error="Unsupported instance type (code: 8149)" spec=i-00640eb1b299324d5
There are 14 FailToComplete jobs. The error comes from floating to the r7iz types. An example is here: https://23.22.157.8/#/opcenter/jobs/bx01uuq1kpg9iv6ussl92
2024-02-28T02:28:22.029: Determined instance params: Zone:us-east-1a,InstType:r7iz.2xlarge,CPU:8,Memory:64,OnDemand:false
2024-02-28T02:28:22.034: Wave-riding failed, error: Host is incompatible (code: 8005).
2024-02-28T02:32:42.761: Failed to attach vol-00e96d2f9561581c5 (/mnt/float-data) on host i-0435f2add7d8e8044 at (us-east-1a), error: exit status 1
2024-02-28T02:32:42.761: Job status changed: [Floating -> FailToComplete]. Mark job done.
2024-02-28T02:32:42.761: Failed to resume workload on original instance i-0435f2add7d8e8044, error: exit status 1. Job failed
While the two scenarios surface different errors, the underlying cause is the same. I have reached out to Engineering about this issue, and they informed me that while AWS has whitelisted the `r7*` types for us, `r7iz` is a fairly new instance type and we currently cannot checkpoint/restore on it. Our engineers are already aware of this and will upgrade our checkpoint/restore mechanism in 2.5.1.
For now, I recommend adding the r7iz types to the deny list. The option does not appear in the GUI (I can ask for a feature request on this), but it exists as the variable `provider.denyList`, so it can be set with:
float config set provider.denyList "r7iz*"
There are many FailToComplete jobs if you search on that status in our new opcenter. Here is an example:
https://23.22.157.8/#/log?jobId=uudnsay4x9amv240jna6y&logName=job.events
It says:
Again, for jobs that seem to fail because of `float` and not the user, it would be great if `float` could keep trying a bit more.