rfeng2023 / mmcloud


Connection time out #60

Closed: gaow closed this issue 5 months ago

gaow commented 6 months ago

@Ashley-Tung in our latest run:

https://23.22.157.8/#/log?jobId=99g0zwlb2638n1knjmxpc&logName=job.events

Ashley-Tung commented 6 months ago

This is the relevant output from job.events:

We float from an r5.xlarge to an r5.2xlarge due to high memory usage:

2024-03-11T06:22:20.186: Job status changed: [Executing -> Floating]
2024-03-11T06:22:20.186: Wave-riding from CPU 4 -> 8, mem 32 -> 64.
2024-03-11T06:25:29.841: workload on 172.31.42.83 are checkpointed at 2024-03-11T06:25:29.834Z on i-0602715cc0d4142b3
2024-03-11T06:25:29.975: Job is being migrating/optimizing now, check whether event can be handled
2024-03-11T06:25:29.975: No need to handle interruption as job is floating to new host
2024-03-11T06:25:29.975: Mark host status directly to skip checkpoint
2024-03-11T06:25:37.896: Created instance i-0a5c7e8363558ac6c(r5.2xlarge-Spot) at us-east-1b, waiting for it to initialize
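For reference, that jump is one size step within the r5 family (r5.xlarge is 4 vCPU / 32 GiB, r5.2xlarge is 8 vCPU / 64 GiB). A minimal sketch of that size step, purely as an illustration of the log above and not MMCloud's actual wave-riding logic:

```python
# Illustrative only: the r5-family step seen in the log above
# (r5.xlarge: 4 vCPU / 32 GiB -> r5.2xlarge: 8 vCPU / 64 GiB).
R5_SIZES = [
    ("r5.large",    2,  16),
    ("r5.xlarge",   4,  32),
    ("r5.2xlarge",  8,  64),
    ("r5.4xlarge", 16, 128),
]

def next_size_up(current: str) -> str:
    """Return the next larger r5 instance type (doubles vCPU and memory)."""
    names = [name for name, _, _ in R5_SIZES]
    i = names.index(current)
    if i + 1 >= len(names):
        raise ValueError(f"{current} is already the largest size in this table")
    return names[i + 1]

print(next_size_up("r5.xlarge"))  # -> r5.2xlarge (CPU 4 -> 8, mem 32 -> 64)
```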

Workload suspended (TODO: confirm its significance with engineering)

Skip checkpointing as host status is workload_suspended, vm-state is 1

Failed to mount data volume

2024-03-11T06:33:42.881: Failed to attach vol-09298f6013808e71f (/mnt/float-data) on host i-0a5c7e8363558ac6c at (us-east-1b), error: exit status 1
2024-03-11T06:33:42.881: Failed to migrate to i-0a5c7e8363558ac6c. Error: exit status 1. Reclaim new host.
2024-03-11T06:34:16.46: Failed to migrate: exit status 1. Try to resume workload on original instance i-0602715cc0d4142b3.
2024-03-11T06:34:20.183: Attached vol-019fabe2beb977a87 (/mnt/float-image) on host i-0602715cc0d4142b3 at us-east-1b
2024-03-11T06:34:23.346: Attached vol-067bb9d55ad86d9cb (/home/bst2126/input) on host i-0602715cc0d4142b3 at us-east-1b
2024-03-11T06:40:52.067: Failed to attach vol-09298f6013808e71f (/mnt/float-data) on host i-0602715cc0d4142b3 at (us-east-1b), error: exit status 1
2024-03-11T06:40:52.067: Job status changed: [Floating -> FailToComplete]. Mark job done.
2024-03-11T06:40:52.068: Failed to resume workload on original instance i-0602715cc0d4142b3, error: exit status 1. Job failed
2024-03-11T06:40:52.158: Wave-riding failed, error: exit status 1.
2024-03-11T06:40:54.411: Ready to reclaim host i-0602715cc0d4142b3
2024-03-11T06:40:58.656: Detached volume vol-09298f6013808e71f from i-0602715cc0d4142b3
2024-03-11T06:41:28.387: Ready to reclaim volume vol-09298f6013808e71f
2024-03-11T06:41:28.535: Ready to reclaim volume vol-019fabe2beb977a87
2024-03-11T06:41:28.648: Ready to reclaim volume vol-067bb9d55ad86d9cb
2024-03-11T06:42:06.139: Detached volume vol-019fabe2beb977a87 from i-0602715cc0d4142b3
2024-03-11T06:42:06.445: Detached volume vol-067bb9d55ad86d9cb from i-0602715cc0d4142b3
2024-03-11T06:43:06.45: No container found on host i-0602715cc0d4142b3, error Get "https://172.31.42.83:443/api/v1/containers": context deadline exceeded (Client.Timeout exceeded while awaiting headers), rerun the job
2024-03-11T06:45:18.564: Failed to resume workload on original instance i-0602715cc0d4142b3, error: Post "https://172.31.42.83:443/api/v1/jobs": dial tcp 172.31.42.83:443: connect: connection timed out. Job failed
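For anyone reproducing the attach failure by hand, here is a small boto3 sketch (not MMCloud internals; the device name /dev/sdf is assumed) that checks the state of the data volume from the log before attempting the attach:

```python
# Hedged diagnostic sketch (boto3), not MMCloud's internal code: inspect the
# data volume's state before attaching it, the step that fails above with
# "exit status 1".
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")
volume_id = "vol-09298f6013808e71f"    # data volume from the log
instance_id = "i-0602715cc0d4142b3"    # original host from the log
device = "/dev/sdf"                    # assumed device name, for illustration

vol = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
print("volume state:", vol["State"], "attachments:", vol.get("Attachments", []))

# Attaching only succeeds when the volume is 'available'; if it is still
# 'in-use' on another instance, the call fails, much like the migration path above.
if vol["State"] == "available":
    try:
        ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    except ClientError as err:
        print("attach failed:", err.response["Error"]["Code"])
```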
gaow commented 6 months ago

Recent failures all have to do with floating. Previously, we configured our jobs not to be stingy with the specified memory so as to avoid floating. I now see that when many jobs try to float at the same time, issues seem more likely. It looks like float needs more stress testing in this situation.

Ashley-Tung commented 6 months ago

In terms of float testing, we have previously stress tested with 512 float-heavy workloads at once, which should cover the intensity of the workloads here. I have already submitted a ticket for Engineering to take a look at this.

Ashley-Tung commented 6 months ago

[UPDATE] A PR addressing this has been merged. Originally, the data volume attempted to attach to a new host that became inaccessible; it then attempted to fall back to the original host while the attach to the new host was still in progress. This was a timing issue and it has been addressed. Waiting on confirmation from engineering on whether this will be in 2.5.1.
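In spirit, the fix is about ordering the two attach paths. A hedged boto3 sketch (an illustration of that ordering, not the merged PR) that waits for the data volume to be fully detached before re-attaching it to the original host; the device name is assumed:

```python
# Illustration of the timing issue described above (not the merged PR itself):
# before falling back to the original host, wait for the data volume to finish
# detaching so the re-attach does not race the in-flight attach to the new host.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
volume_id = "vol-09298f6013808e71f"    # data volume from the log
original_host = "i-0602715cc0d4142b3"  # original instance from the log

# Block until AWS reports the volume as 'available' (detached everywhere).
ec2.get_waiter("volume_available").wait(
    VolumeIds=[volume_id],
    WaiterConfig={"Delay": 5, "MaxAttempts": 60},  # poll every 5 s, up to 5 min
)
ec2.attach_volume(VolumeId=volume_id, InstanceId=original_host, Device="/dev/sdf")
```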

Ashley-Tung commented 5 months ago

[UPDATE] The PR is now expected to be a part of 2.5.2

Ashley-Tung commented 5 months ago

Hi @gaow , we have a fix for this in 2.5.2. The root cause of this issue was that when we sent a request to the AWS API, we got a timeout error from their side. This is not necessarily an issue with MMC, but we have added more retries should this happen again.
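For context, this kind of retry behavior can also be layered on the client side of AWS API calls. A sketch of a retry-enabled boto3 client (illustrative only, not MMC's implementation):

```python
# Sketch of client-side retries for AWS API timeouts (illustrative; not MMC's code).
import boto3
from botocore.config import Config

retry_config = Config(
    connect_timeout=10,
    read_timeout=60,
    retries={"max_attempts": 10, "mode": "adaptive"},  # back off and retry transient errors
)
ec2 = boto3.client("ec2", region_name="us-east-1", config=retry_config)

# Calls made through this client (e.g. attach_volume) are retried automatically
# when AWS returns throttling or transient timeout errors.
```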

As of now, both opcenters are on 2.5.1. CC'ing @rfeng2023 about this as well.

Ashley-Tung commented 5 months ago

@gaow, can you close this? This was fixed in 2.5.2.

rfeng2023 commented 2 months ago

Hi @Ashley-Tung , I think this error showed up again in v2.5.6.

Job w0o1bmyqarwg0e2tjnsyb on opcenter 44.222.241.133:
https://44.222.241.133/#/opcenter/jobs/w0o1bmyqarwg0e2tjnsyb

2024-07-08T19:11:55.63: Failed to poll job status from i-0729733e0191760a0, error Get "https://10.1.22.122:443/api/v1/jobs": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
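One quick check is whether the agent endpoint from that error is reachable at all; a sketch that polls it with a short per-request timeout and backoff using requests (run from inside the VPC; the self-signed-certificate assumption is only for illustration, this is not MMC's poller):

```python
# Minimal reproduction sketch of the failing poll: retry the agent's job-status
# endpoint a few times to see whether the host is reachable or consistently times out.
import time
import requests

url = "https://10.1.22.122:443/api/v1/jobs"  # endpoint from the error message

for attempt in range(5):
    try:
        # verify=False assumes an internal, self-signed agent certificate (illustration only)
        resp = requests.get(url, timeout=10, verify=False)
        print("status:", resp.status_code)
        break
    except requests.exceptions.RequestException as err:
        print(f"attempt {attempt + 1} failed: {err}")
        time.sleep(2 ** attempt)  # simple exponential backoff
```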

Ashley-Tung commented 2 months ago

Hi @rfeng2023 ,

The original cause of the issue was the AWS API sending us a timeout error from their side. This is not necessarily an issue with MMC, and more retries should already be in place. Could you resubmit this job, since it only ran for 30 minutes, and let's see if the error persists? cc @yiweizh-memverge