rfeng2023 / mmcloud


Instance interrupted in the middle of restoration #65

Closed Ashley-Tung closed 4 months ago

Ashley-Tung commented 5 months ago

For these two jobs:

4t5l6mskvolyybz2172bh

dxa7uphemnjai3l0i05j6

they seem to have failed due to restoration being interrupted by a spot reclamation.

In the case of the first job, we can see that we are floating from i-08c046fc8b004c29c to i-0f4d875752b29b1cd, but the latter was interrupted before the workload had fully restored on it.

2024-03-20T00:01:30.19: Got spot interruption notice on i-08c046fc8b004c29c
2024-03-20T00:01:30.19: i-08c046fc8b004c29c is interrrupted, will recover
2024-03-20T00:01:30.194: Job status changed: [Executing -> Floating]
2024-03-20T00:01:30.194: workload needs float-data volume to checkpoint on i-08c046fc8b004c29c
2024-03-20T00:01:30.197: Ready to recover workload of i-08c046fc8b004c29c
2024-03-20T00:01:30.37: Ready to create float data volume to checkpoint
2024-03-20T00:01:30.37: Ready to create volume with size: 518, path: /mnt/float-data, skip disk types :[]
2024-03-20T00:01:33.195: workload needs float-data volume to checkpoint on i-08c046fc8b004c29c
2024-03-20T00:01:36.196: workload needs float-data volume to checkpoint on i-08c046fc8b004c29c
2024-03-20T00:01:36.995: Created volume vol-062d5eab3fd9c50bb(io1), size: 518, throughput: 0, iops: 25900
2024-03-20T00:01:39.197: workload needs float-data volume to checkpoint on i-08c046fc8b004c29c
2024-03-20T00:01:40.829: Attached volume vol-062d5eab3fd9c50bb to /mnt/float-data on instance i-08c046fc8b004c29c
2024-03-20T00:01:40.829: Created float data volume vol-062d5eab3fd9c50bb
2024-03-20T00:03:35.144: Instance i-08c046fc8b004c29c is down
2024-03-20T00:03:35.203: Attempt to find snapshot for job to recover
2024-03-20T00:03:35.203: No snapshots found for job to recover
2024-03-20T00:08:59.035: Detached volume vol-062d5eab3fd9c50bb from i-08c046fc8b004c29c
2024-03-20T00:08:59.406: Detached volume vol-0766bb562cea458da from i-08c046fc8b004c29c
2024-03-20T00:08:59.406: Ready to create new instance to recover
2024-03-20T00:09:05.233: Created instance i-0f4d875752b29b1cd(r5.16xlarge-Spot) at us-east-1c, waiting for it to initialize
2024-03-20T00:10:57.356: Registered one new host i-0f4d875752b29b1cd
2024-03-20T00:11:01.262: i-0f4d875752b29b1cd initialized
2024-03-20T00:11:01.262: Mounted s3:::us-east-1:statfungen:/ftp_fgc_xqtl:/home/aw3600/data:readonly to i-0f4d875752b29b1cd
2024-03-20T00:11:01.262: Mounted s3:::us-east-1:statfungen:/ftp_fgc_xqtl/analysis_result/mash_preprocessing:/home/aw3600/output:readwrite to i-0f4d875752b29b1cd
2024-03-20T00:11:01.262: Mounted vol-0766bb562cea458da:/mnt/float-image to i-0f4d875752b29b1cd
2024-03-20T00:11:01.262: Mounted vol-062d5eab3fd9c50bb:/mnt/float-data to i-0f4d875752b29b1cd
2024-03-20T00:11:01.265: Created new host: i-0f4d875752b29b1cd(Spot)
2024-03-20T00:11:01.815: Got 1 containers on host i-0f4d875752b29b1cd
2024-03-20T00:11:01.815: Ready to recover {ID:9ec7455d2835,Checkpointed:true,Running:false} on host i-0f4d875752b29b1cd
2024-03-20T00:11:01.821: Job is submitted to instance i-0f4d875752b29b1cd successfully
2024-03-20T00:11:01.865: Requeue job 4t5l6mskvolyybz2172bh
2024-03-20T00:11:01.902: Found previous container 9ec7455d2835 with checkpoint, ready to resume
2024-03-20T00:11:21.476: Got spot interruption notice on i-0f4d875752b29b1cd
2024-03-20T00:11:21.476: i-0f4d875752b29b1cd is interrrupted, but latest job status is Floating, no need to recover
2024-03-20T00:12:12.063: Failed to wait for workload restored on i-0f4d875752b29b1cd, error: Failed to run job, host i-0f4d875752b29b1cd interrupted (code: 9099)
2024-03-20T00:12:12.069: Failed to create new instance to recover, error: Failed to run job, host i-0f4d875752b29b1cd interrupted (code: 9099)
2024-03-20T00:12:12.069: Job status changed: [Floating -> FailToComplete

It is rare for an instance to be reclaimed in the middle of restoration. Because these instances are r5.16xlarge (64 vCPU, 512 GB), this larger instance type is more in demand and thus more prone to interruption. It is possible that on the day these jobs ran, fewer of these instances were available or demand for them was higher than usual.

However, this situation is rare enough that we have not had a dedicated solution for it until now. Engineering is currently working on a change so that, when this happens, the job is simply rerun (see the sketch below).
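For illustration only, a minimal sketch of what such a retry policy could look like. This is not MMCloud's actual implementation; the names `recover_job`, `HostInterruptedError`, and the retry limit are hypothetical, and the real logic lives inside the OpCenter.

```python
# Hypothetical sketch of "rerun the job if the replacement instance is
# reclaimed before the restore finishes" -- illustrative only.

class HostInterruptedError(Exception):
    """Raised when the spot host is reclaimed mid-restore (cf. error code 9099)."""

def recover_job(job_id, create_instance, restore_workload, max_attempts=3):
    """Keep provisioning fresh instances until the restore completes,
    instead of marking the job FailToComplete on the first interruption."""
    for attempt in range(1, max_attempts + 1):
        host = create_instance(job_id)       # e.g. a new r5.16xlarge spot host
        try:
            restore_workload(job_id, host)   # resume from the float-data checkpoint
            return host                       # restore finished; job keeps executing
        except HostInterruptedError:
            # The replacement host was itself reclaimed mid-restore;
            # requeue the job and try again on yet another instance.
            print(f"attempt {attempt}: host {host} interrupted during restore, requeueing")
    raise RuntimeError(f"job {job_id} could not be restored after {max_attempts} attempts")
```

In the failure above, the equivalent of the `except` branch was missing: the second interruption left the job in Floating and it was moved straight to FailToComplete instead of being requeued onto a third instance.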

Ashley-Tung commented 5 months ago

[UPDATE] Fix for this has been merged. Expect to see it in 2.5.3.

Ashley-Tung commented 5 months ago

[UPDATE] The fix is currently in testing, so it did not make it into the 2.5.3 update. Will let you know when it has been implemented.

Ashley-Tung commented 4 months ago

[Update] The code is currently being reviewed and will likely be in 2.5.5. Another important note: as of today, the 23 and 54 east coast OpCenters are still on 2.5.3.

Ashley-Tung commented 4 months ago

[Update] Hi @gaow, we have confirmed a fix for this behavior in 2.5.5. The job will restore on another instance when interrupted in the middle of restoration. Let me know when you want to close this ticket. Thanks!

gaow commented 4 months ago

great!! Let's close it for now until it bothers us again (or never!)