Closed Ashley-Tung closed 4 months ago
[UPDATE] Fix for this has been merged. Expect to see it in 2.5.3.
[UPDATE] The fix is still in testing, so it did not make it into the 2.5.3 update. We will let you know when it has been implemented.
[Update] The code is currently being reviewed and will likely be in 2.5.5. Another important note: as of today, the 23 and 54 east coast opcenters are still on 2.5.3.
[Update] Hi @gaow, we have confirmed a fix for this behavior in 2.5.5. If a job is interrupted in the middle of restoration, it will now restore on another instance. Let me know when you want to close this ticket. Thanks!
great!! Let's close it for now until it bothers us again (or never!)
For these two jobs:
4t5l6mskvolyybz2172bh
dxa7uphemnjai3l0i05j6
Both seem to have failed because restoration was interrupted by a spot reclamation.
For the first job, we can see it floating from 08c046fc8b004c29c to 0f4d875752b29b1cd, but the latter was interrupted before the restore completed.
It is rare for an instance to be reclaimed in the middle of restoration. The instances here are of type r5.16xlarge (64 cores, 512 GB); this larger instance type is in higher demand and thus more prone to interruption. It is possible that on the day this job ran, even fewer of these instances were available, or demand for them was higher.
However, because this situation is so rare, we do not currently have a solution for it. Engineering is working on one: if a situation like this happens, we will simply rerun the job.
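To illustrate the retry behavior described above, here is a minimal sketch in Python. It is not the actual product implementation: `SpotInterrupted`, `restore_with_retry`, the `restore` callable, and the `max_attempts` limit are all hypothetical names introduced for illustration. The idea is only that when a restore is interrupted by a reclamation, the job is tried again on the next candidate instance instead of failing.

```python
class SpotInterrupted(Exception):
    """Hypothetical error: the instance was reclaimed before the restore finished."""


def restore_with_retry(restore, instances, max_attempts=3):
    """Try to restore a job on successive instances until one succeeds.

    restore      -- hypothetical callable taking an instance id; it raises
                    SpotInterrupted if the instance is reclaimed mid-restore
    instances    -- iterable of candidate instance ids, tried in order
    max_attempts -- assumed upper bound on restore attempts

    Returns the id of the instance on which the restore completed.
    """
    attempts = 0
    for instance_id in instances:
        if attempts >= max_attempts:
            break
        attempts += 1
        try:
            restore(instance_id)
            return instance_id
        except SpotInterrupted:
            # Reclaimed mid-restore: move on to the next instance
            # instead of failing the job.
            continue
    raise RuntimeError(f"restore failed after {attempts} attempts")
```

With a `restore` callable that raises `SpotInterrupted` for the first instance only, the function returns the id of the second instance. If every attempt is interrupted, it raises `RuntimeError` once `max_attempts` is reached or the candidates run out.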