rfeng2023 / mmcloud


No available host error #45

Closed gaow closed 8 months ago

gaow commented 8 months ago
2024-02-21T12:09:57.002: Instance i-0f0da944bdc762af7 is down
2024-02-21T12:09:57.103: Try to detach volumes from i-0f0da944bdc762af7
2024-02-21T12:09:57.103: Terminate i-0f0da944bdc762af7 actively as instance is alive
2024-02-21T12:10:03.599: Detached volume vol-021b2155a6acdb6cb from i-0f0da944bdc762af7
2024-02-21T12:10:46.655: Detached volume vol-07235f8f26bffe602 from i-0f0da944bdc762af7
2024-02-21T12:10:47.002: Detached volume vol-07cd23bca97805c51 from i-0f0da944bdc762af7
2024-02-21T12:11:11.446: i-0f0da944bdc762af7 has been terminated
2024-02-21T12:11:49.547: Detached volume vol-07cd23bca97805c51 from i-0f0da944bdc762af7
2024-02-21T12:11:49.808: Detached volume vol-021b2155a6acdb6cb from i-0f0da944bdc762af7
2024-02-21T12:11:50.083: Detached volume vol-07235f8f26bffe602 from i-0f0da944bdc762af7
2024-02-21T12:11:50.083: Ready to create new instance to recover
2024-02-21T12:11:50.574: Job status changed: [Floating -> NoAvailableHost]. Mark job done.
2024-02-21T12:11:50.574: Failed to create new instance because: Unsupported instance type (code: 8149)
2024-02-21T12:11:50.575: Job completed, ready to reclaim job resource
2024-02-21T12:11:50.647: Instance specification changed during the job, do not generate insights
2024-02-21T12:13:05.965: Ready to reclaim volume vol-07cd23bca97805c51
2024-02-21T12:13:43.572: Ready to reclaim volume vol-021b2155a6acdb6cb
2024-02-21T12:14:20.673: Ready to reclaim volume vol-07235f8f26bffe602

As discussed, for this type of error float should keep retrying until it succeeds, or fail only because of some other, more fatal error. This error alone should not be a reason for a job to get cancelled.
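
To make the requested behavior concrete, here is a minimal sketch of a retry loop that treats "no available host" as retryable and lets every other error fail immediately. This is not float's actual implementation: `NoAvailableHostError`, `create_with_retry`, the `create_instance` callback, and the delay values are all hypothetical names and numbers chosen for illustration.

```python
import time


class NoAvailableHostError(Exception):
    """Hypothetical error: no instance of the requested type could be created."""


def create_with_retry(create_instance, max_wait=None, base_delay=30.0,
                      max_delay=600.0, sleep=time.sleep):
    """Call create_instance() until it succeeds.

    Only NoAvailableHostError is retried. Any other exception is treated
    as fatal and propagates to the caller right away. With max_wait=None
    the loop retries indefinitely; otherwise it gives up once the total
    time slept reaches max_wait seconds.
    """
    delay = base_delay
    waited = 0.0
    while True:
        try:
            return create_instance()
        except NoAvailableHostError:
            if max_wait is not None and waited >= max_wait:
                raise  # retry budget exhausted, surface the error
            sleep(delay)
            waited += delay
            # Exponential backoff, capped so retries never get too sparse.
            delay = min(delay * 2, max_delay)
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting. The design point is the split between one retryable error and everything else, so that a temporary capacity or instance-type problem does not end a job that has already run for hours.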

gaow commented 8 months ago

At this point, there are 56 of them:

image

This is particularly annoying because this batch consists of long-running jobs on larger instances. They hit this NoAvailableHost error after 8 hours of running, without generating any output. That is a big waste of money.

Ashley-Tung commented 8 months ago

This is related to #43. r7* instances should be available now.