Closed gaow closed 8 months ago
At this point, there are 56 of them:
This is particularly annoying because this batch consists of long-running jobs on larger instances. They hit this NoHostAvailable error after 8 hours of running, without generating any output. That is a big waste of money.
This is related to #43. r7* instances should be available now.
As discussed, for this type of error `float` should keep retrying until it works, or fail due to some other, more fatal error. This should not be the reason that a job gets cancelled.
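To make the requested behaviour concrete, here is a minimal sketch of a retry loop that treats a capacity error as transient and lets anything else fail immediately. It is not how `float` is implemented; `NoHostAvailable`, `FatalJobError`, `submit_with_retry`, and the backoff values are all hypothetical names and numbers chosen for illustration.

```python
import time


class NoHostAvailable(Exception):
    """Hypothetical stand-in for a transient 'no capacity' error."""


class FatalJobError(Exception):
    """Hypothetical stand-in for an error that retrying cannot fix."""


def submit_with_retry(submit, base_delay=1.0, max_delay=300.0, sleep=time.sleep):
    """Call submit() until it succeeds, retrying only on NoHostAvailable.

    Any other exception (for example FatalJobError) propagates to the
    caller right away. Returns a (result, attempts) tuple.
    """
    delay = base_delay
    attempts = 0
    while True:
        attempts += 1
        try:
            return submit(), attempts
        except NoHostAvailable:
            # Transient capacity problem: wait, then try again with a
            # longer delay, capped at max_delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The point of the design is that a lack of capacity never cancels the job on its own: the loop only ends on success or on an error that is not `NoHostAvailable`. The `sleep` parameter is injectable so the loop can be exercised without actually waiting.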