ClusterScaleUpFailed for clusters with fixed size is confusing

YevheniiSemendiak commented 3 years ago

STR:

use the cluster of a fixed size (e.g. on-premise deployment)
all nodes of the desired preset are occupied
launch job and wait until the scheduling timeout reached

Actual result: the job is failed with the ClusterScaleUpFailed status, which is confusing (we even cannot scale up in on-prem).

Desired result: the failure status should be bind with the previous "pending" status, something like SchedulingFailed.

To make it less confusing, we could add some hints on what is the reason and what to check, like SchedulingFailed - not enough resources. As an alternative, in Slack thread we agreed to describe exit codes in docs.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 14 days

neuro-inc / platform-api

ClusterScaleUpFailed for clusters with fixed size is confusing #1711