[ENHANCEMENT] Improve Error Messaging for Insufficient Resources in Aana Deploy

mobiusml / aana_sdk

Aana SDK is a powerful framework for building AI enabled multimodal applications.

Apache License 2.0

Enhancement Description

Currently, when an application is deployed with Aana and the cluster has insufficient resources, the deployment process does not fail; it only logs a warning such as:

WARNING 2024-06-25 10:07:09,531 controller 43559 deployment_state.py:2147 - Deployment 'WhisperDeployment' in application 'whisper_deployment_medium' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 1.0, "GPU": 0.25}, total resources available: {"CPU": 64.0}. Use 'ray status' for more details.

This warning is not prominently displayed, so users can end up waiting indefinitely without knowing what the issue is. This enhancement aims to improve the visibility and clarity of these error messages, especially during development, when resource availability issues are more likely to occur.

Advantages

Clearer error messages will help developers quickly identify and resolve resource-related issues. Immediate feedback on resource constraints will streamline the development process, making it more efficient and less frustrating.

Possible Implementation

Implement a mechanism to detect insufficient resources and trigger an immediate error message rather than a hidden warning.

Available methods:

Use ray.available_resources(). It returns the currently available resources summed over the whole cluster, and those resources can be spread across multiple nodes. For example, 1 available GPU can be the sum of 10 GPU servers with 0.1 GPU free each, in which case a new model that needs a full GPU still cannot be deployed on the cluster.

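Use ray.cluster_resources(). It returns the total resources of the cluster rather than what is currently free, and it is also summed over all nodes, so it has the same limitation as ray.available_resources().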

Use ray.nodes()[0]['Resources']. ray.nodes() returns one entry per node, and the 'Resources' field of each entry holds the resources of that node, so we can check node by node whether enough resources are available for the new deployment.
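A minimal sketch of the per-node check described above. The helper names (find_node_that_fits, check_deployment_fits) are hypothetical and not part of Aana or Ray; the only assumption about Ray is that ray.nodes() returns a list of dicts with "NodeID", "Alive" and "Resources" keys. As far as I know, "Resources" reports the capacity of each node rather than what is currently free, so this would only catch requests that can never fit on any single node.

```python
from typing import Dict, List, Optional


def find_node_that_fits(
    required: Dict[str, float], nodes: List[dict]
) -> Optional[str]:
    """Return the NodeID of the first alive node that covers `required`.

    `nodes` is assumed to have the shape returned by ray.nodes().
    """
    for node in nodes:
        if not node.get("Alive", False):
            continue  # dead nodes cannot host a replica
        resources = node.get("Resources", {})
        # Every requested resource must be satisfied by this one node.
        if all(resources.get(name, 0.0) >= amount
               for name, amount in required.items()):
            return node.get("NodeID")
    return None


def check_deployment_fits(required: Dict[str, float], nodes: List[dict]) -> None:
    """Raise right away instead of waiting for the scheduler warning."""
    if find_node_that_fits(required, nodes) is None:
        per_node = [n.get("Resources", {}) for n in nodes if n.get("Alive", False)]
        raise RuntimeError(
            f"No single node can satisfy {required}; "
            f"per-node resources: {per_node}"
        )


# Usage against a live cluster (requires ray to be installed and initialised):
#   import ray
#   ray.init()
#   check_deployment_fits({"CPU": 1.0, "GPU": 0.25}, ray.nodes())
```

Keeping the check as a pure function over the node list means it can be exercised without a running cluster, and it could be called before the deployment is submitted.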

To make the logs clearer:

We can set the log level to ERROR.

ray-tune has an option (--log-color) which is not available for ray.

In ray init we can specify the log_format, but it only changes the string format of the logs and not the color.

It seems we can override the ray logger, but because each worker sends its logs to the head, they are still printed in black.

(ServeController pid=2236767) WARNING 2024-08-01 00:23:29,396 controller 2236767 deployment_state.py:2147 - Deployment 'StableDiffusion2Deployment' in application 'image_generation_deployment' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 1.0, "GPU": 2.0}, total resources available: {"CPU": 19.0, "GPU": 1.0}. Use `ray status` for more details.
(autoscaler +18m58s) Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 2.0}. Add suitable node types to this cluster to resolve this issue.
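Since the warning above already contains both the required and the available resources, one option is to parse it and turn it into an explicit error. The sketch below only assumes the message format shown in the two warnings quoted in this issue; explain_scheduling_warning is a hypothetical helper, and how it would be hooked into the Ray log stream is left open.

```python
import json
import re
from typing import Optional

# Matches the ServeController warning format quoted in this issue.
_PATTERN = re.compile(
    r"Deployment '(?P<deployment>[^']+)' in application '(?P<application>[^']+)'"
    r".*?Resources required for each replica: (?P<required>\{.*?\}), "
    r"total resources available: (?P<available>\{.*?\})"
)


def explain_scheduling_warning(line: str) -> Optional[str]:
    """Return a short error message if the line reports missing resources."""
    match = _PATTERN.search(line)
    if match is None:
        return None  # not a scheduling warning
    required = json.loads(match["required"])
    available = json.loads(match["available"])
    # A resource is insufficient if the cluster total is below one replica's need.
    missing = sorted(
        name for name, amount in required.items()
        if available.get(name, 0.0) < amount
    )
    if not missing:
        return None  # totals look sufficient; may be fragmentation or autoscaling
    return (
        f"ERROR: deployment '{match['deployment']}' in application "
        f"'{match['application']}' cannot be scheduled: each replica requires "
        f"{required}, but only {available} is available "
        f"(insufficient: {', '.join(missing)})."
    )
```

Note that this only compares against the cluster-wide totals printed in the warning, so it would not detect the case where the resources exist in aggregate but are split across nodes.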
