Just curious: does ray-llm fully leverage Ray Serve autoscaling (https://docs.ray.io/en/latest/serve/autoscaling-guide.html)?
It seems Ray Serve only supports `target_num_ongoing_requests_per_replica` and `max_concurrent_queries`. Since LLM output length varies widely, these knobs are not a good fit for LLM workloads. How do you achieve better autoscaling support for LLMs?
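For context, here is a minimal sketch of what I mean, using only those two knobs in a Ray Serve deployment (the deployment name and the specific values are just illustrative):

```python
from ray import serve


@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        # Scale replicas up/down to target ~2 in-flight requests each,
        # regardless of how long each LLM generation actually runs.
        "target_num_ongoing_requests_per_replica": 2,
    },
    # Hard cap on concurrent requests routed to a single replica.
    max_concurrent_queries=16,
)
class LLMServer:
    async def __call__(self, request):
        ...
```

Both knobs count requests, not tokens, so a replica handling a few very long generations looks the same to the autoscaler as one handling short ones.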