yihong1120 opened this issue 11 months ago
Appreciate the kind words.
Auto-Scaling Practices: Could you elucidate on the auto-scaling strategies that one might employ with the current setup? Specifically, I am curious about the implementation of Horizontal Pod Autoscaling (HPA) and whether there are any recommended thresholds or metrics that we should monitor to trigger scaling events.
This is contextual. At its core, I think it should depend on the number of requests served per second and the underlying model. You would want to gauge how much compute and memory a single request consumes for your model, and then devise your utilization thresholds accordingly for scaling up.
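To make this concrete, here is a minimal sketch of what a Horizontal Pod Autoscaler manifest could look like for the inference Deployment. The Deployment name, replica bounds, and the 70% CPU target are illustrative assumptions rather than values from this repository; you'd tune them based on the per-request benchmarking described above.

```yaml
# Hypothetical HPA for the inference Deployment; names and thresholds are
# placeholders, not taken from the repository's manifests.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi-deployment   # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out once average CPU crosses ~70%
```

Once you know which signal correlates best with saturation for your model (CPU, request rate, latency), you could also scale on custom or external metrics via a metrics adapter instead of plain CPU utilization.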
Load Balancing Considerations: With the deployment leveraging a LoadBalancer service type, how does the current configuration ensure even distribution of traffic amongst the pods, especially during a scaling event? Are there any particular load balancing algorithms or configurations that you would recommend?
We didn't dig deeper here as the idea was to provide a simple yet robust workflow to deploy models. But services like Vertex AI automatically take care of this for you.
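For reference, a LoadBalancer Service sketch could look like the following; the names, selector, and ports are assumptions for illustration. On GKE, the provisioned load balancer forwards traffic to the cluster, and kube-proxy spreads connections across the ready Pods matched by the selector, so newly scaled-up Pods start receiving traffic once they pass their readiness checks.

```yaml
# Illustrative Service of type LoadBalancer; selector and port values are
# placeholders, not the repository's actual configuration.
apiVersion: v1
kind: Service
metadata:
  name: fastapi-service
spec:
  type: LoadBalancer
  selector:
    app: fastapi-server        # must match the Pod labels of the Deployment
  ports:
    - port: 80                 # external port exposed by the load balancer
      targetPort: 8000         # port the FastAPI container listens on
```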
Resource Quotas and Limits: In the context of Kubernetes, setting appropriate resource quotas and limits is crucial to prevent any single service from monopolising cluster resources. Could you provide guidance on setting these parameters in a way that balances resource utilisation and availability, particularly for machine learning inference services that may have variable resource demands?
I think my answer to your first point should help here.
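As a rough sketch, per-container requests/limits (and optionally a namespace-level ResourceQuota) could look like the following. Every number here is a placeholder to be derived from load testing your specific model, not a value from the repo.

```yaml
# Hypothetical namespace-level quota plus per-container requests/limits;
# all figures are illustrative and should be tuned from benchmarking.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-quota
spec:
  hard:
    requests.cpu: "8"          # total CPU the namespace may request
    requests.memory: 16Gi
    limits.cpu: "16"           # total CPU the namespace may burst to
    limits.memory: 32Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fastapi-deployment     # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: fastapi-server
  template:
    metadata:
      labels:
        app: fastapi-server
    spec:
      containers:
        - name: fastapi-server
          image: gcr.io/your-project/fastapi-inference:latest  # placeholder image
          resources:
            requests:          # reserved per Pod; used for scheduling and HPA math
              cpu: "500m"
              memory: 1Gi
            limits:            # hard ceiling enforced at runtime
              cpu: "2"
              memory: 4Gi
```

Setting requests close to the typical per-request footprint (times expected concurrency per Pod) and limits at a safe ceiling keeps scheduling predictable without letting one service monopolise the cluster.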
Node Pool Management: The deployment utilises a cluster with a fixed number of nodes. In a production scenario, how would you approach the management of node pools to accommodate the scaling of pods? Is there a strategy in place to scale the node pool itself, and if so, what are the considerations for such a strategy?
Similar to the first point. The better you can gauge the expected traffic, the better off you will be. Then there are geographic considerations too: is your traffic fairly evenly distributed across the globe, or is it concentrated in particular regions? Based on that, you'd want to scale your pods up or down and configure your load balancer accordingly.
Cost Management: Lastly, could you share any insights on managing costs associated with running such a deployment on GKE? Are there any best practices or tools that you would recommend for monitoring and optimising the costs of the compute resources utilised by the Kubernetes cluster?
For this, I would recommend checking the official GCP guides. However, these days, you'd likely want to manage these deployments using a dedicated service like Vertex AI.
Cc: @deep-diver too.
Dear Sayak, Chansung, and Contributors,
First and foremost, I would like to extend my gratitude for the comprehensive guide on deploying machine learning models with FastAPI, Docker, and Kubernetes. The repository serves as an invaluable resource for practitioners aiming to operationalise their machine learning workflows in a cloud-native environment.
Upon perusing your documentation and workflow configurations, I have gathered substantial insights into the deployment process. However, I am particularly interested in understanding the scalability aspects of the deployment strategy in greater detail. As we are aware, machine learning workloads can be quite erratic in terms of resource consumption, and the ability to scale efficiently is paramount to maintaining performance and cost-effectiveness.
I am keen to learn about the following:
Auto-Scaling Practices: Could you elucidate on the auto-scaling strategies that one might employ with the current setup? Specifically, I am curious about the implementation of Horizontal Pod Autoscaling (HPA) and whether there are any recommended thresholds or metrics that we should monitor to trigger scaling events.
Load Balancing Considerations: With the deployment leveraging a LoadBalancer service type, how does the current configuration ensure even distribution of traffic amongst the pods, especially during a scaling event? Are there any particular load balancing algorithms or configurations that you would recommend?
Resource Quotas and Limits: In the context of Kubernetes, setting appropriate resource quotas and limits is crucial to prevent any single service from monopolising cluster resources. Could you provide guidance on setting these parameters in a way that balances resource utilisation and availability, particularly for machine learning inference services that may have variable resource demands?
Node Pool Management: The deployment utilises a cluster with a fixed number of nodes. In a production scenario, how would you approach the management of node pools to accommodate the scaling of pods? Is there a strategy in place to scale the node pool itself, and if so, what are the considerations for such a strategy?
Cost Management: Lastly, could you share any insights on managing costs associated with running such a deployment on GKE? Are there any best practices or tools that you would recommend for monitoring and optimising the costs of the compute resources utilised by the Kubernetes cluster?
I believe that addressing these queries would greatly benefit the community, providing a deeper understanding of how to manage and scale machine learning deployments effectively in Kubernetes.
Thank you for your time and consideration. I eagerly await your response and any further discussions this might engender.
Best regards, yihong1120