Open rhuss opened 2 years ago
For reference, I found this very interesting blog post, which points to this presentation on Scaling Kubernetes to Support 50,000 Services. It uses IPVS. I have no idea whether IPVS is supported by Kubernetes out of the box.
I think it is (though I don't have any experience with it): https://kubernetes.io/docs/concepts/services-networking/service/#proxy-mode-ipvs
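For context, IPVS mode is selected via the kube-proxy configuration rather than per-Service. A minimal sketch (assuming a cluster where kube-proxy reads a `KubeProxyConfiguration`; field values here are illustrative):

```yaml
# Sketch: switch kube-proxy from the default "iptables" mode to "ipvs".
# Requires the IPVS kernel modules to be loaded on every node.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"  # round-robin; other schedulers (lc, sh, ...) are available
```

IPVS uses hash tables instead of a linear iptables rule chain, which is why it scales better with large Service counts.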
This is a significant limitation for the Knative serverless use case. Consider that in a serverless environment only ~5% of registered services may be active at any time: you end up with 95% of the cluster's service capacity consumed by idle services. I would love to hear how others experience scale with Knative serverless.
In some environments, clusters are initialized with a limited Service CIDR in addition to the iptables limit, so the number of available Services is even smaller than 10000. If we supported headless Services, we could get rid of this limitation. I propose offering a configuration option to turn on headless mode so that no cluster IP is assigned to the service.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale
Kubernetes Services that is based on the maximum number of iptable entries on a node
If this is accurate, then I think for our private k8s Service we can set ClusterIP: None. That private service is only used for endpoint collection/tracking.
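The idea can be sketched as a manifest (hypothetical names; the real private Services are generated by the Serving controller, so this is only an illustration of the `clusterIP: None` setting):

```yaml
# Sketch of a headless "private" Service: clusterIP: None means no
# virtual IP is allocated and no iptables/IPVS rules are programmed,
# but the Service still produces Endpoints/EndpointSlices for its pods.
apiVersion: v1
kind: Service
metadata:
  name: hello-00001-private   # hypothetical revision-private Service name
spec:
  clusterIP: None             # headless: endpoint collection only
  selector:
    serving.knative.dev/revision: hello-00001   # hypothetical selector
  ports:
    - name: http
      port: 80
      targetPort: 8080
```

Because nothing routes through a virtual IP, a headless Service consumes neither a Service CIDR address nor per-node routing entries.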
It's worth exploring in v1.12
So it could be that you have an empty cluster (everything scaled to zero), but your routing tables are still exhausted?
Network programming is the slowest part, so it's a trade-off between keeping these resources around and cold-start latency.
Another optimization could be to spin down child resources for Revisions that are not reachable.
this is also interesting https://render.com/blog/knative
It seems that Render made some fixes themselves and removed the services. I am just wondering whether that could be supported somehow for big Knative clusters. We have a similar situation: we always use load balancers (with an ingress controller in front of them), so at least 2 of the services are useless for us.
it seems that render did some fixes themselves and removed the services
Render were able to do this because their free tier only runs one pod per Knative Service - which is per tenant.
The private Services are a means to collect the endpoints of Revision pods, which we then use in the public Service. The public Service is also where we wire in the activator when the pods scale to zero or you need extra burst capacity.
Alternatively, Knative could do the endpoint collection itself, but then we're copying Kubernetes behaviour - unsure if we want to go down that path unless there's a lot of benefit.
I think dropping the ClusterIP would help a lot - someone just needs to open a PR and test it out. @zetaab are there other concerns with the private Knative Service that you have?
If the backend has only a single endpoint, or the client side can handle the load balancing, maybe the ClusterIP of the public Service can be dropped, as Render did. What about making the private Service's ClusterIP None and adding an option for the public Service to be headless?
The public service is already headless and we manage those endpoints ourselves.
What about making the private service ClusterIP to None
I was suggesting this here - https://github.com/knative/serving/issues/13201#issuecomment-1680608428
Someone just needs to do that work :)
@dprotaso If headless means manually managing the endpoints, then I think it is. But the ClusterIP of the public Service is not None, and it will still consume a service IP, if I understand correctly.
But the ClusterIP of public service is not None, and it will consume a service ip if I understand correctly.
Yeah, I was referring to setting the ClusterIP to None for the private service.
@rhuss see @izabelacg's change - we disabled the clusterip on the private service. I believe for now it shouldn't be consuming any iptables entries. Can you confirm?
With that change we technically are just using the k8s service for endpoint collection only.
Would it be a desirable goal to reduce the number of Kubernetes Services attached to a Knative Service (e.g. down to 1:1)?
I don't think so - k8s services are considered the 'frontend' for routing so we'll always need one per revision.
We could remove the private service and do the collection ourselves - but that adds a ton of complexity to Serving and I'm not sure it's worth it at the moment.
Thanks a lot; that is definitely a vast improvement. Is there a way to indicate that some Revisions should not be routable? E.g. if you don't leverage a traffic split, or don't want to allow access to an older Revision (e.g. when it contains bugs that are resolved by newer Revisions). If so, I think that would be perfect, because then we can always argue: if you "just" want autoscaling without any revisioning, you consume as many Services as without autoscaling (i.e. when using a vanilla K8s Deployment & Service).
Note - we had to revert the ClusterIP changes because they broke the autoscaler's pod/cluster IP scraping.
According to the Kubernetes Scalability thresholds, there is an upper limit on the number of Kubernetes Services, based on the maximum number of iptables entries on a node. Currently, this limit is 10000 services (if I understand correctly, this is independent of the number of nodes in the cluster, since every node needs to hold the same iptables routing rules).
Since every Knative Service translates to a minimum of 3 Kubernetes Services (1 ExternalName Service pointing to the ingress gateway, plus 2 Services for each Revision (public/private)), the theoretical maximum of Knative Services in a cluster would be ~3333 (and much less when using multiple Revisions and/or running other workloads on the cluster besides Knative).
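The back-of-the-envelope math above can be checked with a few lines (this assumes the 10000-Service threshold and 3 Kubernetes Services per single-revision Knative Service; actual limits vary by environment):

```python
# Rough capacity estimate: how many Knative Services fit under the
# cluster-wide Kubernetes Service threshold.
MAX_K8S_SERVICES = 10_000      # scalability threshold from the k8s docs
K8S_SERVICES_PER_KSVC = 3      # 1 ExternalName + public + private (1 revision)

max_knative_services = MAX_K8S_SERVICES // K8S_SERVICES_PER_KSVC
print(max_knative_services)    # 3333 with a single revision each

# Each retained revision adds 2 more Services (public + private):
def max_ksvc(revisions_per_service: int) -> int:
    per_ksvc = 1 + 2 * revisions_per_service
    return MAX_K8S_SERVICES // per_ksvc

print(max_ksvc(1))  # 3333
print(max_ksvc(3))  # 1428
```

This also shows why keeping multiple routable revisions around eats into the budget so quickly.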
My questions would be: