knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Istio sidecars flood CoreDNS because of ExternalName services with ports that Knative creates for inference services #12917

Open dimara opened 2 years ago

dimara commented 2 years ago

What version of Knative?

0.23.3

Summary

Knative creates ExternalName services for each inference service to redirect traffic to the Istio IngressGateway. For each such service, every Istio sidecar tries to resolve the specified DNS target every 5 seconds. As a result, CoreDNS gets flooded. This is even worse on EKS, where you have ndots: 5 and ec2.internal in the search domains, so each DNS query turns into five, one of which gets forwarded to AWS nameservers outside the cluster.
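
To illustrate the ndots effect: on EKS (us-east-1) a pod's /etc/resolv.conf typically looks roughly like the sketch below (the nameserver IP and the namespace in the search list are illustrative, not taken from our cluster). The gateway hostname has only four dots, so with ndots: 5 the resolver appends each of the four search suffixes first (all NXDOMAIN, with the ec2.internal one leaving the cluster) before the absolute name finally resolves.

# illustrative /etc/resolv.conf inside a pod on EKS (us-east-1)
nameserver 172.20.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5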

Steps to Reproduce the Problem

  1. We use Knative Serving 0.23.3 with Istio 1.9.6 and KFServing 0.6.1.
  2. We create an InferenceService with a transformer and a predictor.
  3. KFServing creates an ExternalName service without a ports configuration. In our setup it points to knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local.
  4. Knative creates two ExternalName services with a ports configuration (see https://github.com/knative/serving/commit/09986741f0bc6e369ed99370728ad715054656d5, and the example manifest after this list). In our setup they point to knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local.
  5. Istio creates a STRICT_DNS cluster only for ExternalName services with a ports configuration (see also https://github.com/istio/istio/issues/23463, https://github.com/istio/istio/issues/37331).
  6. Every Istio sidecar (Envoy) running in the cluster tries to resolve the DNS target of each STRICT_DNS cluster every 5 seconds (see https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/service_discovery#strict-dns). In our setup this target is knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local.
  7. On EKS, because of ndots: 5, this results in 5 DNS requests (4 NXDOMAIN, one for each search domain, and one NOERROR) (see also https://discuss.istio.io/t/flood-of-nxdomain-lookups-to-coredns-from-istio-sidecar/11588).
  8. On EKS, because the last search domain is ec2.internal (on us-east-1), one of the above DNS requests is forwarded to AWS nameservers, which respond with NXDOMAIN.
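
For reference, the ExternalName services from step 4 look roughly like the sketch below (the service name and namespace are made up, and the exact port list may differ); the important part is type: ExternalName combined with a ports: section, which is what makes Istio build a STRICT_DNS cluster for it:

apiVersion: v1
kind: Service
metadata:
  name: my-model-predictor        # illustrative name
  namespace: my-namespace         # illustrative namespace
spec:
  type: ExternalName
  externalName: knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local
  ports:                          # the ports block is what triggers Istio's STRICT_DNS handling
  - name: http2
    port: 80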

In a cluster with many pods running Istio sidecars and many inference services, CoreDNS gets flooded with DNS requests. We have seen it go into CrashLoopBackOff and hit i/o timeouts when talking to AWS nameservers:

[ERROR]: plugins/error 2 knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local.ec2.internal. A: read udp 10.52.76.80:XXX -> 10.52.0.2:53: i/o timeout
dimara commented 2 years ago

Any news on this? @mattmoor What is the rationale for creating ExternalName services with ports (see https://github.com/knative/serving/commit/09986741f0bc6e369ed99370728ad715054656d5)? Should we consider the observed behavior an Istio bug, i.e. the fact that ExternalName services with ports are handled differently and we end up with tons of DNS queries from Istio sidecars?

kyue1005 commented 2 years ago

We are having a similar issue; it would be great to get this fixed.

(screenshot attached: Screen Shot 2022-07-07 at 3 42 28 PM)
szelenka commented 2 years ago

I'm observing the same issue in Istio without Knative. Seems like an Istio bug IMO

kyue1005 commented 2 years ago

Thinking straightforwardly: if ExternalName services with ports are the culprit, would removing the port config resolve this?

szelenka commented 2 years ago

@kyue1005 Removing the port definition on the ExternalName Service prevents this behavior, yes.

However, when you have Istio mTLS configured in STRICT mode, the DestinationRules won't have a port defined for those ExternalNames, and traffic to those ExternalName Services will not work. Defining a port on the ExternalName allows this to function, but Istio then goes nuts resolving those names constantly, effectively DDoSing CoreDNS.
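
For comparison, a sketch of the port-less variant discussed above (again with made-up names): Istio does not create a STRICT_DNS cluster for it, but as described it stops working once mTLS is set to STRICT.

apiVersion: v1
kind: Service
metadata:
  name: my-model-predictor        # illustrative name
  namespace: my-namespace         # illustrative namespace
spec:
  type: ExternalName
  externalName: knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local
  # no ports: section -> no STRICT_DNS cluster, but traffic breaks under STRICT mTLS as described above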

kyue1005 commented 2 years ago

My temporary solution is to use a fully qualified domain name for my local gateway address to avoid the ndots search issue. It relieves the DNS load a bit, but the root cause still lies in the STRICT_DNS behavior; hopefully there will be a fix for that soon.

marcjimz commented 2 years ago

This is the exact issue we are running into. It seems that at a certain number of KServices, say 100, we start to see DNS failures and CoreDNS being crushed under the load. Strict mTLS is a requirement of our product. Any ideas on where and how we can target a fix?

daraghlowe commented 2 years ago

@kyue1005 where did you configure the fully qualified domain for the local gateway address? We're seeing a similar issue although we don't have sidecars in our environment.

kyue1005 commented 2 years ago

@daraghlowe I updated config-istio as below:

local-gateway.knative-serving.knative-local-gateway: "knative-local-gateway.istio-system.svc.cluster.local."
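
In full ConfigMap form that is roughly the following (name and namespace as in a default Knative-with-Istio install; adjust to your own setup):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-istio
  namespace: knative-serving
data:
  # the trailing dot marks the name as fully qualified, so the resolver skips the ndots search list
  local-gateway.knative-serving.knative-local-gateway: "knative-local-gateway.istio-system.svc.cluster.local."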
elukey commented 1 year ago

@kyue1005 Hi! Does the change above need a full restart of the istio-proxy on the target pods (e.g. deleting/respawning the pod or similar)?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

majeed828 commented 1 year ago

I came across the Istio Smart DNS Proxy feature. The underlying issue of the ExternalName service being resolved every 5 seconds will still be there, but it appears that after installing the Smart DNS Proxy the DNS queries are significantly reduced, even with ndots:5 in /etc/resolv.conf. Any thoughts?

https://istio.io/latest/blog/2020/dns-proxy/ https://istio.io/latest/docs/ops/configuration/traffic-management/dns-proxy/

"With Istio’s implementation of the CoreDNS style auto-path technique, the sidecar agent will detect the real hostname being queried within the first query and return a cname record to productpage.ns1.svc.cluster.local as part of this DNS response, as well as the A/AAAA record for productpage.ns1.svc.cluster.local. The application receiving this response can now extract the IP address immediately and proceed to establishing a TCP connection to that IP. The smart DNS proxy in the Istio agent dramatically cuts down the number of DNS queries from 12 to just 2!"

dprotaso commented 1 year ago

Is there an istio issue to track this perf problem?

edit - I just made one asking for recommendations - https://github.com/istio/istio/issues/44169

dprotaso commented 1 year ago

Hey folks, it was pointed out that the 5s sync interval is configurable: https://github.com/istio/istio/issues/44169#issuecomment-1489830835
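
If that is the MeshConfig dnsRefreshRate setting, a sketch of raising it would look roughly like this (the field name and value are my assumption, not confirmed in this thread; check the MeshConfig reference for your Istio version):

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    # refresh interval Envoy uses for STRICT_DNS clusters (default 5s); raising it reduces query volume
    dnsRefreshRate: 30s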

vsxen commented 1 year ago

The request interval should be 30s (the CoreDNS TTL).

The local-gateway workaround is probably useless. Can the ExternalName services be removed?

hzxuzhonghu commented 10 months ago

Since Istio supports local DNS, if the ExternalName target is knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local (a ClusterIP Kubernetes service), the sidecar itself can serve the DNS query, with no redirect to CoreDNS.