Open dimara opened 2 years ago
Any news on this? @mattmoor What is the rationale for creating ExternalName services with ports (see https://github.com/knative/serving/commit/09986741f0bc6e369ed99370728ad715054656d5)? Should we consider the observed behavior an Istio bug, i.e. the fact that ExternalName services with ports are handled differently and we end up with tons of DNS queries from Istio sidecars?
We are having a similar issue; it would be great if this could be fixed.
I'm observing the same issue in Istio without Knative. Seems like an Istio bug IMO
Straightforward question: if ExternalName services with ports are the culprit, would removing the port config resolve this?
@kyue1005 Removing the port definition on the ExternalName Service prevents this behavior, yes.
However, when you have Istio mTLS configured in STRICT mode, the DestinationRules won't have a port defined for those ExternalNames, and traffic to those ExternalName Services will not work. Defining a port on the ExternalName allows this to function, but Istio apparently goes nuts resolving those names constantly, effectively DDoS'ing CoreDNS.
My temporary solution is to use a fully qualified domain for my local gateway address to avoid the ndots search issue. It relieves the DNS load a bit, but the root cause still lies in the STRICT_DNS resolution; I hope there will be a fix for that soon.
This is the exact issue we are running into. At a certain number of KServices, say 100, we start to see DNS failures and CoreDNS buckling under the load. Strict mTLS is a requirement of our product. Any ideas on where and how we can target a fix?
@kyue1005 where did you configure the fully qualified domain for the local gateway address? We're seeing a similar issue although we don't have sidecars in our environment.
@daraghlowe I updated config-istio as below:
local-gateway.knative-serving.knative-local-gateway: "knative-local-gateway.istio-system.svc.cluster.local."
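As a sketch of why the trailing dot matters: with ndots:5, a name with fewer than five dots is first tried against every search domain before being queried as an absolute name, while a name ending in "." skips the search list entirely. A minimal Python model of the resolver's search-list expansion (the search domains below are the typical EKS defaults, assumed for illustration):

```python
# Hypothetical model of glibc-style resolver search-list expansion.
# Shows why a non-absolute cluster FQDN produces 5 queries under ndots:5,
# while appending a trailing dot produces exactly 1.

def expand_queries(name, search_domains, ndots=5):
    """Return the list of names the stub resolver will try, in order."""
    if name.endswith("."):
        return [name]  # absolute name: no search-list expansion
    if name.count(".") >= ndots:
        # enough dots: try the name as-is first, then the search list
        return [name + "."] + [f"{name}.{d}." for d in search_domains]
    # fewer dots than ndots: search list first, absolute name last
    return [f"{name}.{d}." for d in search_domains] + [name + "."]

# Assumed EKS pod search domains (us-east-1), for illustration only
search = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]
fqdn = "knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local"

queries = expand_queries(fqdn, search)  # fqdn has 4 dots, below ndots:5
print(len(queries))                     # 5: four NXDOMAIN + one NOERROR
print(expand_queries(fqdn + ".", search))  # trailing dot: a single query
```

With the trailing dot in the config-istio value above, the sidecar's lookups bypass the search list, which is why it relieves the CoreDNS load.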
@kyue1005 Hi! Does the change above need a full restart of the istio-proxy on the target pods? (so like delete/respawn of the pod or similar)
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Came across the Istio Smart DNS Proxy feature. Although the underlying issue of ExternalName service DNS resolution every 5 seconds will still be there, it appears that after enabling the Smart DNS Proxy the DNS queries are significantly reduced, even with ndots:5 in /etc/resolv.conf. Any thoughts?
https://istio.io/latest/blog/2020/dns-proxy/ https://istio.io/latest/docs/ops/configuration/traffic-management/dns-proxy/
"With Istio’s implementation of the CoreDNS style auto-path technique, the sidecar agent will detect the real hostname being queried within the first query and return a cname record to productpage.ns1.svc.cluster.local as part of this DNS response, as well as the A/AAAA record for productpage.ns1.svc.cluster.local. The application receiving this response can now extract the IP address immediately and proceed to establishing a TCP connection to that IP. The smart DNS proxy in the Istio agent dramatically cuts down the number of DNS queries from 12 to just 2!"
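For context, per the Istio docs linked above, the DNS proxy is turned on through proxy metadata. A sketch of the IstioOperator overlay (ISTIO_META_DNS_AUTO_ALLOCATE is the optional companion setting for auto-allocating addresses to ServiceEntries; adjust for your install method):

```yaml
# Sketch based on the Istio DNS proxy docs; verify keys against your Istio version.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        # Have the sidecar agent intercept and answer DNS queries itself
        ISTIO_META_DNS_CAPTURE: "true"
        # Optionally auto-allocate VIPs for ServiceEntries without addresses
        ISTIO_META_DNS_AUTO_ALLOCATE: "true"
```

With capture enabled, the sidecar answers cluster-internal names locally, so the ndots search-list expansion no longer hits CoreDNS for every attempt.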
Is there an istio issue to track this perf problem?
edit - I just made one asking for recommendations - https://github.com/istio/istio/issues/44169
Hey folks, it was pointed out that the 5s sync is configurable: https://github.com/istio/istio/issues/44169#issuecomment-1489830835
The request interval should be 30s (the CoreDNS TTL).
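For reference, the knobs behind that interval are the standard Envoy cluster options dns_refresh_rate and respect_dns_ttl. A hedged, untested sketch of an Istio EnvoyFilter that merges them into sidecar outbound clusters (the name and match scoping here are illustrative and may need adjusting for your mesh version):

```yaml
# Untested sketch: raise the STRICT_DNS refresh interval so Envoy honors
# the CoreDNS TTL (~30s) instead of re-resolving every 5s.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: dns-refresh-rate   # illustrative name
  namespace: istio-system
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      context: SIDECAR_OUTBOUND
    patch:
      operation: MERGE
      value:
        dns_refresh_rate: 30s   # standard Envoy cluster field
        respect_dns_ttl: true   # prefer the TTL returned by CoreDNS
```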
local-gateway is probably useless. Can the ExternalName be removed?
Since Istio supports local DNS, if the external name is knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local (a ClusterIP Kubernetes Service), the sidecar itself can serve the DNS response, with no redirecting to CoreDNS.
What version of Knative?
0.23.3
Summary
Knative creates an ExternalName Service (pointing to knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local) for each inference service, to redirect traffic to the Istio IngressGateway. For each such service, every Istio sidecar tries to resolve the specified DNS target every 5 seconds. As a result, CoreDNS gets flooded. This is even worse on EKS, where you have ndots: 5 and ec2.internal in the search domains, that is, each DNS query results in five, with one of them getting forwarded to AWS nameservers outside the cluster.
Steps to Reproduce the Problem
Resolving knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local with ndots: 5 results in 5 DNS requests: 4 NXDOMAIN, one for each search domain, and one NOERROR (see also https://discuss.istio.io/t/flood-of-nxdomain-lookups-to-coredns-from-istio-sidecar/11588). With ec2.internal in the search domains (on us-east-1), one of these requests is forwarded to AWS nameservers, which respond with NXDOMAIN. In a cluster with lots of pods with Istio sidecars and lots of inference services, CoreDNS gets flooded with DNS requests. We have seen it going into CrashLoopBackoff and getting i/o timeouts when talking to AWS nameservers:
(on us-east-1), one of the above DNS requests will be forwarded to AWS nameservers that will respond with NXDOMAIN.In a cluster with lots of pods with Istio sidecars and lots of inference services CoreDNS gets flooded with DNS requests. We have seen it going into CrashLoopBackoff and getting i/o timeout when talking to AWS nameservers: