abohne closed this issue 8 months ago
Hey @abohne!
Without any further configuration, the network mapper uses DNS traffic and /proc/net entries to determine connections between pods, as well as between pods and services.
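To make the /proc/net part a bit more concrete, here is a minimal Python sketch of reading IPv4 TCP entries in the /proc/net/tcp format and keeping only established connections. This is not the network mapper's actual code: the names `decode_addr` and `parse_proc_net_tcp` and the sample data are made up for illustration, and the address decoding assumes a little-endian host (x86_64, arm64).

```python
import socket
import struct

# Value of the "st" column in /proc/net/tcp for an established connection.
TCP_ESTABLISHED = "01"

# Hypothetical sample in the /proc/net/tcp layout (one listener, one connection).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345
   1: 0A00000A:C350 1400000A:0050 01 00000000:00000000 00:00000000 00000000     0        0 12346
"""


def decode_addr(hex_addr):
    """Decode 'HEXIP:HEXPORT' (e.g. '0100007F:1F90') into (ip, port)."""
    ip_hex, port_hex = hex_addr.split(":")
    # The IPv4 address is printed as a 32-bit hex number in host byte order;
    # this sketch assumes a little-endian host.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_proc_net_tcp(text):
    """Return [((local_ip, local_port), (remote_ip, remote_port)), ...]
    for established connections only."""
    connections = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with an index such as '0:'; this skips the header.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        if state != TCP_ESTABLISHED:
            continue
        connections.append((decode_addr(local), decode_addr(remote)))
    return connections
```

On a real node you would pass the contents of /proc/net/tcp instead of `SAMPLE`. The resulting IP pairs still have to be resolved to pods or services, and that resolution step is where IP reuse can cause trouble.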
There are various situations where pod IP addresses can be reused, in particular on AWS EKS. The network mapper has mitigations for these cases: it compares the hostname of the pod reported by the control plane with the hostname seen in /proc on the node by the network sniffer pods. I don't believe this is what's happening in your case; I'm mentioning it only to give a bit of background. :)
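As a rough illustration of that hostname check (a sketch under assumptions, not the mapper's real implementation), the idea is to trust an IP-to-pod resolution only when the hostname known to the control plane matches the hostname observed on the node. The names `PodRecord` and `resolve_verified` and the example pods are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodRecord:
    """What the control plane currently reports for a pod (hypothetical shape)."""
    name: str
    hostname: str


def resolve_verified(ip, observed_hostname, pods_by_ip):
    """Resolve an IP to a pod, but only if the hostname seen on the node
    matches the hostname the control plane reports for that IP.

    Returns the PodRecord, or None if the IP is unknown or the hostnames
    disagree (which suggests the IP was reused by a different pod).
    """
    pod = pods_by_ip.get(ip)
    if pod is None:
        return None
    if pod.hostname != observed_hostname:
        # The control plane and the node disagree, so drop the record
        # rather than attribute traffic to the wrong pod.
        return None
    return pod
```

For example, if the control plane says 10.0.0.10 belongs to a pod with hostname `checkout-7d9f` but the sniffer saw `old-job-abc` for that IP, the connection is discarded instead of being attributed to `checkout-7d9f`.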
To understand this better, could you share some more information? In particular:
If you are not sure how to answer some of these questions, happy to chat on Slack (join us at https://joinslack.otterize.com) or hop on a short Zoom to find the answers together. :)
Hey @abohne, checking again if you can provide some more details :)
Thanks for getting back to me @orishoshan. Suggestion 2 ended up being very helpful. We have a namespace with a bunch of services that I discovered are currently stuck in CrashLoopBackOff. I ended up telling istio to exclude that namespace, reset the network mapper state, and let things run for a while. The problematic service doesn't show up anymore.
FWIW, we're running EKS with AWS VPC CNI and istio CNI chained.
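In case it helps anyone else hunting for flapping pods, here is a small Python sketch that lists pods with a container waiting in CrashLoopBackOff, given the JSON printed by `kubectl get pods --all-namespaces -o json`. The function name `crashlooping_pods` and the sample data are made up for illustration; only the pod status fields (`status.containerStatuses[].state.waiting.reason`) come from the Kubernetes API.

```python
import json

# Hypothetical, trimmed-down pod list in the shape kubectl prints with -o json.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"namespace": "batch", "name": "worker-1"},
            "status": {"containerStatuses": [
                {"name": "app", "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
            ]},
        },
        {
            "metadata": {"namespace": "web", "name": "frontend-1"},
            "status": {"containerStatuses": [
                {"name": "app", "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}}}
            ]},
        },
    ]
})


def crashlooping_pods(pod_list_json):
    """Return (namespace, name) for every pod that has at least one
    container currently waiting with reason CrashLoopBackOff."""
    found = []
    for pod in json.loads(pod_list_json).get("items", []):
        meta = pod.get("metadata", {})
        # containerStatuses can be missing or null for pods not yet started.
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for status in statuses:
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append((meta.get("namespace"), meta.get("name")))
                break  # one crash-looping container is enough to flag the pod
    return found
```

Grouping the result by namespace makes it easy to see whether one namespace accounts for most of the restarts.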
That is a significant hint! I'm happy to hear it's working fine now, but this shouldn't happen regardless of what those pods were doing. We'll try to reproduce this with Istio and look into whether the Istio network mapping (which works by using sidecar metrics) is susceptible to misresolving pods.
With Istio, we actually use the workload identity provided by Istio for the resolution rather than IP addresses, so it should in theory be more robust. We'll look into this.
To provide some more context, we weren't actually using the istio mapping component. The pods that were flapping were also not in the same namespace as the bogus service we were seeing.
I see. Was the bogus service part of the service mesh? Meaning, did it have a sidecar?
Yes, the bogus service was part of the service mesh.
Thanks @abohne! This helps a lot. We'll try and reproduce it using a similar scenario and Istio. Have you experienced this again since then?
I haven't seen the issue reappear since I fixed the flapping service.
I'm running network-mapper 1.0.4 and am seeing a service incorrectly in the list of calls for every service listed. E.g.
In the example above, there should be no calls from zookeeper or kafka to unused-service. Is there a way to drill down and determine why a certain service appears in the call list?