otterize / network-mapper

Map Kubernetes traffic (in-cluster, to the Internet, and to AWS IAM) and export it as text, intents, or an image
Apache License 2.0

Service incorrectly appearing in every list of calls #151

Closed. abohne closed this issue 8 months ago.

abohne commented 12 months ago

I'm running network-mapper 1.0.4 and I'm seeing a service incorrectly appear in the call list of every service. For example:

zookeeper in namespace kafka calls:
  - unused-service in namespace example
  - zookeeper in namespace kafka
kafka in namespace kafka calls:
  - unused-service in namespace example
  - zookeeper in namespace kafka
prometheus in namespace monitoring calls:
  - unused-service in namespace example
  - kafka in namespace kafka
  - zookeeper in namespace kafka

In the example above, there should be no calls from zookeeper or kafka to unused-service. Is there a way to drill down and determine why a certain service appears in the call list?

orishoshan commented 12 months ago

Hey @abohne!

Without any further configuration, the network mapper uses DNS traffic and /proc/net entries to determine connections between pods, as well as between pods and services.
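
For a bit of intuition, here's a minimal Go sketch of the /proc/net side of this (illustrative only, not the mapper's actual code). /proc/net/tcp lists connections with hex-encoded addresses, IPv4 bytes in little-endian order:

```go
package main

import (
	"bufio"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"net"
	"os"
	"strings"
)

// parseHexAddr converts a /proc/net/tcp address like "0100007F:0CEA"
// into an IP and port. IPv4 addresses are stored little-endian, so the
// bytes are reversed (this sketch handles IPv4 only).
func parseHexAddr(s string) (net.IP, uint16, error) {
	parts := strings.Split(s, ":")
	if len(parts) != 2 {
		return nil, 0, fmt.Errorf("bad address %q", s)
	}
	ipBytes, err := hex.DecodeString(parts[0])
	if err != nil {
		return nil, 0, err
	}
	for i, j := 0, len(ipBytes)-1; i < j; i, j = i+1, j-1 {
		ipBytes[i], ipBytes[j] = ipBytes[j], ipBytes[i]
	}
	portBytes, err := hex.DecodeString(parts[1])
	if err != nil {
		return nil, 0, err
	}
	return net.IP(ipBytes), binary.BigEndian.Uint16(portBytes), nil
}

func main() {
	f, err := os.Open("/proc/net/tcp")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	scanner.Scan() // skip the header line
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 4 || fields[3] != "01" { // "01" = TCP_ESTABLISHED
			continue
		}
		localIP, localPort, err := parseHexAddr(fields[1])
		if err != nil {
			continue
		}
		remoteIP, remotePort, err := parseHexAddr(fields[2])
		if err != nil {
			continue
		}
		fmt.Printf("%s:%d -> %s:%d\n", localIP, localPort, remoteIP, remotePort)
	}
}
```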

There are various situations where pod IP addresses can be reused, in particular on AWS EKS. The network mapper has mitigations for these cases: it compares the hostname of the pod as reported by the control plane with the hostname the network sniffer pods see in /proc on the node. I don't believe that's what's happening in your case, but it gives a bit of background. :)
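
To illustrate that mitigation, here's a hypothetical sketch in Go (the real check in the mapper may differ in detail): compare the hostname the control plane reports for a pod against the hostname visible in /proc on the node.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// readProcHostname reads the hostname a process sees, through its mount
// namespace as exposed on the node. Hypothetical helper for illustration.
func readProcHostname(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/root/etc/hostname", pid))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

// isPodIPMappingFresh keeps an IP-to-pod mapping only while the hostname
// the control plane reports still matches what is observed on the node;
// a mismatch suggests the IP has been recycled by a newer pod.
func isPodIPMappingFresh(controlPlaneHostname, nodeHostname string) bool {
	return controlPlaneHostname != "" && controlPlaneHostname == nodeHostname
}

func main() {
	hostname, err := readProcHostname(os.Getpid())
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Println("fresh:", isPodIPMappingFresh(hostname, hostname))
}
```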

To understand this better, could you share some more information? In particular:

  1. A set of Kubernetes YAMLs that result in this situation, if you have them. Is the example in the issue a synthetic deployment you created to reproduce this problem, or is it part of a larger deployment that exhibits the problem?
  2. Are pods and services constantly going up and down in this deployment, or is it largely static? This helps determine whether the network mapper is misresolving pods or services because of stale cache entries (see the sketch after this list) or because of another issue. A static deployment strongly suggests it's unrelated to the cache.
  3. Which Kubernetes distribution are you using? Such as which cloud provider, or if it's self-managed Kubernetes, which flavor.
  4. Which CNI are you using and which version?
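
To make the stale-cache scenario in question 2 concrete, here's a hypothetical Go sketch (names and IPs invented for illustration, not the mapper's actual data structures) of how a recycled pod IP plus a stale cache entry produces exactly this kind of phantom call:

```go
package main

import "fmt"

// podRef identifies a workload by name and namespace.
type podRef struct {
	Name      string
	Namespace string
}

func main() {
	// The sniffer remembers which pod owned each IP when traffic was seen
	// (the IP and names here are invented for illustration).
	ipToPod := map[string]podRef{
		"10.0.1.7": {Name: "unused-service", Namespace: "example"},
	}

	// Later, unused-service's pod crash-loops and its IP is recycled for a
	// brand-new pod. Unless the stale entry is invalidated (e.g. by the
	// hostname check mentioned above), traffic involving 10.0.1.7 is still
	// attributed to unused-service, which then shows up in call lists
	// where it doesn't belong.
	fmt.Println(ipToPod["10.0.1.7"]) // {unused-service example}
}
```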

If you're not sure how to answer some of these questions, we're happy to chat on Slack (join us at https://joinslack.otterize.com) or hop on a short Zoom to find the answers together. :)

orishoshan commented 12 months ago

Hey @abohne, checking again if you can provide some more details :)

abohne commented 11 months ago

Thanks for getting back to me @orishoshan. Question 2 ended up being very helpful. We have a namespace with a bunch of services that I discovered are currently stuck in CrashLoopBackOff. I told Istio to exclude that namespace, reset the network mapper's state, and let things run for a while. The problematic service doesn't show up anymore.

FWIW, we're running EKS with the AWS VPC CNI and the Istio CNI chained.

orishoshan commented 11 months ago

That is a significant hint! I'm happy to hear it's working now, but that shouldn't have happened regardless. We'll try to reproduce this with Istio and look into whether the Istio network mapping (which works by using sidecar metrics) is susceptible to misresolving pods.

With Istio, we actually use the workload identity provided by Istio for the resolution rather than IP addresses, so it should in theory be more robust. We'll look into this.
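
For illustration, here's a minimal Go sketch of what that means. The metric name and labels are Istio's standard sidecar metric labels; the parsing is invented for this example, not the mapper's actual implementation. Sidecar metrics already carry source and destination workload names, so no IP-to-pod resolution is needed:

```go
package main

import (
	"fmt"
	"regexp"
)

// An Istio sidecar metric line: workload identity is in the labels,
// so connections can be attributed without resolving IP addresses.
const sample = `istio_tcp_connections_opened_total{` +
	`source_workload="kafka",source_workload_namespace="kafka",` +
	`destination_workload="zookeeper",destination_workload_namespace="kafka"} 3`

var labelRe = regexp.MustCompile(`(\w+)="([^"]*)"`)

func main() {
	labels := map[string]string{}
	for _, m := range labelRe.FindAllStringSubmatch(sample, -1) {
		labels[m[1]] = m[2]
	}
	fmt.Printf("%s/%s calls %s/%s\n",
		labels["source_workload_namespace"], labels["source_workload"],
		labels["destination_workload_namespace"], labels["destination_workload"])
}
```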

abohne commented 11 months ago

To provide some more context, we weren't actually using the Istio mapping component. The pods that were flapping were also not in the same namespace as the bogus service we were seeing.

orishoshan commented 11 months ago

I see. Was the bogus service part of the service mesh? Meaning, did it have a sidecar?

abohne commented 11 months ago

Yes, the bogus service was part of the service mesh.

orishoshan commented 11 months ago

Thanks @abohne! This helps a lot. We'll try and reproduce it using a similar scenario and Istio. Have you experienced this again since then?

abohne commented 11 months ago

I haven't seen the issue reappear since I fixed the flapping service.