otterize / network-mapper

Map Kubernetes traffic: in-cluster, to the Internet, and to AWS IAM and export as text, intents, or an image
Apache License 2.0

Sniffer fails posting to http://otterize-network-mapper:9090 #246

Closed by fjellvannet 1 day ago

fjellvannet commented 1 week ago

My sniffers won't properly start logging; they continuously hit this error: {"error":"Post \"http://otterize-network-mapper:9090/query\": context canceled","level":"error","msg":"Failed to report socket scan results","time":"2024-10-17T09:24:05Z"}

The otterize-network-mapper has been installed with the following command: helm upgrade --install network-mapper otterize/network-mapper -n otterize-system --create-namespace

My cluster was created with kubeadm (Kubernetes v1.31) and uses the Cilium network driver. Apart from that it is pretty "vanilla", with few additional security policies installed. However, it is behind a company proxy. I tried manually adding the environment variables HTTP_PROXY, HTTPS_PROXY and NO_PROXY to the env section of the deployment, but it did not make a difference. Also, why would the sniffers need to communicate with the internet?

[image: wget test run from a shell inside a sniffer pod] Here I shelled into a sniffer and used wget to test whether the URL is reachable. Of course I don't post the right payload, but at least we see that the DNS query resolves properly and that the URL itself is reachable.
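The wget test above covers two separate things: DNS resolution and a TCP connection. A minimal sketch in Python that checks them one at a time, assuming it runs somewhere in the cluster where Python is available (the sniffer image itself may not ship it). The helper names `resolves` and `tcp_reachable` are hypothetical; the host and port are taken from the error message.

```python
import socket


def resolves(host: str) -> bool:
    """Return True if the name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, None)) > 0
    except socket.gaierror:
        return False


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False


if __name__ == "__main__":
    # Host and port as they appear in the sniffer's error message.
    host, port = "otterize-network-mapper", 9090
    print("DNS resolves: ", resolves(host))
    print("TCP reachable:", tcp_reachable(host, port))
```

If both checks pass, as they did here, the failure is unlikely to be a plain connectivity problem between the sniffer and the mapper.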

Do you have an idea where these errors may originate from? CoreDNS works and runs in my cluster, and the main network-mapper pod is also up and running; only the sniffers continuously fail.

orishoshan commented 4 days ago

Indeed, the company proxy should not affect the sniffer - the traffic is internal to the cluster and between the sniffers and mapper. Let me look into this and figure out what other details we need from you that might help figure out what's going on.

orishoshan commented 1 day ago

Hey @fjellvannet, we were debugging this with another person who had the same issue, and the cause was that he had deployed Otterize twice in two different namespaces, which caused issues for the sniffers. Is there any chance that's also the case for you?

If not, would you be open to joining our Slack and taking a look together, as this issue seems a bit elusive?

To be clear, even if you deploy Otterize twice, the sniffers should still work, or if not, log a better error and fail the health check -- and we are checking why they didn't do that. Because the sniffers mount folders on the host and use the host network, it is possible that there is some unexpected interaction.

orishoshan commented 1 day ago

We think we have a fix, or one that will at least surface the actual error message. It seems that in some startup failures, the sniffer would not correctly shut down, and instead failed with this "context canceled" message. Try the latest version of Otterize and see if it helps.
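To illustrate the general pattern described above (not Otterize's actual code, which is written in Go): when a startup step fails and cancels a shared context, later operations only report "context canceled" unless the original cause is kept alongside the cancellation. A minimal sketch, where `CancelScope` and `report` are hypothetical names and the error text is made up for the example:

```python
import threading
from typing import Optional


class CancelScope:
    """A small stand-in for a cancellable context that remembers its cause."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._cancelled = False
        self.cause: Optional[Exception] = None

    def cancel(self, cause: Exception) -> None:
        # Keep only the first cause: that is the root failure.
        with self._lock:
            if not self._cancelled:
                self._cancelled = True
                self.cause = cause

    def cancelled(self) -> bool:
        with self._lock:
            return self._cancelled


def report(scope: CancelScope) -> str:
    """Simulate a reporting call that is aborted once the scope is cancelled."""
    if scope.cancelled():
        # Without the cause, the log would only say "context canceled".
        return f"context canceled (cause: {scope.cause})"
    return "reported"


if __name__ == "__main__":
    scope = CancelScope()
    # Hypothetical startup failure that triggers the cancellation.
    scope.cancel(OSError("bind: address already in use"))
    print(report(scope))
```

Recording the first cause is what turns an opaque cancellation into an actionable message.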

fjellvannet commented 1 day ago

I tested again now.

Verified that Otterize is not running in any other namespace or anywhere else in the cluster.

I also used the newest version of the Helm chart (after helm repo update) with no particular options.

Now I get errors like this in the sniffers: context canceled. Failed to report capture results / Failed to report socket scan results

orishoshan commented 1 day ago

Updating here: we talked on Slack :) It seems that the conflict was with an instance of Tetragon that was using the same port on the host. Until a more permanent solution is found, we've made the sniffer able to function even if the metrics port is in use -- metrics will simply be disabled in that case.
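The fallback described above can be sketched roughly as follows. This is an illustration in Python of the general idea, not the sniffer's real implementation; the function name `try_bind_metrics` and the port number in the usage example are placeholders, since the thread does not state which port was in conflict.

```python
import errno
import socket
from typing import Optional


def try_bind_metrics(port: int, host: str = "0.0.0.0") -> Optional[socket.socket]:
    """Bind the metrics listener, or return None if the port is already taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            # Another process on the host owns the port: run without metrics.
            print(f"metrics disabled: port {port} is already in use")
            return None
        raise  # Any other bind error is still treated as fatal.


if __name__ == "__main__":
    listener = try_bind_metrics(2112)  # placeholder port, not from the thread
    if listener is None:
        print("continuing without metrics")
    else:
        print("metrics listener bound")
        listener.close()
```

Because the sniffers use the host network, any other host-network workload that binds the same port can trigger this path, so degrading to "no metrics" keeps the core function running.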