hihellobolke opened this issue 7 months ago
Hello @hihellobolke, thanks for reaching out.
We are indeed exploring options for a shared cache. However, before going there: we don't recommend the DaemonSet approach for deploying NetObserv on large clusters, as you have seen it doesn't scale well. It's recommended to use the Kafka deployment model instead (in the FlowCollector resource, set spec.deploymentModel to Kafka, as documented here). This way, FLP is deployed as a Deployment that you can scale up and down.
Would that work for you?
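For reference, the switch is a single field change on the FlowCollector custom resource (conventionally named "cluster"); the spec.kafka section (brokers, topic) also needs to be filled in, which I'm omitting here. Below is a minimal sketch of doing it programmatically with the Kubernetes dynamic client; the group/version (flows.netobserv.io/v1beta2) and the exact casing of the Kafka value are assumptions that may differ depending on your operator version:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // or load a kubeconfig
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// FlowCollector is a cluster-scoped CR, conventionally named "cluster".
	// Group/version here are assumptions; adjust to what your operator serves.
	gvr := schema.GroupVersionResource{
		Group:    "flows.netobserv.io",
		Version:  "v1beta2",
		Resource: "flowcollectors",
	}

	// Switch from the per-node (DaemonSet) model to the Kafka deployment model,
	// so FLP runs as a scalable Deployment consuming flows from Kafka.
	patch := []byte(`{"spec":{"deploymentModel":"Kafka"}}`)
	_, err = dyn.Resource(gvr).Patch(context.Background(), "cluster",
		types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
}
```

In practice you would normally just edit the FlowCollector YAML (e.g. oc edit flowcollector cluster) rather than patching it from code; the snippet is only to make the change concrete.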
Regarding the proposed solution, I think it could indeed work. We have a PoC that introduces Infinispan as a distributed cache (though in that context it was for another purpose, not for Kube API caching).
Another approach could be to not use k8s informers in FLP and use k8s watches instead. A problem with k8s informers is that they cache whole GVKs rather than just the queried resources, leading to higher memory consumption and more traffic with the kube API. We did this already in the operator, with something we called "narrowcache" (https://github.com/netobserv/network-observability-operator/pull/476), to cut down memory usage.
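To illustrate the idea (this is not the actual narrowcache code, just a minimal sketch assuming a hypothetical lazyPodCache helper and pod resolution by IP via a field selector), FLP could fetch only the pods it actually sees in flows instead of caching everything:

```go
package main

import (
	"context"
	"fmt"
	"sync"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// lazyPodCache only remembers the pods that actually appear in flow logs,
// instead of an informer that lists and caches every pod in the cluster.
type lazyPodCache struct {
	client kubernetes.Interface
	mu     sync.RWMutex
	byIP   map[string]string // pod IP -> "namespace/name"
}

func newLazyPodCache(cfg *rest.Config) (*lazyPodCache, error) {
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	return &lazyPodCache{client: cs, byIP: map[string]string{}}, nil
}

// Lookup resolves a pod IP, hitting the API server only on a cache miss.
// A real implementation would also start a narrow Watch (same field selector)
// to keep entries fresh and evict deleted pods.
func (c *lazyPodCache) Lookup(ctx context.Context, ip string) (string, error) {
	c.mu.RLock()
	name, ok := c.byIP[ip]
	c.mu.RUnlock()
	if ok {
		return name, nil
	}

	// Cache miss: query only the pod(s) with this IP instead of listing everything.
	pods, err := c.client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "status.podIP=" + ip,
	})
	if err != nil {
		return "", err
	}
	if len(pods.Items) == 0 {
		return "", fmt.Errorf("no pod found with IP %s", ip)
	}
	name = pods.Items[0].Namespace + "/" + pods.Items[0].Name

	c.mu.Lock()
	c.byIP[ip] = name
	c.mu.Unlock()
	return name, nil
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes FLP runs in-cluster
	if err != nil {
		panic(err)
	}
	cache, err := newLazyPodCache(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(cache.Lookup(context.Background(), "10.128.0.15"))
}
```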
A downside that would affect both options is slower processing time on cache misses, since everything would be lazy-loaded in FLP.
While following https://docs.openshift.com/container-platform/4.12/network_observability/configuring-operator.html and running flowlogs-pipeline with k8s enrichment on large clusters (~20k pods), the memory consumption is huge. And since it was running as a DaemonSet, this effectively DDoSed the API server.
Would it be better for scaling to allow some sort of shared k8s enrichment cache for all FLP instances?
Perhaps the cache could be smarter, e.g. something like Redis client-side caching (sketched below): https://redis.io/docs/manual/client-side-caching/
In the end we had to build a custom gRPC server backed by a shared cache to achieve network tracing on larger clusters.
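As a rough illustration of the shared-cache idea (assuming go-redis and a hypothetical fallback that queries the kube API; real Redis client-side caching would additionally use CLIENT TRACKING so the server pushes invalidations to each instance's local cache), something like:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// sharedEnrichCache is a sketch of a shared k8s enrichment cache: every FLP
// instance asks Redis first and only falls back to the kube API on a miss,
// so the API server is queried at most once per pod IP across the cluster.
type sharedEnrichCache struct {
	rdb      *redis.Client
	ttl      time.Duration
	fallback func(ctx context.Context, podIP string) (string, error) // kube API lookup (illustrative)
}

func (c *sharedEnrichCache) Resolve(ctx context.Context, podIP string) (string, error) {
	key := "flp:enrich:" + podIP

	// 1. Try the shared cache first.
	val, err := c.rdb.Get(ctx, key).Result()
	if err == nil {
		return val, nil
	}
	if err != redis.Nil {
		return "", err // real Redis error, not just a cache miss
	}

	// 2. Miss: resolve against the kube API (only one instance pays this cost).
	val, err = c.fallback(ctx, podIP)
	if err != nil {
		return "", err
	}

	// 3. Populate the shared cache for all other FLP instances.
	if err := c.rdb.Set(ctx, key, val, c.ttl).Err(); err != nil {
		return "", err
	}
	return val, nil
}

func main() {
	cache := &sharedEnrichCache{
		rdb: redis.NewClient(&redis.Options{Addr: "redis:6379"}),
		ttl: 5 * time.Minute,
		fallback: func(ctx context.Context, podIP string) (string, error) {
			// A real FLP would call the kube API here (e.g. the lazy lookup sketched earlier).
			return "default/some-pod", nil
		},
	}
	fmt.Println(cache.Resolve(context.Background(), "10.128.0.15"))
}
```

With something like this, the API server is hit at most once per pod across all FLP instances, at the cost of one extra hop to the shared cache on local misses.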