NETOBSERV-1248 deduper using infinispan cache

jpinsonneau commented 8 months ago

Description

Implements infinispan shared cache:

[X] Add cache functions for deduper case Keeps Egress flows between two nodes in cache Then merge interfaces and ifdirections when Ingress is captured
[X] Make url configurable
[ ] Auth / tls
[ ] Implements extra cases

Merged flows contains a list of interfaces / ifdirections with a * next to names coming from Egress flow: Screenshot from 2024-03-18 20-19-43

Steps to use this code:

go mod tidy && go mod vendor
Install Infinispan Operator OR Red Hat Data Grid Operator

Create cache cluster in netobserv namespace with disabled TLS / Auth

apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
name: cache
namespace: netobserv
spec:
security:
endpointAuthentication: false
endpointEncryption:
  clientCert: None
  type: None
endpointSecretName: cache-generated-secret
container:
memory: 1Gi
service:
replicationFactor: 2
type: Cache
jmx: {}
configListener:
enabled: true
logging:
  level: info
upgrades:
type: Shutdown
version: 8.4.6-2
replicas: 1

Create deduper cache

apiVersion: infinispan.org/v2alpha1
kind: Cache
metadata:
name: deduper
namespace: netobserv
spec:
clusterName: cache
name: deduper
updates:
strategy: retain

These will provide a service cache.netobserv.svc.cluster.local on port 11222.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

[ ] Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
[ ] Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
[ ] Does this PR require product documentation?
- [ ] If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
[ ] Does this PR require a product release notes entry?
- [ ] If so, fill in "Release Note Text" in the JIRA.
[ ] Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
- [ ] If so, make sure it is described in the JIRA ticket.
QE requirements (check 1 from the list):
- [ ] Standard QE validation, with pre-merge tests unless stated otherwise.
- [ ] Regression tests only (e.g. refactoring with no user-facing change).
- [ ] No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

openshift-ci[bot] commented 8 months ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

openshift-ci[bot] commented 8 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from jpinsonneau. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: - **[OWNERS](https://github.com/netobserv/flowlogs-pipeline/blob/main/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

jotak commented 8 months ago

The new version is better but still increases the overall resource usage in my tests, although it's decrease for Loki & storage: https://docs.google.com/spreadsheets/d/1qakBaK1dk_rERO30k1cSR4W-Nn0SXW4A3lqQ1sZC4rE/edit#gid=705512397

(By overall I mean: everything deployed in netobserv* namespaces - so that includes the infinispan deployment and loki)

CPU overall +146% (1.68) Memory overall +30% (3GB)

Loki CPU -44% (0.035) Loki memory -46% (652MB) Loki storage -53%

So the benefit here is really on storage, and comes at a price in CPU and memory. We could run tests at a bigger scale to check if it's different.

Also one thing that seems wrong from my tests is that the captured traffic has dropped quite a lot, as if we were removing more flows than expected. That's to investigate.

jpinsonneau commented 7 months ago

Rebased this PR and added CacheEndpoint config to condition cache usage

openshift-merge-robot commented 7 months ago

PR needs rebase.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

jpinsonneau commented 3 months ago

@jotak is this something we should still consider ?

I can replace the json Marshal / Unmarshal by fmt.Sprintf to improve FLP CPU usage but we will still have the infinispan resource impact :thinking: If you think it's not worth I will just close that PR.

netobserv / flowlogs-pipeline