netobserv / netobserv-ebpf-agent

Network Observability eBPF Agent
Apache License 2.0

NETOBSERV-1564: do not force flushing maps when rb is triggered #348

Closed jotak closed 3 months ago

jotak commented 3 months ago

Description

Flushing (without throttling) has a harmful effect in high-stress scenarios: it generates a lot of evictions from maps, which results in many more flows being generated.

High-stress scenarios should rather rely on rb+accounter (ring buffer plus accounter), which better handles the number of generated flows, than try to force the use of maps this way.

Also, use errno as the reason for the metric.

With this change and a high-stress scenario, I'm seeing better CPU usage but slightly higher memory usage:

[Screenshot from 2024-06-14 17-14-26]
Patch applied at 16:55

Overall: -50% CPU and +10% memory.

This should be tested against cluster-density-v2

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

openshift-ci-robot commented 3 months ago

@jotak: This pull request references NETOBSERV-1564 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.17.0" version, but no target version was set.

In response to [this](https://github.com/netobserv/netobserv-ebpf-agent/pull/348):

>## Description
>
>Flushing (without throttling) has a nefast effect in high stressed scenario, generating a lot of evictions from maps, resulting in many more flows generated.
>
>High stressed scenarios should rather rely on rb+accounter, which better handles the number of generated flows.
>
>Also, use errno as the reason for the metric
>
>With this change + high stress scenario I'm seeing better CPU but more memory slightly increased:
>
>![Capture d’écran du 2024-06-14 17-14-26](https://github.com/netobserv/netobserv-ebpf-agent/assets/2153442/0c81bafe-dc60-4b2f-a3de-1658b49db876)
>
>overall, -50% CPU and +10% memory
>
>This should be tested against cluster-density-v2
>
>## Dependencies
>
>n/a
>
>## Checklist
>
>If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.
>
>* [ ] Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
>* [ ] Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix _(in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes)._
>* [ ] Does this PR require product documentation?
>  * [ ] If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
>* [ ] Does this PR require a product release notes entry?
>  * [ ] If so, fill in "Release Note Text" in the JIRA.
>* [ ] Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
>  * [ ] If so, make sure it is described in the JIRA ticket.
>* QE requirements (check 1 from the list):
>  * [ ] Standard QE validation, with pre-merge tests unless stated otherwise.
>  * [ ] Regression tests only (e.g. refactoring with no user-facing change).
>  * [ ] No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=netobserv%2Fnetobserv-ebpf-agent). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci-robot commented 3 months ago

@jotak: This pull request references NETOBSERV-1564 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

github-actions[bot] commented 3 months ago

New image: quay.io/netobserv/netobserv-ebpf-agent:81e0239

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=81e0239 make set-agent-image
codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@88136bd). Learn more about missing BASE report. Report is 2 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #348   +/-   ##
=======================================
  Coverage        ?   33.33%
=======================================
  Files           ?       48
  Lines           ?     3489
  Branches        ?        0
=======================================
  Hits            ?     1163
  Misses          ?     2229
  Partials        ?       97
```

| [Flag](https://app.codecov.io/gh/netobserv/netobserv-ebpf-agent/pull/348/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=netobserv) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/netobserv/netobserv-ebpf-agent/pull/348/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=netobserv) | `33.33% <0.00%> (?)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=netobserv#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/netobserv/netobserv-ebpf-agent/pull/348?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=netobserv) | Coverage Δ | |
|---|---|---|
| [pkg/flow/tracer\_ringbuf.go](https://app.codecov.io/gh/netobserv/netobserv-ebpf-agent/pull/348?src=pr&el=tree&filepath=pkg%2Fflow%2Ftracer_ringbuf.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=netobserv#diff-cGtnL2Zsb3cvdHJhY2VyX3JpbmdidWYuZ28=) | `25.00% <0.00%> (ø)` | |

Amoghrd commented 3 months ago

/label qe-approved

msherif1234 commented 3 months ago

/lgtm

jotak commented 3 months ago

thanks @msherif1234 /approve

openshift-ci[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/netobserv/netobserv-ebpf-agent/blob/main/OWNERS)~~ [jotak]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment