netobserv / netobserv-ebpf-agent

Network Observability eBPF Agent
Apache License 2.0

WIP NETOBSERV-1550: Using batchAPIs to help with CPU and memory resources #256

Closed msherif1234 closed 3 months ago

msherif1234 commented 6 months ago

Description

cilium recently added batch API support for PerCPU maps; this PR migrates the eBPF agent to use the batch APIs.

https://github.com/cilium/ebpf/discussions/1315

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

openshift-ci-robot commented 6 months ago

@msherif1234: This pull request references NETOBSERV-559 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/netobserv/netobserv-ebpf-agent/pull/256):

>## Description
>
>cilium recently added batchAPI support for PerCPU maps this PR to migrate ebpf agent to use batchapis
>
>## Dependencies
>
>n/a
>
>## Checklist
>
>If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.
>
>* [ ] Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
>* [ ] Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix _(in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes)._
>* [ ] Does this PR require product documentation?
>  * [ ] If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
>* [ ] Does this PR require a product release notes entry?
>  * [ ] If so, fill in "Release Note Text" in the JIRA.
>* [ ] Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
>  * [ ] If so, make sure it is described in the JIRA ticket.
>* QE requirements (check 1 from the list):
>  * [ ] Standard QE validation, with pre-merge tests unless stated otherwise.
>  * [ ] Regression tests only (e.g. refactoring with no user-facing change).
>  * [ ] No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=netobserv%2Fnetobserv-ebpf-agent). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci[bot] commented 6 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from msherif1234. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/netobserv/netobserv-ebpf-agent/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.
codecov[bot] commented 6 months ago

Codecov Report

Attention: Patch coverage is 0% with 90 lines in your changes are missing coverage. Please review.

Project coverage is 33.44%. Comparing base (b63f483) to head (edd8134). Report is 1 commits behind head on main.

:exclamation: Current head edd8134 differs from pull request most recent head 5fdf081. Consider uploading reports for the commit 5fdf081 to get more accurate results

| Files | Patch % | Lines |
|---|---|---|
| pkg/ebpf/tracer_batchapis.go | 0.00% | 57 Missing :warning: |
| pkg/ebpf/tracer.go | 0.00% | 33 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #256      +/-   ##
==========================================
- Coverage   34.04%   33.44%   -0.61%
==========================================
  Files          47       48       +1
  Lines        3836     3905      +69
==========================================
  Hits         1306     1306
- Misses       2444     2513      +69
  Partials       86       86
```

| [Flag](https://app.codecov.io/gh/netobserv/netobserv-ebpf-agent/pull/256/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=netobserv) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/netobserv/netobserv-ebpf-agent/pull/256/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=netobserv) | `33.44% <0.00%> (-0.61%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=netobserv#carryforward-flags-in-the-pull-request-comment) to find out more.


openshift-ci-robot commented 6 months ago

@msherif1234: This pull request references NETOBSERV-559 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

msherif1234 commented 6 months ago

/ok-to-test

github-actions[bot] commented 6 months ago

New image: quay.io/netobserv/netobserv-ebpf-agent:6d184cc

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=6d184cc make set-agent-image
msherif1234 commented 6 months ago

/ok-to-test

github-actions[bot] commented 6 months ago

New image: quay.io/netobserv/netobserv-ebpf-agent:bfa5ac7

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=bfa5ac7 make set-agent-image
msherif1234 commented 6 months ago

Ran a scale test based off 4.14: https://docs.google.com/spreadsheets/d/14taH8UGgjiLNqjgCRq66mcNBeNCDCnYEFGsur8Qig9I/edit#gid=2136829211

The summary shows an increase in eBPF resource usage:

cpuEBPFTotals | cpuEBPFTotals | avg(value) | Fail | 78.57% | 3.405002158 | 6.080411792 |

rssEBPFTotals | rssEBPFTotals | avg(value) | Fail | 53.61% | 3404791063 | 5230041771 |
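
The percentages in the two rows above appear to be the relative change between the last two columns. A minimal sketch of that arithmetic, assuming the first value is the baseline run and the second is the run with this PR (the column meaning is not stated in the summary, and the helper names `pctChange` and `formatChange` are made up for illustration):

```go
package main

import "fmt"

// pctChange returns the relative change from base to cur, in percent.
func pctChange(base, cur float64) float64 {
	return (cur - base) / base * 100
}

// formatChange renders the change with two decimals, like the summary rows.
func formatChange(base, cur float64) string {
	return fmt.Sprintf("%.2f%%", pctChange(base, cur))
}

func main() {
	// Values copied from the two summary rows; treating the first value
	// as the baseline is an assumption.
	fmt.Println("cpuEBPFTotals:", formatChange(3.405002158, 6.080411792))
	fmt.Println("rssEBPFTotals:", formatChange(3404791063, 5230041771))
}
```

Under that assumption the program prints 78.57% and 53.61%, matching the figures reported in the rows.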

msherif1234 commented 6 months ago


(pprof) top10 -cum
Showing nodes accounting for 70ms, 3.14% of 2230ms total
Dropped 56 nodes (cum <= 11.15ms)
Showing top 10 nodes out of 92
      flat  flat%   sum%        cum   cum%
         0     0%     0%     1770ms 79.37%  github.com/netobserv/netobserv-ebpf-agent/pkg/flow.(*MapTracer).evictFlows
         0     0%     0%     1770ms 79.37%  github.com/netobserv/netobserv-ebpf-agent/pkg/flow.(*MapTracer).evictionSynchronization
         0     0%     0%     1760ms 78.92%  github.com/netobserv/netobserv-ebpf-agent/pkg/ebpf.(*FlowFetcher).LookupAndDeleteMap
         0     0%     0%     1620ms 72.65%  github.com/cilium/ebpf.(*Map).BatchLookupAndDelete (inline)
         0     0%     0%     1620ms 72.65%  github.com/cilium/ebpf.(*Map).batchLookup
         0     0%     0%     1620ms 72.65%  github.com/cilium/ebpf.(*Map).batchLookupPerCPU
      40ms  1.79%  1.79%     1510ms 67.71%  github.com/cilium/ebpf/internal/sysenc.Unmarshal
      30ms  1.35%  3.14%     1350ms 60.54%  encoding/binary.Read
         0     0%  3.14%     1190ms 53.36%  github.com/cilium/ebpf.unmarshalBatchPerCPUValue
         0     0%  3.14%     1180ms 52.91%  github.com/cilium/ebpf.unmarshalPerCPUValue
(pprof) 
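
In the profile above, most of the cumulative time under `batchLookupPerCPU` is spent in `sysenc.Unmarshal` and `encoding/binary.Read`. As a rough illustration of why that path can be costly, here is a standard-library-only sketch that decodes a per-CPU value buffer two ways: with `binary.Read`, which goes through reflection for struct targets, and with direct `binary.LittleEndian` field reads. The `metrics` struct, its layout, and the helper names are hypothetical and are not the agent's real flow record or cilium/ebpf's internal code.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// metrics is a hypothetical fixed-size per-CPU value.
type metrics struct {
	Bytes   uint64
	Packets uint32
	Flags   uint32
}

// metricsSize is the encoded size of one slot. It is a multiple of 8,
// so this sketch needs no extra per-CPU alignment padding.
const metricsSize = 16

// encode packs one slot per CPU, little-endian, back to back.
func encode(vals []metrics) []byte {
	buf := make([]byte, 0, len(vals)*metricsSize)
	for _, v := range vals {
		buf = binary.LittleEndian.AppendUint64(buf, v.Bytes)
		buf = binary.LittleEndian.AppendUint32(buf, v.Packets)
		buf = binary.LittleEndian.AppendUint32(buf, v.Flags)
	}
	return buf
}

// decodeReflect decodes each slot with binary.Read, which uses
// reflection when the target is a struct.
func decodeReflect(buf []byte, nCPU int) ([]metrics, error) {
	out := make([]metrics, nCPU)
	for i := range out {
		r := bytes.NewReader(buf[i*metricsSize : (i+1)*metricsSize])
		if err := binary.Read(r, binary.LittleEndian, &out[i]); err != nil {
			return nil, err
		}
	}
	return out, nil
}

// decodeManual decodes each slot with direct field reads, no reflection.
func decodeManual(buf []byte, nCPU int) []metrics {
	out := make([]metrics, nCPU)
	for i := range out {
		s := buf[i*metricsSize : (i+1)*metricsSize]
		out[i] = metrics{
			Bytes:   binary.LittleEndian.Uint64(s[0:8]),
			Packets: binary.LittleEndian.Uint32(s[8:12]),
			Flags:   binary.LittleEndian.Uint32(s[12:16]),
		}
	}
	return out
}

func main() {
	vals := []metrics{{Bytes: 1500, Packets: 3, Flags: 1}, {Bytes: 64, Packets: 1}}
	buf := encode(vals)
	a, err := decodeReflect(buf, len(vals))
	if err != nil {
		panic(err)
	}
	b := decodeManual(buf, len(vals))
	fmt.Println(a[0] == b[0] && a[1] == b[1])
}
```

Both functions yield the same values; the difference is only in how much work each decode does per slot, which is multiplied by the number of CPUs and the number of entries in a batch.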
msherif1234 commented 5 months ago

Added benchmark tests comparing the iterate API against the batch lookup-and-delete API:

$ go test ./pkg/ebpf/ -exec sudo -bench=BenchmarkFlowFetcher_LookupAndDeleteMap -benchmem -count 5 -run=^#
goos: linux
goarch: amd64
pkg: github.com/netobserv/netobserv-ebpf-agent/pkg/ebpf
cpu: Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12                  403       2507858 ns/op      757583 B/op       2943 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12                  446       2531754 ns/op      746563 B/op       2838 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12                  488       2234317 ns/op      737511 B/op       2753 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12                  526       2209894 ns/op      730663 B/op       2688 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12                  477       2251203 ns/op      739670 B/op       2774 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12                386       2796254 ns/op      598852 B/op       4355 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12                345       3105146 ns/op      613746 B/op       4492 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12                370       2940347 ns/op      604619 B/op       4406 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12                304       3723941 ns/op      631809 B/op       4664 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12                326       3699242 ns/op      621145 B/op       4566 allocs/op
PASS
ok      github.com/netobserv/netobserv-ebpf-agent/pkg/ebpf  70.103s
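
To compare the two variants at a glance, the five `ns/op` samples per sub-benchmark can be averaged. A small sketch using only the numbers printed above (the variable and function names are made up for illustration; a tool such as benchstat would give a more rigorous comparison):

```go
package main

import "fmt"

// ns/op samples copied from the benchmark output above.
var (
	batchNs   = []int64{2507858, 2531754, 2234317, 2209894, 2251203}
	iterateNs = []int64{2796254, 3105146, 2940347, 3723941, 3699242}
)

// meanNs returns the arithmetic mean of the samples.
func meanNs(samples []int64) float64 {
	var sum int64
	for _, s := range samples {
		sum += s
	}
	return float64(sum) / float64(len(samples))
}

func main() {
	b, it := meanNs(batchNs), meanNs(iterateNs)
	fmt.Printf("batch:   %.0f ns/op\n", b)
	fmt.Printf("iterate: %.0f ns/op\n", it)
	fmt.Printf("batch takes %.1f%% less time per op\n", (it-b)/it*100)
}
```

On these samples the batch variant averages roughly 28% less time per operation than the iterate variant, while the `B/op` column shows it allocating more bytes per operation.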
msherif1234 commented 5 months ago

Started a reproducer upstream: https://github.com/cilium/ebpf/pull/1343

msherif1234 commented 5 months ago

/ok-to-test

github-actions[bot] commented 5 months ago

New image: quay.io/netobserv/netobserv-ebpf-agent:aafaead

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=aafaead make set-agent-image
msherif1234 commented 5 months ago

After updating cilium to the latest version, the performance scale run is still not promising: https://docs.google.com/spreadsheets/d/1QaUwO841fIWUiKN8jZ4_d-c0Wpr3WykWEHmNuI1g3l8/edit#gid=1878323530

The benchmark, however, shows better performance:

 go test ./pkg/ebpf/ -exec sudo -bench=BenchmarkFlowFetcher_LookupAndDeleteMap -benchmem  -run=XXX
goos: linux
goarch: amd64
pkg: github.com/netobserv/netobserv-ebpf-agent/pkg/ebpf
cpu: Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12                 1212        944542 ns/op      642167 B/op       1849 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12                942       1196901 ns/op      478780 B/op       3214 allocs/op
PASS
ok      github.com/netobserv/netobserv-ebpf-agent/pkg/ebpf  7.538s
msherif1234 commented 5 months ago

/ok-to-test

github-actions[bot] commented 5 months ago

New image: quay.io/netobserv/netobserv-ebpf-agent:baad512

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=baad512 make set-agent-image
openshift-ci-robot commented 4 months ago

@msherif1234: This pull request references NETOBSERV-1550 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

jotak commented 4 months ago

@msherif1234 I've created a new jira for this PR, NETOBSERV-1550; the former one (NETOBSERV-559) is now used for the non-batched LookupAndDelete in my PR https://github.com/netobserv/netobserv-ebpf-agent/pull/283

msherif1234 commented 4 months ago

/ok-to-test

msherif1234 commented 4 months ago

/ok-to-test

msherif1234 commented 4 months ago

/ok-to-test

github-actions[bot] commented 4 months ago

New image: quay.io/netobserv/netobserv-ebpf-agent:f8e7e13

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=f8e7e13 make set-agent-image
msherif1234 commented 3 months ago

I will close this PR, as switching to the batch APIs never showed any real value over what we have today. Should we ever reconsider, we can reopen it.