open-policy-agent / opa

Open Policy Agent (OPA) is an open source, general-purpose policy engine.
https://www.openpolicyagent.org
Apache License 2.0
9.3k stars 1.29k forks source link

bundle: Preallocate buffers for file contents. #6818

Closed philipaconrad closed 1 week ago

philipaconrad commented 2 weeks ago

Why were the changes in this PR made?

This PR reduces the max RSS memory usage of OPA during bundle loading by about 5-6%, and greatly improves GC performance at lower values of GOMAXPROCS.

This improves OPA's performance particularly for container and low-resource use cases by reducing the peak of the spike in memory usage, and also reduces garbage collector overheads.

What are the changes in this PR?

This PR adds logic to preallocate buffers when loading files from both tarballs and on-disk bundle directories. The change results in lower max RSS memory usage at runtime, and better garbage collector performance, especially when at lower values of GOMAXPROCS.

For very large bundles (>1 GB in size), this change can lower startup times for OPA by as much as a full second.

Benchmarking Details & Instructions

Huge thanks to @charlesdaniels for help getting the Awk script working! It's very handy for the summary reports about what the GC traces mean.

Instructions and scripts are provided in the expandable Details block below.

Scripts used for generating bundle contents/collecting results #### Generating worst-case, large bundles I used the following Python script to generate arbitrarily large, but highly compressible OPA bundles. `genbig.py`: ```python N = 100000 oneKBString = "A" * 1024 if __name__ == "__main__": print("{") for i in range(0, N): print('"{}": "{}",'.format(i, oneKBString)) print('"{}": "{}"'.format(N, oneKBString)) print("}") ``` - Usage: `python3 genbig.py > data.json` (Adjust the `N` variable to dial in the size in KB you want the file to be.) #### Getting `pprof` results quickly from an OPA server `profile.sh` ```bash #!/bin/bash SUFFIX=$1 curl -sK -v http://localhost:8181/debug/pprof/heap > heap-${SUFFIX}.prof curl -sK -v http://localhost:8181/debug/pprof/allocs > allocs-${SUFFIX}.prof go tool pprof -lines -png -output heap-${SUFFIX}.png heap-${SUFFIX}.prof go tool pprof -lines -png -output allocs-${SUFFIX}.png allocs-${SUFFIX}.prof ``` - Usage: `./profile.sh pr-1gb` (Generates `allocs-pr-1gb.png` and `heap-pr-1gb.png`) I ran OPA with the `pprof` endpoints enabled, using the CLI commands shown earlier in the graph sections. After collecting the pprof traces and generating the graphs with `profile.sh`, I then switched over to garbage collector trace work. #### Collecting and summarizing GC trace results With some help from @charlesdaniels, I put together this Awk script for reporting the aggregate timing stats for a GC trace log. `summarize-gctrace.awk`: ```awk # AWK script for summarizing GODEBUG=gctrace=1 logs. # Known to work for Go 1.22, but the log format is subject to change. # The format is the following (pulled from the Go `runtime` package docs): # gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # MB stacks, #MB globals, # P # ^$5 ^$10 ^$28 # Example: # gc 11 @3.169s 1%: 0.025+509+0.017 ms clock, 0.10+0/261/0.81+0.068 ms cpu, 3585->3585->2561 MB, 3585 MB goal, 0 MB stacks, 0 MB globals, 4 P BEGIN { FS = "[\t +/]+" } # Some simple assertions to prevent matching non-gctrace logs. $1 != "gc" { non_matching_rows++; next; } $9 == "cpu" { non_matching_rows++; next; } # The main pattern-action block: { num_threads = $28 clock_stw_sweep_term += $5 clock_conc_mark_scan += $6 clock_stw_mark_term += $7 total_clock += $5 + $6 + $7 cpu_ms_stw_sweep_term += $10 cpu_ms_assist_gc += $11 cpu_ms_background_gc += $12 cpu_ms_idle_gc += $13 cpu_ms_stw_mark_term += $14 total_cpu += $11 + $12 + $13 } END { print "--- Wall clock time summary -------------------" printf " STW sweep termination ......... %10.3f ms\n", clock_stw_sweep_term printf " Concurrent mark and scan ...... %10.3f ms\n", clock_conc_mark_scan printf " STW mark termination .......... %10.3f ms\n", clock_stw_mark_term printf " Total = %10.3f ms (%.3f s)\n", total_clock, total_clock / 1000 print "" print "--- CPU time breakdown ------------------------" printf " STW sweep termination ......... %10.3f ms\n", cpu_ms_stw_sweep_term printf " Mark/Scan assist time ....... %10.3f ms\n", cpu_ms_assist_gc printf " Mark/Scan background GC time %10.3f ms\n", cpu_ms_background_gc printf " Mark/Scan idle GC time ...... %10.3f ms\n", cpu_ms_idle_gc printf " STW mark termination .......... %10.3f ms\n", cpu_ms_stw_mark_term printf " Total = %10.3f ms (%.3f s) (Does not include STW phases)\n", total_cpu, total_cpu / 1000 print "" print "Num threads:", num_threads print "Non-matching rows:", non_matching_rows } ``` The script can be run in-line with OPA by `tee`-ing stderr over to the Awk process, and then `pkill`'ing OPA with a `SIGINT` signal. Example with the PR build of OPA: Terminal 1: ``` $ GODEBUG=gctrace=1 ./opa-pr run -s -b bundle-1gb.tar.gz --pprof 2> >(tee >(awk -f summarize-gctrace.awk >&2) >&2) ``` Terminal 2: ``` # After waiting for the OPA startup message to appear in Terminal 1: $ pkill -SIGINT opa-pr ```

Performance Analysis and Benchmarking Results


Results from main:

CLI command: /usr/bin/time -v ./opa-main run -s -b bundle-1gb.tar.gz --pprof

Results
Printout from time (note Maximum resident set size is ~3.8 GB) ``` Command being timed: "./opa-main run -s -b bundle-1gb.tar.gz --pprof" User time (seconds): 7.74 System time (seconds): 1.41 Percent of CPU this job got: 61% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.79 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 3834624 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 953708 Voluntary context switches: 3322 Involuntary context switches: 66 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ```
GODEBUG=gctrace=1 logs from main This is a representative sample of loading a 1GB bundle with `GODEBUG=gctrace=1` enabled. ``` gc 1 @0.019s 2%: 0.14+1.3+0.045 ms clock, 1.1+0.50/1.9/1.4+0.36 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 2 @0.024s 2%: 0.028+4.3+0.039 ms clock, 0.22+0/1.6/5.8+0.31 ms cpu, 3->3->2 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 3 @0.034s 2%: 0.058+10+0.016 ms clock, 0.47+0/1.7/12+0.13 ms cpu, 8->8->7 MB, 8 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 4 @0.046s 2%: 0.026+8.1+0.023 ms clock, 0.21+0/1.0/9.3+0.18 ms cpu, 15->15->13 MB, 15 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 5 @0.056s 2%: 0.014+10+0.012 ms clock, 0.11+0/0.63/11+0.096 ms cpu, 29->29->25 MB, 29 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 6 @0.070s 1%: 0.024+22+0.041 ms clock, 0.19+0/0.98/22+0.33 ms cpu, 57->57->49 MB, 57 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 7 @0.100s 1%: 0.034+38+0.003 ms clock, 0.27+0/0.61/38+0.025 ms cpu, 113->113->97 MB, 113 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 8 @0.152s 0%: 0.024+70+0.097 ms clock, 0.19+0/0.58/71+0.78 ms cpu, 225->225->193 MB, 225 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 9 @0.250s 0%: 0.024+134+0.011 ms clock, 0.19+0/0.58/134+0.088 ms cpu, 449->449->385 MB, 449 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 10 @0.433s 0%: 0.023+268+0.053 ms clock, 0.18+0/0.54/268+0.42 ms cpu, 897->897->769 MB, 897 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 11 @0.801s 0%: 0.024+514+0.003 ms clock, 0.19+0/0.73/514+0.031 ms cpu, 1793->1793->1537 MB, 1793 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 12 @3.252s 0%: 0.028+504+0.011 ms clock, 0.22+0/252/0.87+0.090 ms cpu, 3585->3585->2561 MB, 3585 MB goal, 0 MB stacks, 0 MB globals, 8 P {"addrs":[":8181"],"diagnostic-addrs":[],"level":"info","msg":"Initializing server. OPA is running on a public (0.0.0.0) network interface. Unless you intend to expose OPA outside of the host, binding to the localhost interface (--addr localhost:8181) is recommended. See https://www.openpolicyagent.org/docs/latest/security/#interface-binding","time":"2024-06-12T18:14:21-04:00"} ``` Summary stats: ``` --- Wall clock time summary ------------------- STW sweep termination ......... 0.447 ms Concurrent mark and scan ...... 1583.700 ms STW mark termination .......... 0.354 ms Total = 1584.501 ms (1.585 s) --- CPU time breakdown ------------------------ STW sweep termination ......... 3.540 ms Mark/Scan assist time ....... 0.500 ms Mark/Scan background GC time 262.850 ms Mark/Scan idle GC time ...... 1087.370 ms STW mark termination .......... 2.840 ms Total = 1350.720 ms (1.351 s) (Does not include STW phases) ```
Allocs graph from main ![allocs-main-1gb-tarball](https://github.com/open-policy-agent/opa/assets/1906841/c56a416a-8946-4d4e-aa35-abe96b847181)
Heap graph from main ![heap-main-1gb-tarball](https://github.com/open-policy-agent/opa/assets/1906841/aa6c3efd-f4ac-45c8-8df0-317f3aad9e05)

Results from this PR:

CLI command: /usr/bin/time -v ./opa-pr run -s -b bundle-1gb.tar.gz --pprof

Results
Printout from time (note Maximum resident set size is ~3.6 GB) ``` Command being timed: "./opa-pr run -s -b bundle-1gb.tar.gz --pprof" User time (seconds): 6.51 System time (seconds): 1.25 Percent of CPU this job got: 63% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.30 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 3636100 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 913402 Voluntary context switches: 2421 Involuntary context switches: 171 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ```
GODEBUG=gctrace=1 logs from PR This is a representative sample of loading a 1GB bundle with `GODEBUG=gctrace=1` enabled. ``` gc 1 @0.009s 3%: 0.023+0.67+0.18 ms clock, 0.18+0.21/1.1/0.86+1.5 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 2 @0.013s 3%: 0.021+0.46+0.015 ms clock, 0.16+0.18/0.74/1.3+0.12 ms cpu, 989->989->989 MB, 989 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 3 @1.696s 0%: 0.030+268+0.014 ms clock, 0.24+0/0.61/128+0.11 ms cpu, 2013->2013->1757 MB, 2013 MB goal, 0 MB stacks, 0 MB globals, 8 P gc 4 @6.114s 0%: 0.030+48+0.014 ms clock, 0.24+0.10/96/167+0.11 ms cpu, 3511->3525->2715 MB, 3515 MB goal, 0 MB stacks, 0 MB globals, 8 P {"addrs":[":8181"],"diagnostic-addrs":[],"level":"info","msg":"Initializing server. OPA is running on a public (0.0.0.0) network interface. Unless you intend to expose OPA outside of the host, binding to the localhost interface (--addr localhost:8181) is recommended. See https://www.openpolicyagent.org/docs/latest/security/#interface-binding","time":"2024-06-12T18:14:42-04:00"} ``` Summary stats: ``` --- Wall clock time summary ------------------- STW sweep termination ......... 0.104 ms Concurrent mark and scan ...... 317.130 ms STW mark termination .......... 0.223 ms Total = 317.457 ms (0.317 s) --- CPU time breakdown ------------------------ STW sweep termination ......... 0.820 ms Mark/Scan assist time ....... 0.490 ms Mark/Scan background GC time 98.450 ms Mark/Scan idle GC time ...... 297.160 ms STW mark termination .......... 1.840 ms Total = 396.100 ms (0.396 s) (Does not include STW phases) ```
Allocs graph from PR ![allocs-pr-1gb-tarball](https://github.com/open-policy-agent/opa/assets/1906841/339ad6a8-a14e-4a15-9fe8-ab6443d20e1b)
Heap graph from PR ![heap-pr-1gb-tarball](https://github.com/open-policy-agent/opa/assets/1906841/779d376c-36ec-4583-b37f-73175a177041)
netlify[bot] commented 2 weeks ago

Deploy Preview for openpolicyagent ready!

Name Link
Latest commit 5839c4ba921f4f6707a996a7d6e7766d1c9535f6
Latest deploy log https://app.netlify.com/sites/openpolicyagent/deploys/6669e5f5a96c860008e37d99
Deploy Preview https://deploy-preview-6818--openpolicyagent.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] commented 2 weeks ago

Deploy Preview for openpolicyagent ready!

Name Link
Latest commit 48a4ea4560abe0b4c747133d323942c426485dd1
Latest deploy log https://app.netlify.com/sites/openpolicyagent/deploys/6672ca0d481fe70008e9d5bd
Deploy Preview https://deploy-preview-6818--openpolicyagent.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

philipaconrad commented 1 week ago

@johanfylling If you think a Golang benchmark or something would be valuable to add to this PR to help the case for merging, please LMK, and I'll allocate some time for that.

Thanks again for the helpful review comments! :smile:

srenatus commented 1 week ago

The change will improve the performance when loading large (giant) bundles...I think it might be hard to do this in a benchmark. The PR's instructions outline the procedure used to verify it...that might have to be sufficient for now, I suppose.

philipaconrad commented 1 week ago

@johanfylling @srenatus Benchmarks are up! :smile:

It took me a sec to find the right way to exercise the changes-- I wound up settling on calling Bundle.Read(), since that invokes all of the loader machinery behind the scenes, and allows the resulting buffers to be used directly by their intended downstream consumers.

Example CLI invocations of the new benchmarks:

go test -benchmem -run=^$ -bench ^BenchmarkTarballLoader$ github.com/open-policy-agent/opa/bundle -count=30 > tar.txt
go test -benchmem -run=^$ -bench ^BenchmarkDirectoryLoader$ github.com/open-policy-agent/opa/bundle -count=30 > dir.txt
Benchstat results for BenchmarkTarballLoader (N=30 runs each) - [main-tar.txt](https://github.com/user-attachments/files/15893348/main-tar.txt) - [pr-6818-tar.txt](https://github.com/user-attachments/files/15893349/pr-6818-tar.txt) ``` $ benchstat main-tar.txt pr-6818-tar.txt name old time/op new time/op delta TarballLoader/1000-8 0.00ns ±59% 0.00ns ±36% ~ (p=0.159 n=28+30) TarballLoader/10000-8 0.02ns ±36% 0.01ns ±19% -69.13% (p=0.000 n=30+28) TarballLoader/100000-8 0.09ns ±20% 0.04ns ± 4% -54.52% (p=0.000 n=30+27) TarballLoader/250000-8 0.18ns ± 6% 0.10ns ± 5% -43.45% (p=0.000 n=30+28) name old alloc/op new alloc/op delta TarballLoader/1000-8 0.00B 0.00B ~ (all equal) TarballLoader/10000-8 0.00B 0.00B ~ (all equal) TarballLoader/100000-8 0.00B 0.00B ~ (all equal) TarballLoader/250000-8 0.00B 0.00B ~ (all equal) name old allocs/op new allocs/op delta TarballLoader/1000-8 0.00 0.00 ~ (all equal) TarballLoader/10000-8 0.00 0.00 ~ (all equal) TarballLoader/100000-8 0.00 0.00 ~ (all equal) TarballLoader/250000-8 0.00 0.00 ~ (all equal) ```
Benchstat results for BenchmarkDirectoryLoader (N=30 runs each) - [main-dir.txt](https://github.com/user-attachments/files/15893352/main-dir.txt) - [pr-6818-dir.txt](https://github.com/user-attachments/files/15893351/pr-6818-dir.txt) ``` $ benchstat main-dir.txt pr-6818-dir.txt name old time/op new time/op delta DirectoryLoader/10000-8 0.01ns ±18% 0.00ns ±83% -67.90% (p=0.000 n=27+30) DirectoryLoader/100000-8 0.06ns ± 8% 0.02ns ±15% -68.65% (p=0.000 n=29+29) DirectoryLoader/250000-8 0.11ns ±16% 0.05ns ±13% -55.78% (p=0.000 n=29+27) DirectoryLoader/500000-8 0.21ns ± 6% 0.10ns ±18% -55.17% (p=0.000 n=26+27) name old alloc/op new alloc/op delta DirectoryLoader/10000-8 0.00B 0.00B ~ (all equal) DirectoryLoader/100000-8 0.00B 0.00B ~ (all equal) DirectoryLoader/250000-8 0.00B 0.00B ~ (all equal) DirectoryLoader/500000-8 1.00B ± 0% 0.00B -100.00% (p=0.000 n=30+30) name old allocs/op new allocs/op delta DirectoryLoader/10000-8 0.00 0.00 ~ (all equal) DirectoryLoader/100000-8 0.00 0.00 ~ (all equal) DirectoryLoader/250000-8 0.00 0.00 ~ (all equal) DirectoryLoader/500000-8 0.00 0.00 ~ (all equal) ```