vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

PGO applicability to Vector #15631

Open zamazan4ik opened 1 year ago

zamazan4ik commented 1 year ago

TL;DR: With PGO, Vector's throughput improved from 300-310k events/s to 350-370k events/s!

Hi!

I am a big fan of PGO, so I tried it with Vector, and I want to share my current results. My hypothesis is the following: even for programs already built with LTO, PGO can bring HUGE benefits. So I decided to test it. From my experience, PGO works especially well on large codebases with some CPU-hot parts. Vector looks like a really good fit.

Test scenario

  1. Read a huge file with some logs
  2. Parse them
  3. Pass them to the blackhole.

This test scenario is completely real-life (except the blackhole, of course :) ), and the log format and parse function are almost copied from our current production environment. We have patched the flog tool to generate our log format (closed-source patch, sorry; I could publish it later if there is a need for it).

Example of one log entry: <E 2296456 point.server.session 18.12 19:17:36:361298178 processCall We need to generate the solid state GB interface! (from session.cpp +713)

So Vector config is the following (toml):

[sources.in]
type = "file"
include = [ "/Users/zamazan4ik/open_source/test_vector_logs/data/*" ]
read_from = "beginning"
file_key = "file"
data_dir = "/Users/zamazan4ik/open_source/test_vector_logs"

[transforms.parser]
type = "remap"
inputs = [ "in" ]
source = """
.message = parse_regex!(.message, r'<(?P<level>[EWD]) (?P<thread>.+?) (?P<tag>[a-z.]+) (?P<datetime>[\\d.]+ [\\d:]*) (?P<function>[\\S]+) (?P<mess>.*) \\(from (?P<file>[\\S.]*) \\+(?P<line>\\d+)\\)')
"""

[sinks.out]
type = "blackhole"
inputs = [ "parser" ]

[api]
  enabled = true
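The `parse_regex!` pattern above can be sanity-checked outside Vector. Here is a small sketch using Python's `re` module with the same pattern (unescaped from the VRL string) applied to the sample log entry; the capture-group names come straight from the config:

```python
import re

# Same pattern as in the `remap` transform above, with named capture groups.
LOG_RE = re.compile(
    r'<(?P<level>[EWD]) (?P<thread>.+?) (?P<tag>[a-z.]+) '
    r'(?P<datetime>[\d.]+ [\d:]*) (?P<function>[\S]+) (?P<mess>.*) '
    r'\(from (?P<file>[\S.]*) \+(?P<line>\d+)\)'
)

sample = ('<E 2296456 point.server.session 18.12 19:17:36:361298178 '
          'processCall We need to generate the solid state GB interface! '
          '(from session.cpp +713)')

m = LOG_RE.match(sample)
assert m is not None
print(m.group('level'), m.group('tag'), m.group('function'))
# -> E point.server.session processCall
```

Each named group becomes a field of `.message` after the transform, which is exactly what makes this a CPU-heavy (regex-bound) workload.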

You could say: "Test scenario is too simple", but:

Test setup

MacBook M1 Pro with macOS Ventura 13.1, 6+2 CPU cores on ARM (AFAIK), 16 GiB RAM, and a 512 GiB SSD. Sorry, I have no Linux machine near me right now, nor a desire to test on a Linux VM or an Asahi Linux setup. However, I am completely sure the results will be reproducible on a "usual" Linux-based x86-64 setup.

How to build

Vector already uses fat LTO for the release build. However, a local release build and the release build on CI are different, since the local release build does not use fat LTO (it is far too time-consuming). So do not forget to add the following flags to your release build (taken from scripts/environment/release-flags.sh):

codegen-units = 1
lto = "fat"
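For reference, these are standard Cargo profile settings; a minimal sketch of where they go in Cargo.toml (profile keys as documented in the Cargo reference):

```toml
# Cargo.toml — release profile matching the CI release flags
[profile.release]
codegen-units = 1   # single codegen unit: slower build, better optimization
lto = "fat"         # full cross-crate link-time optimization
```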

For the PGO build of Vector I used this nice wrapper: https://github.com/Kobzol/cargo-pgo . You could do it manually if you want; I am just a little bit lazy :)

The guide is simple:
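Roughly, a typical cargo-pgo flow looks like this (command names from the cargo-pgo README; the target triple and config path are placeholders for your setup):

```shell
# Install the wrapper (once)
cargo install cargo-pgo

# 1. Build an instrumented binary
cargo pgo build

# 2. Run the instrumented binary on a representative workload
#    (here: the file-source + remap + blackhole config from above)
./target/<target-triple>/release/vector --config vector.toml

# 3. Rebuild with the gathered profiles applied
cargo pgo optimize
```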

Is it worth it?

Yes! At least in my case, I got a huge boost: from 300-310k events/s (according to vector top) with the default Vector release build with the LTO flags from CI, to 350-370k events/s with the same build plus PGO. So at least in my case it's a huge win.

The comparison strategy is simple: run the LTO-only Vector binary, then the LTO + PGO binary (resetting the file checkpoint between runs, of course), measure the total time until the whole file is processed, and track metrics via vector top during execution.

Results are stable and reproducible. I have performed multiple runs in different execution orders with the same results.

So what?

So what could we do with it?

Possible future steps

Possible future steps for improving:

I hope this long read is at least interesting for someone :) If you have any questions, just ask me here or on the official Vector Discord server (nickname zamazan4ik there as well).

spencergilbert commented 1 year ago

Hey @zamazan4ik, thanks for the extensive writeup! I know we've discussed this in the past, but it seems like it was probably internally on Slack as I didn't find any related issues. I'm also pretty sure we had looked at that cargo-pgo project 😄.

I don't quite remember why we didn't move forward (I think even with testing it), but it's interesting to see your results here.

cc @jszwedko @tobz @blt, as I'm guessing y'all were involved with that original discussion.

tobz commented 1 year ago

Getting a 10-15% performance boost for essentially a bit of extra CI time per release is certainly an incredibly good trade-off. I think the biggest thing would just be, as you've pointed out, doing all of the legwork to figure out what platforms we can/can't do PGO on, and creating the CI steps to do it for release/nightly builds.

I'd also be curious to figure out what workload is the best for PGO profiles. As an example: are any of our current soak/regression test workloads better/worse than what you used when locally testing? That sort of thing.

zamazan4ik commented 1 year ago

doing all of the legwork to figure out what platforms we can/can't do PGO on

Well, actually PGO is in a good state across all major platforms (Linux, macOS, Windows). Probably the best source of truth regarding the state of PGO in the Rust ecosystem is the rustc project itself, since they invest a lot of resources into optimizing the Rust compiler (e.g. Rust 1.66 enabled BOLT optimization in addition to PGO on Linux).

and creating the CI steps to do it for release/nightly builds

Yes, that will be the most time-consuming and boring part IMO. Also, do not forget about at least a 2x increase in build time (instrumented build + run on the test workload + optimized build).
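For CI planning, the two-phase flow without cargo-pgo is worth spelling out; a minimal sketch using the rustc PGO flags (flag names from the rustc book's PGO chapter; the profile directory is a placeholder):

```shell
# Phase 1: instrumented build
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# Run the workload; .profraw files are written into /tmp/pgo-data
./target/release/vector --config vector.toml

# Merge raw profiles (llvm-profdata comes with `rustup component add llvm-tools-preview`)
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# Phase 2: optimized build using the merged profile
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
```

This is where the 2x build time comes from: the release build is effectively done twice, with the workload run in between.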

I'd also be curious to figure out what workload is the best for PGO profiles.

From my experience, I would say the most beneficial targets are CPU-heavy workloads (obviously). PGO shows good results on huge programs with many possible branches and lots of context, where the compiler cannot make good guesses about hot/cold branching, real-life inlining, etc. That's where PGO shines. Long story short, I do not expect much performance gain in IO-bound workloads (e.g. posting to Elasticsearch), simply because the network is usually much, much slower than the CPU, so even if we get a speedup there, we will not see it in real life.

bruceg commented 1 year ago

I'd also be curious to figure out what workload is the best for PGO profiles.

I think that the ideal workload for a PGO profile should exercise all the components, or at least all the component subsystems, as there would be no benefit for those components that aren't exercised. It would probably be good to see some indication of code coverage with this too, something we are also lacking.

zamazan4ik commented 1 year ago

I think that the ideal workload for a PGO profile should exercise all the components, or at least all the component subsystems, as there would be no benefit for those components that aren't exercised. It would probably be good to see some indication of code coverage with this too, something we are also lacking.

Good suggestion. I just want to add that this work could be done iteratively: add baseline loads for the components step by step. That way we can deliver PGO improvements incrementally instead of waiting until baseline profiles are prepared for all components at once.

zamazan4ik commented 1 year ago

@jszwedko do you want to mention PGO somewhere here in the documentation?

jszwedko commented 1 year ago

@jszwedko do you want to mention PGO somewhere here in the documentation?

That page is more about tuning the released Vector assets rather than recommendations that involve recompiling Vector.

I'd be happy to see us do this, but, as discussed above, it'll take some work.

zamazan4ik commented 1 year ago

@jszwedko I have some examples of how a PGO-oriented page could look:

I think a similar approach could be used for Vector as well - just create a page with a dedicated note about PGO and put it in the Vector documentation.

jszwedko commented 1 year ago

Thanks for the links @zamazan4ik ! I've come around and agree that we could add this to the docs for advanced users that are able to compile Vector themselves and run example workloads. I could see it being a subpage under https://vector.dev/docs/administration/tuning/. Feel free to open a PR if you like 🙂

zamazan4ik commented 1 year ago

I did some LTO, PGO and BOLT benchmarks on Linux and want to share my numbers. The test scenario is exactly the same as in https://github.com/vectordotdev/vector/issues/15631#issue-1502073978 .

Setup

My setup is the following:

Results

Unfortunately, I didn't manage to test LTO + PGO on Linux, since on the current Rust version it's broken for as-yet-unknown reasons (see https://github.com/Kobzol/cargo-pgo/issues/32 and https://github.com/llvm/llvm-project/issues/57501 for further details). Hopefully this will be fixed in the future.

So I did some measurements of different LTO configurations with BOLT. The reported time is the time to complete the test scenario (process the same input file with the file source and apply some heavy regex-based transforms). The results are the following:

According to the results above, there are several conclusions:

@bruceg pinging you since you asked me regarding BOLT for Vector.

bruceg commented 1 year ago

Thanks for this writeup, @zamazan4ik, that's great to see. Did lto = "fat" + PGO optimized work at all for you or did you hit a bug there? Is there an open issue regarding LTO + PGO in rustc?

zamazan4ik commented 1 year ago

Did lto = "fat" + PGO optimized work at all for you or did you hit a bug there? Is there an open issue regarding LTO + PGO in rustc?

Nope, it doesn't work right now due to a compilation error in the "LTO + PGO" combination. I've created an issue at https://github.com/Kobzol/cargo-pgo/issues/32 and added a comment to a possibly related LLVM bug at https://github.com/llvm/llvm-project/issues/57501#issuecomment-1694455863 . I haven't created an issue about this behavior in the rustc repo yet (maybe @Kobzol can add some details regarding the issue). If not, I will create an issue in the rustc issue tracker as well.

zamazan4ik commented 1 year ago

Upstream bug regarding LTO + PGO: https://github.com/rust-lang/rust/issues/115344

gonchik commented 6 months ago

Hi! Do you have any tests for Windows? I have a sample where the rename-path operation looks quite slow in vector.exe - I don't know why WPA shows that.

I will try to check with a PGO profile.