vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
17.73k stars 1.57k forks source link

Improve `starts_with` and `ends_with` performance #6680

Closed jszwedko closed 1 year ago

jszwedko commented 3 years ago

While optimizing a user script:

      if starts_with(.log, "{") {
         .log = parse_json(.log)
         .log_type = "json"
      } else {
        .log_type = "text"
      }

The question of whether the starts_with guard was useful in reducing the performance impact of just calling parse_json all the time. Running a quick benchmark makes it seem like this is not the case:

remap-functions/parse_json/error
                        time:   [275.43 ns 275.81 ns 276.18 ns]
                        thrpt:  [3.6208 Melem/s 3.6256 Melem/s 3.6306 Melem/s]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
remap-functions/starts_with/error
                        time:   [3.2642 us 3.2687 us 3.2732 us]
                        thrpt:  [305.52 Kelem/s 305.93 Kelem/s 306.35 Kelem/s]

Diff of the benches I added:

diff --git a/lib/vrl/stdlib/benches/benches.rs b/lib/vrl/stdlib/benches/benches.rs
index e96383eb9..785c7943e 100644
--- a/lib/vrl/stdlib/benches/benches.rs
+++ b/lib/vrl/stdlib/benches/benches.rs
@@ -734,6 +734,11 @@ bench_function! {
         args: func_args![value: r#"{"key": "value"}"#],
         want: Ok(value!({key: "value"})),
     }
+
+    error {
+        args: func_args![value: r#"level=info msg="Stopping all fetchers" tag=stopping_fetchers id=ConsumerFetcherManager-1382721708341 module=kafka.consumer.ConsumerFetcherManager"#],
+        want: Ok(value!({})),
+    }
 }

 bench_function! {
@@ -1032,6 +1037,11 @@ bench_function! {
         args: func_args![value: "abcdefg", substring: "ABC", case_sensitive: false],
         want: Ok(value!(true)),
     }
+
+    error {
+        args: func_args![value: r#"level=info msg="Stopping all fetchers" tag=stopping_fetchers id=ConsumerFetcherManager-1382721708341 module=kafka.consumer.ConsumerFetcherManager"#, substring: "{"],
+        want: Ok(value!(false)),
+    }
 }

 bench_function! {

My best guess for this is because we always convert the value and substring to UTF-8 via try_bytes_utf8_lossy for the comparison. I think we can probably implement these two functions without converting the bytes to a string and instead just compare them as-is.

jszwedko commented 3 years ago

This is low priority, but just figured I'd record it.

StephenWakely commented 3 years ago

contains is another one.

I suspect that there are a number of functions in Remap that do try_bytes_utf8_lossy().into_owned() that with a bit of refactoring could avoid the clone. It would be a useful exercise now that we have some good benchmarks to look through each function and see what could be enhanced. The cost would likely be increased code complexity.

JeanMertz commented 3 years ago

Adding this to the backlog, as a generic "remove any unneeded string allocations" ticket, starting with the above-mentioned functions.

jszwedko commented 1 year ago

Closing as at least partially addressed by https://github.com/vectordotdev/vector/pull/10296. We can let user demand drive further improvements.