vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Extensive regex matching leads to spin lock contention #14898

Open dizlv opened 2 years ago

dizlv commented 2 years ago

Problem

I encountered a strange performance degradation while upgrading Vector from 0.18 to 0.24.2. After dozens of performance tests and investigations, I think I have found the cause.

Tests were performed on a c5.9xlarge instance (AWS), with Vector launched in Docker. At the end of the topology there are some log_to_metric components with multiple templated tags, something like this:

[[transforms.transform_name.metrics]]
field = 'field-name'
name = 'name'
type = 'counter'

tags.tag1 = '{{tags.tag1}}'
tags.tag2 = '{{tags.tag2}}'
tags.tag3 = '{{tags.tag3}}'

...
tags.tag9 = '{{tags.tag9}}'

I tried to bisect the version in which this degradation first appears. The earliest version I could reproduce it on was 0.21. Versions 0.18, 0.19, and 0.20 worked fine, but that might simply be because I could not apply the same workload to them, whereas starting from 0.21 Vector is capable of handling a much higher workload.

The next step was to bisect the components in the topology. In the end I deduced that with log_to_metric disabled the instance works just fine, but as soon as I enable log_to_metric, CPU usage goes up to 100% and the component buffers blow up.

At that point it was clear something was wrong with that component, so I checked what kind of work the CPU was actually doing.

[flamegraph: perf-kernel-2]

Here we can see that 81% of total CPU time is spent in the log_to_metric component, more specifically in render_tags, which looks like this:

[screenshot: render_tags source]

The flamegraph also shows that the last frame executed before the locks are hit in the regex matching is the template's TryFrom trait implementation. So we follow the code from render_tags: apparently, for every single tag (of a counter, histogram, etc.) it calls another function, render_template:

image

Within render_template we see try_from called with a reference to the string, which invokes the TryFrom implementation:

[screenshot: TryFrom<&str> implementation for Template]
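The shape of the problem can be sketched in a few lines of Rust. This is a deliberately simplified model, not Vector's actual code: the Mutex here stands in for the regex crate's internal per-Regex cache lock, and the point is that TryFrom runs for every tag of every event, so every event hits the shared lock repeatedly:

```rust
use std::convert::TryFrom;
use std::sync::Mutex;

// Stand-in for the shared, lazily initialized pattern whose scratch
// state is guarded by a lock. In the real code this is the regex
// crate's internal per-Regex cache; a Mutex just makes the shared
// state explicit in this sketch.
static FIELD_PATTERN: Mutex<Option<String>> = Mutex::new(None);

#[allow(dead_code)]
struct Template {
    src: String,
    is_dynamic: bool,
}

impl TryFrom<&str> for Template {
    type Error = ();

    // Every conversion takes the shared lock to "match" the pattern,
    // and render_tags below does this once per tag, per event.
    fn try_from(src: &str) -> Result<Self, Self::Error> {
        let mut guard = FIELD_PATTERN.lock().unwrap();
        if guard.is_none() {
            *guard = Some("{{".to_string()); // "compile" once
        }
        let is_dynamic = src.contains(guard.as_deref().unwrap());
        Ok(Template { src: src.to_string(), is_dynamic })
    }
}

// With nine templated tags, each event takes the shared lock nine
// times; under many worker threads those acquisitions contend.
fn render_tags(tag_templates: &[&str]) -> usize {
    tag_templates
        .iter()
        .filter(|t| Template::try_from(**t).unwrap().is_dynamic)
        .count()
}

fn main() {
    let tags = ["{{tags.tag1}}", "{{tags.tag2}}", "static-value"];
    println!("{}", render_tags(&tags)); // prints "2"
}
```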

Some related links:

  1. The commit where the spinlocks were introduced
  2. Discussions 1 and 2 on the reasoning behind this implementation

I managed to reduce the lock contention by limiting the number of Vector threads. The other option is to remove this component from the topology.
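The thread-limiting workaround can be applied through Vector's `--threads` flag (see `vector --help`). The image tag, config path, and thread count below are illustrative:

```shell
# Cap Vector's worker threads to reduce contention on the shared lock.
docker run -d \
  -v /etc/vector:/etc/vector:ro \
  timberio/vector:0.24.2-debian \
  --config /etc/vector/vector.toml \
  --threads 8
```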

It is worth mentioning that this degradation has been observed in every release from 0.21 up to 0.24.2.

Configuration

[[transforms.transform_name.metrics]]
field = 'field-name'
name = 'name'
type = 'counter'

tags.tag1 = '{{tags.tag1}}'
tags.tag2 = '{{tags.tag2}}'
tags.tag3 = '{{tags.tag3}}'

...
tags.tag9 = '{{tags.tag9}}'

Version

0.21

Debug Output

No response

Example Data

No response

Additional Context

Vector is launched in a Docker container.

References

https://github.com/rust-lang/regex/issues/188
https://users.rust-lang.org/t/help-me-reduce-overhead-of-regex-matching/5220

jszwedko commented 2 years ago

Hi @dizlv !

Thanks for this report! We actually discovered this ourselves recently in https://github.com/vectordotdev/vector/issues/14864. A team member is working on it, so I expect we'll have a fix for it soon. I'll close this issue to roll it up into the other, but thank you for this detailed investigation.

jszwedko commented 2 years ago

Actually, we realized this is slightly different, but similar. I'll reopen to track it separately.

dizlv commented 2 years ago

Hey @jszwedko, thank you for the quick response!

I have checked the latest changes from @bruceg, and I wonder whether regex matching should be removed completely, at least for matching, with rendering eventually replaced by a custom template engine as well, instead of regexes. With the current regex library this might become a noticeable bottleneck in other places too (I am not very familiar with the Vector code base, so I am just assuming).

If you need more insight into the tests I performed, let me know. While I might not be able to provide the test input itself, I can still get perf reports and any other metrics you might be interested in!

bruceg commented 2 years ago

FWIW, I have another change in progress to template handling that completely removes regex parsing from the template render fast path. The change made in #14902 was a precursor to avoid a regression when that one is introduced, but it doesn't itself change the behavior much.
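For context, the direction of that change can be sketched as follows: parse the template once when the component is built, so the per-event render path does no pattern matching at all. The types and names here are illustrative, not Vector's actual implementation:

```rust
use std::collections::HashMap;

// A template is pre-parsed into literal and field parts.
enum Part {
    Literal(String),
    Field(String), // a `{{ field }}` reference
}

struct CompiledTemplate {
    parts: Vec<Part>,
}

impl CompiledTemplate {
    // All string scanning happens here, exactly once per template,
    // at component build time.
    fn parse(src: &str) -> Self {
        let mut parts = Vec::new();
        let mut rest = src;
        while let Some(start) = rest.find("{{") {
            if start > 0 {
                parts.push(Part::Literal(rest[..start].to_string()));
            }
            let after = &rest[start + 2..];
            match after.find("}}") {
                Some(end) => {
                    parts.push(Part::Field(after[..end].trim().to_string()));
                    rest = &after[end + 2..];
                }
                None => {
                    // Unclosed braces are kept as a literal.
                    parts.push(Part::Literal(rest[start..].to_string()));
                    rest = "";
                }
            }
        }
        if !rest.is_empty() {
            parts.push(Part::Literal(rest.to_string()));
        }
        CompiledTemplate { parts }
    }

    // The per-event fast path: no parsing, no regex, no shared locks.
    fn render(&self, event: &HashMap<&str, &str>) -> String {
        self.parts
            .iter()
            .map(|p| match p {
                Part::Literal(s) => s.as_str(),
                Part::Field(name) => event.get(name.as_str()).copied().unwrap_or(""),
            })
            .collect()
    }
}

fn main() {
    let tpl = CompiledTemplate::parse("{{tags.tag1}}-suffix");
    let mut event = HashMap::new();
    event.insert("tags.tag1", "value1");
    println!("{}", tpl.render(&event)); // prints "value1-suffix"
}
```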

bruceg commented 2 years ago

@dizlv Could you also try out the changes in #14908 above, which should be present in the latest nightly build? This should completely eliminate any regex lock contention.

dizlv commented 2 years ago

Hey @bruceg,

I just ran performance tests with the nightly build (vector 0.25.0 (x86_64-unknown-linux-gnu 0b83860 2022-10-28)). It now works much better: metrics are low as expected, no exploding buffers, and CPU usage looks fine.

I haven't attached perf to this, since the process is somewhat tedious in that environment, but let me know if you want to look at it.

Any ETA on when this might be released?

Thank you!

bruceg commented 2 years ago

I would be curious to see the perf data with the patch applied, if for no other reason than to see whether any obvious optimization targets remain. Having said that, I know all of the regex calls are removed from the hot path, and I am satisfied that the worst part of the above flamegraph will similarly be gone, so don't bother if it's going to be too much work.

It looks like those will make it into the 0.25.0 release, which is planned to go out early-mid next week.

dizlv commented 2 years ago

@bruceg I ran some tests with 0.25.1. It performed much better than the previous versions, but I just discovered that the problem is still there (initially I thought that after the fixes only network-related issues remained).

[flamegraph: perf-vector-0-25-1-log2metric]

Also, I have had no time to investigate this further, but when I disable the log_to_metric components, the topology runs very fast but drops logs. I have a lot of transforms in between and then a single Splunk endpoint sink (which is simply another Vector cluster) with the buffer configured to drop the newest logs. It fills the buffers but never manages to send them to the endpoint in time, and I wonder why – is it due to the TCP stack configuration, or should I look somewhere else?

The sink should apparently be sending faster, but it is not (utilization on that component is 100%).
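For reference, the buffer behavior described above corresponds to a sink config along these lines (the component names, endpoint, and sizes here are illustrative, not the actual setup):

```toml
[sinks.splunk_out]
type = "splunk_hec_logs"                # illustrative sink type
inputs = ["last_transform"]             # illustrative input name
endpoint = "https://splunk.example.com:8088"
default_token = "${SPLUNK_TOKEN}"

  [sinks.splunk_out.buffer]
  type = "memory"
  max_events = 10000                    # buffer fills up under load
  when_full = "drop_newest"             # newest events dropped, as observed
```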

I also had another idea: the batches empty within 1 second, so the sink simply fails to keep up with the rate? Other things I observed:

jszwedko commented 2 years ago

Hey @dizlv ! Do you still think the issue here is with lock contention? If not, we may want to open a new issue to track this and see what other performance improvements we can make to that transform. If you think the issue is still with lock contention, we can reopen this issue.

dizlv commented 2 years ago

@jszwedko I think it is somewhat the same? I haven't traced where exactly this happens though.

[Screenshot 2022-11-11 at 17:13:27]

jszwedko commented 2 years ago

👍 I'll re-open to get @bruceg's thoughts.