Open dizlv opened 2 years ago
Hi @dizlv !
Thanks for this report! We actually discovered this ourselves recently in https://github.com/vectordotdev/vector/issues/14864 . A team member is working on it so I expect we'll have a fix for it soon. I'll close this issue to roll it up into the other, but thank you for this detailed investigation.
Actually, we realized this is slightly different, but similar. I'll reopen to track it separately.
Hey @jszwedko, thank you for quick response!
I have checked latest changes from @bruceg, and I wonder should regex matching be completely removed at least for matching and then at some point rendering replaced as well with custom template engine, instead of regex'es. Because with current regex
library it might become noticeable bottleneck in other places too (I am not very familiar with Vector code base, so just assuming).
If you need more insights regards the performed tests, let me know. Whilst I might not be able to provide test input itself, I still can get perf
reports and any other metrics that you might be interested in!
FWIW I have another change in progress to template handling that completely removes regex parsing from the template render fast path. The change made is #14902 was a precursor to avoid a regression when that one is introduced, but doesn't itself make a big change to the behavior.
@dizlv Could you also try out the changes in #14908 above, which should be present in the latest nightly build? This should completely eliminate any regex lock contention.
Hey @bruceg,
Just ran performance tests with nightly build (vector 0.25.0 (x86_64-unknown-linux-gnu 0b83860 2022-10-28
), now it works way much better, metrics are as expected low, no exploded buffers and CPU usage looks fine.
I haven't attached perf
to this, since the process is somewhat tedious in that environment, but let me know if you want to look at it.
Any ETA when this might get released?
Thank you!
I would be curious to see the perf
data with the patch attached, if for no other reason than to see if there were any remaining obvious optimization targets. Having said that, I know all of the regex calls are gone removed from the hot path, and I am satisfied that the worst part of the above flamegraph will similarly be gone, so don't bother if it's going to be too much work.
It looks like those will make it into the 0.25.0 release, which is planned to go out early-mid next week.
@bruceg I ran some tests with 0.25.1
, it performed way better than previous ones, but I just discovered, that the problem is still there (initially I thought that the after fixes we had only network-related issues).
Also, I had no time to investigate this further, but when I disable log_to_metric
components topology starts working very fast, but drops logs. I have a lot of transformations in between and then single Splunk endpoint sink (that's simply another Vector cluster) with configured buffer strategy to drop newest logs. It fills the buffers but never manages to send them to the endpoint timely and I wonder why – is it due to the TCP stack configuration or I should look somewhere else?
Apparently sink should send things faster, but it's not (utilization on that component is 100%).
I had also another idea – batches empties within 1 second and therefore it simply fails to keep up with the speed? Another things that I observed:
adaptive_concurrency_in_flight
on first clusterHey @dizlv ! Do you still think the issue here is with lock contention? If not, we may want to open a new issue to track this and see what other performance improvements we can make to that transform. If you think the issue is still with lock contention, we can reopen this issue.
@jszwedko I think it is somewhat the same? I haven't traced where exactly this happens though.
👍 I'll re-open to get @bruceg's thoughts.
A note for the community
Problem
I have encountered strange performance degradation during Vector upgrade from
0.18
to0.24.2
. After dozens of performance tests and investigations I think I have found the issue.Tests were performed on
C5.9xlarge
Instance (AWS). Vector is launched in Docker. At the end of the topology we have somelog_to_metric
components that have multiple tags, something like this:I tried to bisect the version in which this degradation started to happen. So the first time I found it was in
0.21
.0.18
,0.19
and0.20
worked fine, but this might be simply due to the fact that I did not manage to apply the same workload on them, whereas starting from0.21
Vector is capable of handling way more workload.Next step was to bisect components in the topology. At the end I managed to deduce, that once I disable
log_to_metric
– instance works just fine, but should I enablelog_to_metric
and CPU usage goes up to 100% and component buffers are exploding.From that point on I understood, that something is clearly wrong with that and decided to check what type of workload actually CPU is handling.
Here we see that 81% of whole CPU is being spent in
log_to_metric
component, more specifically in render_tags that looks like this:From the Flamegraph we also see that last time before locks are thrown into the regex matching is executed in template
TryFrom
trait implementation. So we follow the code logic fromrender_tags
. Apparently for every single “tag” (which can be counter, histogram, etc) we execute another functionrender_template
:Within
render_template
we seetry_from
with reference to the string, which triggersTryFrom
trait. Following implementation:Some related links:
I managed to reduce locks congestion by limiting vector threads amount. The second option is to remove this component from the topology.
Worth mentioning that this degradation has been observed in every release starting from 0.21 up to 0.24.2
Configuration
Version
0.21
Debug Output
No response
Example Data
No response
Additional Context
Vector is launched in Docker container
References
https://github.com/rust-lang/regex/issues/188 https://users.rust-lang.org/t/help-me-reduce-overhead-of-regex-matching/5220