vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

fix(ARC): Fix possible deadlock in adaptive concurrency decrease #21344

Closed: bruceg closed this 1 month ago

bruceg commented 1 month ago

When the decrease ratio is set to a value less than 0.5, the number of concurrency slots to forget under back pressure can equal the current concurrency limit. This drops the concurrency limit to zero, which leads to a deadlock.

Fixes #21340
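
The arithmetic behind the description can be illustrated with a minimal sketch. It is not the code from this patch: the function names are hypothetical, and both the truncating multiplication and the clamp to one slot are assumptions chosen to reproduce the "ratio below 0.5" behavior described above.

```python
def slots_to_forget(limit: int, decrease_ratio: float) -> int:
    """Unguarded decrease (assumed model): shrink the limit to int(limit * ratio)."""
    new_limit = int(limit * decrease_ratio)  # int() truncates toward zero
    return limit - new_limit


def slots_to_forget_clamped(limit: int, decrease_ratio: float) -> int:
    """Guarded decrease (assumed fix): the new limit never falls below one slot."""
    new_limit = max(1, int(limit * decrease_ratio))
    return limit - new_limit


limit = 2
# With a ratio of 0.4, the unguarded version forgets every slot.
print(limit - slots_to_forget(limit, 0.4))          # 0
# The clamped version always leaves one slot available.
print(limit - slots_to_forget_clamped(limit, 0.4))  # 1
```

With a limit of 2, any ratio below 0.5 truncates the new limit to zero in this model, while a ratio of exactly 0.5 still leaves one slot.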

datadog-vectordotdev[bot] commented 1 month ago

Datadog Report

Branch report: bruceg/fix-adaptive-concurrency-deadlock

Commit report: 1259833

Test service: vector

✅ 0 Failed, 444 Passed, 0 Skipped, 4m 17.32s Total Time

github-actions[bot] commented 1 month ago

Regression Detector Results

Run ID: b20469c3-343f-4a47-b24c-db3f0a7cde9f

Baseline: 6e47077efb9ea78e757383da0e49db08f5378212 Comparison: 34f78db804ce5df0b9d515710d0699d3cf72ef07

Performance changes are noted in the perf column of each table: ➖ marks no significant change and ❌ marks a significantly worse result for the comparison variant.

No significant changes in experiment optimization goals

Confidence level: 90.00% Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing `erratic: true` are ignored.

| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|------|------------|------|----------|-------------|-------|
| ➖ | file_to_blackhole | egress throughput | -0.12 | [-7.19, +6.96] | |

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|------|------------|------|----------|-------------|-------|
| ➖ | datadog_agent_remap_blackhole | ingress throughput | +3.77 | [+3.51, +4.04] | |
| ➖ | syslog_regex_logs2metric_ddmetrics | ingress throughput | +1.52 | [+1.38, +1.66] | |
| ➖ | http_to_http_acks | ingress throughput | +1.04 | [-0.18, +2.26] | |
| ➖ | syslog_log2metric_humio_metrics | ingress throughput | +0.70 | [+0.56, +0.83] | |
| ➖ | socket_to_socket_blackhole | ingress throughput | +0.41 | [+0.35, +0.46] | |
| ➖ | http_to_http_noack | ingress throughput | +0.12 | [+0.05, +0.19] | |
| ➖ | http_to_http_json | ingress throughput | +0.02 | [-0.02, +0.07] | |
| ➖ | splunk_hec_to_splunk_hec_logs_noack | ingress throughput | +0.01 | [-0.08, +0.10] | |
| ➖ | splunk_hec_to_splunk_hec_logs_acks | ingress throughput | -0.00 | [-0.09, +0.09] | |
| ➖ | splunk_hec_indexer_ack_blackhole | ingress throughput | -0.01 | [-0.09, +0.07] | |
| ➖ | file_to_blackhole | egress throughput | -0.12 | [-7.19, +6.96] | |
| ➖ | http_to_s3 | ingress throughput | -0.56 | [-0.82, -0.29] | |
| ➖ | fluent_elasticsearch | ingress throughput | -0.64 | [-1.12, -0.15] | |
| ➖ | datadog_agent_remap_blackhole_acks | ingress throughput | -0.65 | [-0.77, -0.52] | |
| ➖ | syslog_log2metric_splunk_hec_metrics | ingress throughput | -0.81 | [-0.92, -0.71] | |
| ➖ | datadog_agent_remap_datadog_logs_acks | ingress throughput | -1.00 | [-1.19, -0.81] | |
| ➖ | otlp_http_to_blackhole | ingress throughput | -1.09 | [-1.27, -0.90] | |
| ➖ | syslog_humio_logs | ingress throughput | -1.17 | [-1.30, -1.05] | |
| ➖ | http_elasticsearch | ingress throughput | -1.37 | [-1.58, -1.15] | |
| ➖ | datadog_agent_remap_datadog_logs | ingress throughput | -1.47 | [-1.69, -1.25] | |
| ➖ | splunk_hec_route_s3 | ingress throughput | -1.74 | [-2.05, -1.43] | |
| ➖ | http_text_to_http_json | ingress throughput | -1.76 | [-1.89, -1.63] | |
| ➖ | syslog_log2metric_tag_cardinality_limit_blackhole | ingress throughput | -1.82 | [-1.93, -1.72] | |
| ➖ | syslog_loki | ingress throughput | -1.93 | [-1.99, -1.87] | |
| ➖ | otlp_grpc_to_blackhole | ingress throughput | -2.04 | [-2.16, -1.93] | |
| ➖ | syslog_splunk_hec_logs | ingress throughput | -2.96 | [-3.05, -2.87] | |

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that *if our statistical model is accurate*, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
3. Its configuration does not mark it "erratic".

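
The three criteria can be sketched as a small predicate. This is an illustration only, not the detector's implementation: the function name and parameters are hypothetical, and the sample rows are copied from the result tables in this thread.

```python
def is_regression(delta_mean_pct: float, ci_low: float, ci_high: float,
                  erratic: bool = False, tolerance: float = 5.0) -> bool:
    """Apply the three stated criteria to one experiment row."""
    big_enough = abs(delta_mean_pct) >= tolerance   # criterion 1
    ci_excludes_zero = ci_low > 0 or ci_high < 0    # criterion 2
    return big_enough and ci_excludes_zero and not erratic  # criterion 3


# otlp_http_to_blackhole: -5.34 [-5.48, -5.20]
print(is_regression(-5.34, -5.48, -5.20))                # True
# file_to_blackhole (erratic): -13.82 [-20.03, -7.62]
print(is_regression(-13.82, -20.03, -7.62, erratic=True))  # False
# datadog_agent_remap_blackhole: +3.77 [+3.51, +4.04]
print(is_regression(3.77, 3.51, 4.04))                   # False
```

The last row shows why a confidence interval that excludes zero is not enough on its own: the effect must also reach the 5.00% tolerance.
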
github-actions[bot] commented 1 month ago

Regression Detector Results

Run ID: e8cc3937-25cb-42fe-b6d7-e23431625d4d

Baseline: f99e052b54fc9c32731694f258b30360e28b68ac Comparison: 219dd4dd6dca667c133a1f68dd19013b8945946e

Performance changes are noted in the perf column of each table.

Significant changes in experiment optimization goals

Confidence level: 90.00% Effect size tolerance: |Δ mean %| ≥ 5.00%

| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|------|------------|------|----------|-------------|-------|
| ❌ | otlp_http_to_blackhole | ingress throughput | -5.34 | [-5.48, -5.20] | |

Experiments ignored for regressions

Regressions in experiments with settings containing `erratic: true` are ignored.

| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|------|------------|------|----------|-------------|-------|
| ❌ | file_to_blackhole | egress throughput | -13.82 | [-20.03, -7.62] | |

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|------|------------|------|----------|-------------|-------|
| ➖ | datadog_agent_remap_datadog_logs_acks | ingress throughput | +4.46 | [+4.08, +4.84] | |
| ➖ | syslog_log2metric_humio_metrics | ingress throughput | +2.85 | [+2.68, +3.03] | |
| ➖ | datadog_agent_remap_blackhole_acks | ingress throughput | +1.80 | [+1.69, +1.92] | |
| ➖ | datadog_agent_remap_blackhole | ingress throughput | +0.66 | [+0.55, +0.77] | |
| ➖ | http_to_s3 | ingress throughput | +0.60 | [+0.28, +0.92] | |
| ➖ | otlp_grpc_to_blackhole | ingress throughput | +0.44 | [+0.33, +0.56] | |
| ➖ | splunk_hec_route_s3 | ingress throughput | +0.34 | [+0.02, +0.66] | |
| ➖ | http_to_http_acks | ingress throughput | +0.26 | [-0.97, +1.49] | |
| ➖ | syslog_log2metric_splunk_hec_metrics | ingress throughput | +0.18 | [+0.06, +0.31] | |
| ➖ | http_to_http_noack | ingress throughput | +0.12 | [+0.05, +0.19] | |
| ➖ | fluent_elasticsearch | ingress throughput | +0.05 | [-0.44, +0.54] | |
| ➖ | http_to_http_json | ingress throughput | +0.03 | [-0.01, +0.08] | |
| ➖ | splunk_hec_to_splunk_hec_logs_acks | ingress throughput | +0.01 | [-0.10, +0.11] | |
| ➖ | splunk_hec_to_splunk_hec_logs_noack | ingress throughput | -0.00 | [-0.10, +0.09] | |
| ➖ | splunk_hec_indexer_ack_blackhole | ingress throughput | -0.01 | [-0.09, +0.06] | |
| ➖ | http_text_to_http_json | ingress throughput | -0.06 | [-0.18, +0.06] | |
| ➖ | socket_to_socket_blackhole | ingress throughput | -0.08 | [-0.14, -0.01] | |
| ➖ | syslog_loki | ingress throughput | -0.08 | [-0.15, -0.01] | |
| ➖ | http_elasticsearch | ingress throughput | -0.75 | [-0.91, -0.60] | |
| ➖ | syslog_regex_logs2metric_ddmetrics | ingress throughput | -1.28 | [-1.43, -1.12] | |
| ➖ | syslog_splunk_hec_logs | ingress throughput | -1.28 | [-1.38, -1.17] | |
| ➖ | datadog_agent_remap_datadog_logs | ingress throughput | -1.79 | [-2.04, -1.53] | |
| ➖ | syslog_humio_logs | ingress throughput | -2.11 | [-2.23, -2.00] | |
| ➖ | syslog_log2metric_tag_cardinality_limit_blackhole | ingress throughput | -3.59 | [-3.68, -3.51] | |
| ❌ | otlp_http_to_blackhole | ingress throughput | -5.34 | [-5.48, -5.20] | |
| ❌ | file_to_blackhole | egress throughput | -13.82 | [-20.03, -7.62] | |

github-actions[bot] commented 1 month ago

Regression Detector Results

Run ID: 9f738ea4-71f4-4bd1-84ee-fc03c43e6b8b

Baseline: f99e052b54fc9c32731694f258b30360e28b68ac Comparison: ca0fa057eaa128beb7777428f79cec9924f1d396

Performance changes are noted in the perf column of each table.

No significant changes in experiment optimization goals

Confidence level: 90.00% Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing `erratic: true` are ignored.

| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|------|------------|------|----------|-------------|-------|
| ➖ | file_to_blackhole | egress throughput | -6.08 | [-12.83, +0.68] | |

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|------|------------|------|----------|-------------|-------|
| ➖ | http_elasticsearch | ingress throughput | +3.20 | [+3.03, +3.36] | |
| ➖ | fluent_elasticsearch | ingress throughput | +2.69 | [+2.19, +3.18] | |
| ➖ | syslog_log2metric_humio_metrics | ingress throughput | +2.22 | [+2.13, +2.32] | |
| ➖ | datadog_agent_remap_blackhole_acks | ingress throughput | +2.21 | [+2.10, +2.31] | |
| ➖ | datadog_agent_remap_datadog_logs | ingress throughput | +1.19 | [+1.00, +1.38] | |
| ➖ | http_to_http_acks | ingress throughput | +1.19 | [-0.05, +2.43] | |
| ➖ | http_text_to_http_json | ingress throughput | +0.50 | [+0.36, +0.64] | |
| ➖ | http_to_http_noack | ingress throughput | +0.15 | [+0.07, +0.23] | |
| ➖ | splunk_hec_indexer_ack_blackhole | ingress throughput | +0.02 | [-0.06, +0.10] | |
| ➖ | http_to_http_json | ingress throughput | +0.02 | [-0.02, +0.05] | |
| ➖ | splunk_hec_to_splunk_hec_logs_noack | ingress throughput | +0.01 | [-0.08, +0.10] | |
| ➖ | splunk_hec_to_splunk_hec_logs_acks | ingress throughput | -0.00 | [-0.11, +0.10] | |
| ➖ | syslog_log2metric_splunk_hec_metrics | ingress throughput | -0.31 | [-0.40, -0.22] | |
| ➖ | syslog_loki | ingress throughput | -0.32 | [-0.39, -0.24] | |
| ➖ | datadog_agent_remap_datadog_logs_acks | ingress throughput | -0.42 | [-0.60, -0.24] | |
| ➖ | datadog_agent_remap_blackhole | ingress throughput | -0.72 | [-0.82, -0.63] | |
| ➖ | http_to_s3 | ingress throughput | -0.93 | [-1.21, -0.65] | |
| ➖ | syslog_splunk_hec_logs | ingress throughput | -1.28 | [-1.37, -1.20] | |
| ➖ | syslog_log2metric_tag_cardinality_limit_blackhole | ingress throughput | -1.46 | [-1.54, -1.38] | |
| ➖ | otlp_grpc_to_blackhole | ingress throughput | -1.47 | [-1.57, -1.36] | |
| ➖ | syslog_humio_logs | ingress throughput | -1.61 | [-1.73, -1.49] | |
| ➖ | syslog_regex_logs2metric_ddmetrics | ingress throughput | -1.77 | [-1.92, -1.61] | |
| ➖ | socket_to_socket_blackhole | ingress throughput | -2.36 | [-2.41, -2.30] | |
| ➖ | splunk_hec_route_s3 | ingress throughput | -2.56 | [-2.85, -2.27] | |
| ➖ | otlp_http_to_blackhole | ingress throughput | -3.18 | [-3.30, -3.07] | |
| ➖ | file_to_blackhole | egress throughput | -6.08 | [-12.83, +0.68] | |
