Open jblang opened 1 year ago
Sounds strange. I have tried to do the same things locally on my machine (Macbook M1 Pro) but it works fine. Could you please try to run a Vector instance locally without k8s and then connect to it once again?
I have investigated the issue and, while I am unable to reproduce the problem on Linux, I can definitely see the pattern observed above in the strace. This comes from a size
function in the crossterm crate, which needs the TTY device for determining the terminal size, which may change at runtime. Now, as far as I know, it does not need to close and reopen the device each time, so that seems wasteful, but it should be harmless. The fact that it is taking almost two minutes within the system call is somewhat out of our hands. I am, however, pursuing the issue with the crossterm devs as I have some ideas how it might be fixable.
I just tested vector top
against a locally running vector instance and it started immediately. However this was just for a simple config that passes a stdin source to a stdout sink. But since you can see that the delay is inside the system call, it seems unlikely that it's related to running locally or not. In the real vector deployment we have many more transforms defined, so maybe it's caused somehow by the amount of components that vector top
needs to display? It's not so easy to run our real vector config locally but I will see if I can figure out a way to reproduce this.
I wasn't able to reproduce this locally but I may have narrowed down the issue. I believe the delay may be related to the number of events that have been processed (perhaps when the number gets too wide to be displayed in a column?) We had over 56B events processed by some of our busiest transforms with vector having been up for about 15 days. When I restarted the vector pods, I found that vector top
started up much quicker with only a few minutes worth of events to report. This is the output of vector top
when it was taking multiple minutes to start:
βVectorββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βhttp://127.0.0.1:8686/graphql | Sampling @ 500ms | Connected β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βComponentsββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βID Output Kind Type Events In Events Out Bytes Errors Mem Usage Bytes β
β β
βcleanup_fields -- transform remap 1,153,600,451 (4.88 k/s 1,153,600,496 (15.92 k/ N/A -- -- β
βfingerprint_exception -- transform remap 39,560,883 (--/s) 39,560,883 (274/s) N/A -- -- β
βfluent_in -- source fluent 1,379,567,733 (5.84 k/s 1,379,567,734 (14.76 k/ 2,583,418,105,387 (--/s -- -- β
βgcs_prom_convert -- transform metric_to_log 142,314,425 (1.29 k/s) 142,314,425 (864/s) N/A -- -- β
βgcs_prom_filter -- transform filter 18,763,983,516 (296.09 142,331,366 (1.73 k/s) N/A -- -- β
βgcs_prom_sink -- sink gcp_cloud_storage 142,341,520 (1.29 k/s) 142,339,956 (--/s) 53,968,867,333 (--/s) -- -- β
βgcs_prom_source -- source prometheus_remote_write 18,764,136,487 (236.82 18,764,093,485 (68.98 k 1,276,554,399,947 (--/s -- -- β
βgcs_prom_tag -- transform remap 142,328,788 (1.19 k/s) 142,328,788 (864/s) N/A -- -- β
βinternal_metrics_cardinality_limit -- transform tag_cardinality_limit 56,209,095,222 (234.00 56,209,069,504 (452.09 N/A -- -- β
βinternal_metrics_prom_sink -- sink prometheus_exporter 56,209,498,381 (323.47 41,445,335 (--/s) 30,562,853,677 (--/s) -- -- β
βinternal_metrics_source -- source internal_metrics 56,208,037,222 (2.36 M/ 56,208,037,222 (--/s) 41,843,232,644,280 (--/ -- -- β
βinternal_metrics_tag -- transform remap 56,211,772,574 (272.00 56,211,780,574 (466.00 N/A -- -- β
βmom_alert_to_metric_count -- transform log_to_metric 801 (--/s) 801 (--/s) N/A -- -- β
βmom_alert_to_metric_enrich -- transform remap 800 (--/s) 800 (--/s) N/A -- -- β
βmom_alertmanager_watchdog_alert -- source http 800 (--/s) 800 (--/s) 1,816,740 (--/s) -- -- β
βmom_datadog_watchdog_metric -- sink datadog_metrics 801 (--/s) 801 (--/s) 149,787 (--/s) -- -- β
βnormalize_fields -- transform remap 1,370,851,593 (5.79 k/s 1,370,856,829 (14.19 k/ N/A -- -- β
βnormalize_fluent -- transform remap 1,379,721,507 (3.79 k/s 1,379,721,237 (20.29 k/ N/A -- -- β
βparse_container -- transform remap 1,370,928,450 (5.86 k/s 1,370,927,903 (16.50 k/ N/A -- -- β
βroute_container transform route 1,370,917,577 (6.02 k/s 1,370,917,401 (13.00 k/ N/A -- -- β
β exception 39,561,447 (286/s) β
β _unmatched 1,331,349,909 (12.71 k/ β
βroute_fluent transform route 1,379,625,087 (5.84 k/s 1,379,623,751 (15.30 k/ N/A -- -- β
β container 1,370,720,373 (15.25 k/ β
β _unmatched 8,898,131 (52/s) β
βsplunk_logs_out -- sink splunk_hec_logs 1,162,399,635 (3.44 k/s 1,162,400,679 (16.13 k/ 3,764,221,495,100 (--/s -- -- β
βthrottle_high -- transform throttle 926,969,714 (2.39 k/s) 864,442,698 (8.38 k/s) N/A -- -- β
βthrottle_key -- transform remap 1,370,749,471 (3.22 k/s 1,370,749,471 (14.11 k/ N/A -- -- β
βthrottle_low -- transform throttle 18,749,914 (42/s) 10,352,239 (104/s) N/A -- -- β
βthrottle_lowest -- transform throttle 73,375,765 (262/s) 13,495,426 (98/s) N/A -- -- β
βthrottle_medium -- transform throttle 350,543,543 (768/s) 264,082,783 (2.68 k/s) N/A -- -- β
βthrottle_route transform route 1,370,698,101 (2.47 k/s 1,370,698,021 (16.78 k/ N/A -- -- β
β high 926,895,855 (11.16 k/s) β
β low 18,748,687 (140/s) β
β medium 350,551,387 (4.94 k/s) β
β lowest 73,365,596 (540/s) β
β _unmatched 1,134,752 (2/s) β
β β
β β
β β
β β
β β
β β
β β
β β
β β
β β
β β
β β
β β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βTo quit, press ESC or 'q' β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
I set up the following dummy pipeline to try to reproduce it:
api:
enabled: true
sources:
input_events:
type: stdin
host_key: ""
transforms:
extraneous_transform_with_a_really_long_name_1:
type: remap
inputs:
- input_events
source: |-
# passthru
extraneous_transform_with_a_really_long_name_2:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_1
source: |-
# passthru
extraneous_transform_with_a_really_long_name_3:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_2
source: |-
# passthru
extraneous_transform_with_a_really_long_name_4:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_3
source: |-
# passthru
extraneous_transform_with_a_really_long_name_5:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_4
source: |-
# passthru
extraneous_transform_with_a_really_long_name_6:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_5
source: |-
# passthru
extraneous_transform_with_a_really_long_name_7:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_6
source: |-
# passthru
extraneous_transform_with_a_really_long_name_8:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_7
source: |-
# passthru
extraneous_transform_with_a_really_long_name_9:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_8
source: |-
# passthru
extraneous_transform_with_a_really_long_name_10:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_9
source: |-
# passthru
extraneous_transform_with_a_really_long_name_11:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_10
source: |-
# passthru
extraneous_transform_with_a_really_long_name_12:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_11
source: |-
# passthru
extraneous_transform_with_a_really_long_name_13:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_12
source: |-
# passthru
extraneous_transform_with_a_really_long_name_14:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_13
source: |-
# passthru
extraneous_transform_with_a_really_long_name_15:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_14
source: |-
# passthru
extraneous_transform_with_a_really_long_name_16:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_15
source: |-
# passthru
extraneous_transform_with_a_really_long_name_17:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_16
source: |-
# passthru
extraneous_transform_with_a_really_long_name_18:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_17
source: |-
# passthru
extraneous_transform_with_a_really_long_name_19:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_18
source: |-
# passthru
extraneous_transform_with_a_really_long_name_20:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_19
source: |-
# passthru
extraneous_transform_with_a_really_long_name_21:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_20
source: |-
# passthru
extraneous_transform_with_a_really_long_name_22:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_21
source: |-
# passthru
extraneous_transform_with_a_really_long_name_23:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_22
source: |-
# passthru
extraneous_transform_with_a_really_long_name_24:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_23
source: |-
# passthru
extraneous_transform_with_a_really_long_name_25:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_24
source: |-
# passthru
final_events:
type: remap
inputs:
- extraneous_transform_with_a_really_long_name_25
source: |-
# passthru
sinks:
blackhole:
inputs:
- final_events
type: blackhole
I am running this with:
yes | vector --config dummy.yaml
And then running vector top
in a different terminal.
So far, I haven't been able to repro but it hasn't reached nearly the number of events processed by our real vector deployment yet. I will leave it running for a while and see what happens.
Still no luck in reproducing locally, but I did get a co-worker to confirm he hits the same issue from Linux. I've removed WSL2 from the subject to reflect that it does not appear to be specific to the client OS used. Also found that it comes up almost immediately if I set the --interval
to 1 instead of the default 500 ms.
So reiterating my findings:
exec
into the pod and run it thereThe more I think about it, the more I am convinced it has something to do with what Vector is trying to display on screen. Possibly during initialization, it's trying to calculate the optimal column size for some rapidly fluctuating values and it keeps oscillating until it finds equilibrium. The 1ms polling interval would allow the oscillations to stabilize more quickly before it settles on a size for each column. When I tried to reproduce locally, I was feeding it a steady stream of events from the yes
command so it wouldn't have been oscillating.
@bruceg bump. Just wanted to make sure you saw my last comment on this.
Bumping this again... any possibility I'm right about my theory in my last comment?
It seems likely that it has something to do with what it is trying to display. I am doubtful about it having anything to do with fluctuating values and trying to settle on a size, though, given what I had seen in the code previously. It could be, though there isn't anything explicitly looping to settle on a value. We haven't had time to look at this, but doing a quick look now, I wonder if there might be a channel size issue. What happens if you change the channel sizes in src/top/*.rs
from 20 to something much larger? Also, is the CPU load high while looping? It might be hard to tell if Vector is busy, I'm just wondering if there is some quadratic or worse behavior in the updater
function or some other initialization.
A note for the community
Problem
When I run
vector top
, it takes almost 2 minutes before anything appears on the screen. The screen clears with the cursor on the bottom left. It will quit ifq
is pressed during this time, so it's not completely hung.I ran it under strace and the call that takes almost 2 minutes seems to be something to do with opening the tty device:
I'm using WSL2 with Windows Terminal on my local laptop. I have coworkers who confirmed the same issue connecting to the same Vector instance on both Mac and Linux though.
Here is the full strace. This was run with relative timestamps so the number at the beginning of the line is the time each syscall took in seconds.
I am running
vector top
on my laptop against Vector running in a pod that was deployed using the official helm charts and has the port forwarded usingkubectl port-forward pod/vector-0 8686
.If I exec into the pod using
kubectl exec -it vector-0 -- /bin/sh
, then runvector top
, it also hangs for the same amount of time.Running
vector tap
does not take very long to start so I don't think it's related to network latency. We are running Vector 0.25.1 in the pod, but I tried upgrading my local vector to 0.27 to see if that made a difference and it did not.Configuration
Version
0.25.1
Debug Output
Example Data
No response
Additional Context
No response
References
No response