vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
17.49k stars 1.53k forks source link

vector top takes almost 2 minutes to start #16465

Open jblang opened 1 year ago

jblang commented 1 year ago

A note for the community

Problem

When I run vector top, it takes almost 2 minutes before anything appears on the screen. The screen clears with the cursor on the bottom left. It will quit if q is pressed during this time, so it's not completely hung.

I ran it under strace and the call that takes almost 2 minutes seems to be something to do with opening the tty device:

   115.257863 openat(AT_FDCWD, "/dev/tty", O_RDONLY|O_CLOEXEC) = 9

I'm using WSL2 with Windows Terminal on my local laptop. I have coworkers who confirmed the same issue connecting to the same Vector instance on both Mac and Linux though.

Here is the full strace. This was run with relative timestamps so the number at the beginning of the line is the time each syscall took in seconds.

I am running vector top on my laptop against Vector running in a pod that was deployed using the official helm charts and has the port forwarded using kubectl port-forward pod/vector-0 8686.

If I exec into the pod using kubectl exec -it vector-0 -- /bin/sh, then run vector top, it also hangs for the same amount of time.

Running vector tap does not take very long to start so I don't think it's related to network latency. We are running Vector 0.25.1 in the pod, but I tried upgrading my local vector to 0.27 to see if that made a difference and it did not.

Configuration

I don't think the specific configuration is particularly relevant, but API is configured as follows:

  api:
    address: 127.0.0.1:8686
    enabled: true
    playground: true

Version

0.25.1

Debug Output

This did not produce any additional output.

Example Data

No response

Additional Context

No response

References

No response

zamazan4ik commented 1 year ago

Sounds strange. I have tried to do the same things locally on my machine (Macbook M1 Pro) but it works fine. Could you please try to run a Vector instance locally without k8s and then connect to it once again?

bruceg commented 1 year ago

I have investigated the issue and, while I am unable to reproduce the problem on Linux, I can definitely see the pattern observed above in the strace. This comes from a size function in the crossterm crate, which needs the TTY device for determining the terminal size, which may change at runtime. Now, as far as I know, it does not need to close and reopen the device each time, so that seems wasteful, but it should be harmless. The fact that it is taking almost two minutes within the system call is somewhat out of our hands. I am, however, pursuing the issue with the crossterm devs as I have some ideas how it might be fixable.

jblang commented 1 year ago

I just tested vector top against a locally running vector instance and it started immediately. However this was just for a simple config that passes a stdin source to a stdout sink. But since you can see that the delay is inside the system call, it seems unlikely that it's related to running locally or not. In the real vector deployment we have many more transforms defined, so maybe it's caused somehow by the amount of components that vector top needs to display? It's not so easy to run our real vector config locally but I will see if I can figure out a way to reproduce this.

jblang commented 1 year ago

I wasn't able to reproduce this locally but I may have narrowed down the issue. I believe the delay may be related to the number of events that have been processed (perhaps when the number gets too wide to be displayed in a column?) We had over 56B events processed by some of our busiest transforms with vector having been up for about 15 days. When I restarted the vector pods, I found that vector top started up much quicker with only a few minutes worth of events to report. This is the output of vector top when it was taking multiple minutes to start:

β”ŒVector─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
β”‚http://127.0.0.1:8686/graphql | Sampling @ 500ms | Connected                                                                                                                                                                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”ŒComponents─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
β”‚ID                                   Output          Kind                Type                     Events In                Events Out               Bytes                    Errors       Mem Usage Bytes                                  β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚cleanup_fields                       --              transform           remap                    1,153,600,451 (4.88 k/s  1,153,600,496 (15.92 k/  N/A                      --           --                                               β”‚
β”‚fingerprint_exception                --              transform           remap                    39,560,883 (--/s)        39,560,883 (274/s)       N/A                      --           --                                               β”‚
β”‚fluent_in                            --              source              fluent                   1,379,567,733 (5.84 k/s  1,379,567,734 (14.76 k/  2,583,418,105,387 (--/s  --           --                                               β”‚
β”‚gcs_prom_convert                     --              transform           metric_to_log            142,314,425 (1.29 k/s)   142,314,425 (864/s)      N/A                      --           --                                               β”‚
β”‚gcs_prom_filter                      --              transform           filter                   18,763,983,516 (296.09   142,331,366 (1.73 k/s)   N/A                      --           --                                               β”‚
β”‚gcs_prom_sink                        --              sink                gcp_cloud_storage        142,341,520 (1.29 k/s)   142,339,956 (--/s)       53,968,867,333 (--/s)    --           --                                               β”‚
β”‚gcs_prom_source                      --              source              prometheus_remote_write  18,764,136,487 (236.82   18,764,093,485 (68.98 k  1,276,554,399,947 (--/s  --           --                                               β”‚
β”‚gcs_prom_tag                         --              transform           remap                    142,328,788 (1.19 k/s)   142,328,788 (864/s)      N/A                      --           --                                               β”‚
β”‚internal_metrics_cardinality_limit   --              transform           tag_cardinality_limit    56,209,095,222 (234.00   56,209,069,504 (452.09   N/A                      --           --                                               β”‚
β”‚internal_metrics_prom_sink           --              sink                prometheus_exporter      56,209,498,381 (323.47   41,445,335 (--/s)        30,562,853,677 (--/s)    --           --                                               β”‚
β”‚internal_metrics_source              --              source              internal_metrics         56,208,037,222 (2.36 M/  56,208,037,222 (--/s)    41,843,232,644,280 (--/  --           --                                               β”‚
β”‚internal_metrics_tag                 --              transform           remap                    56,211,772,574 (272.00   56,211,780,574 (466.00   N/A                      --           --                                               β”‚
β”‚mom_alert_to_metric_count            --              transform           log_to_metric            801 (--/s)               801 (--/s)               N/A                      --           --                                               β”‚
β”‚mom_alert_to_metric_enrich           --              transform           remap                    800 (--/s)               800 (--/s)               N/A                      --           --                                               β”‚
β”‚mom_alertmanager_watchdog_alert      --              source              http                     800 (--/s)               800 (--/s)               1,816,740 (--/s)         --           --                                               β”‚
β”‚mom_datadog_watchdog_metric          --              sink                datadog_metrics          801 (--/s)               801 (--/s)               149,787 (--/s)           --           --                                               β”‚
β”‚normalize_fields                     --              transform           remap                    1,370,851,593 (5.79 k/s  1,370,856,829 (14.19 k/  N/A                      --           --                                               β”‚
β”‚normalize_fluent                     --              transform           remap                    1,379,721,507 (3.79 k/s  1,379,721,237 (20.29 k/  N/A                      --           --                                               β”‚
β”‚parse_container                      --              transform           remap                    1,370,928,450 (5.86 k/s  1,370,927,903 (16.50 k/  N/A                      --           --                                               β”‚
β”‚route_container                                      transform           route                    1,370,917,577 (6.02 k/s  1,370,917,401 (13.00 k/  N/A                      --           --                                               β”‚
β”‚                                     exception                                                                             39,561,447 (286/s)                                                                                              β”‚
β”‚                                     _unmatched                                                                            1,331,349,909 (12.71 k/                                                                                         β”‚
β”‚route_fluent                                         transform           route                    1,379,625,087 (5.84 k/s  1,379,623,751 (15.30 k/  N/A                      --           --                                               β”‚
β”‚                                     container                                                                             1,370,720,373 (15.25 k/                                                                                         β”‚
β”‚                                     _unmatched                                                                            8,898,131 (52/s)                                                                                                β”‚
β”‚splunk_logs_out                      --              sink                splunk_hec_logs          1,162,399,635 (3.44 k/s  1,162,400,679 (16.13 k/  3,764,221,495,100 (--/s  --           --                                               β”‚
β”‚throttle_high                        --              transform           throttle                 926,969,714 (2.39 k/s)   864,442,698 (8.38 k/s)   N/A                      --           --                                               β”‚
β”‚throttle_key                         --              transform           remap                    1,370,749,471 (3.22 k/s  1,370,749,471 (14.11 k/  N/A                      --           --                                               β”‚
β”‚throttle_low                         --              transform           throttle                 18,749,914 (42/s)        10,352,239 (104/s)       N/A                      --           --                                               β”‚
β”‚throttle_lowest                      --              transform           throttle                 73,375,765 (262/s)       13,495,426 (98/s)        N/A                      --           --                                               β”‚
β”‚throttle_medium                      --              transform           throttle                 350,543,543 (768/s)      264,082,783 (2.68 k/s)   N/A                      --           --                                               β”‚
β”‚throttle_route                                       transform           route                    1,370,698,101 (2.47 k/s  1,370,698,021 (16.78 k/  N/A                      --           --                                               β”‚
β”‚                                     high                                                                                  926,895,855 (11.16 k/s)                                                                                         β”‚
β”‚                                     low                                                                                   18,748,687 (140/s)                                                                                              β”‚
β”‚                                     medium                                                                                350,551,387 (4.94 k/s)                                                                                          β”‚
β”‚                                     lowest                                                                                73,365,596 (540/s)                                                                                              β”‚
β”‚                                     _unmatched                                                                            1,134,752 (2/s)                                                                                                 β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β”‚                                                                                                                                                                                                                                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚To quit, press ESC or 'q'                                                                                                                                                                                                                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

I set up the following dummy pipeline to try to reproduce it:

api:
  enabled: true

sources:
  input_events:
    type: stdin
    host_key: ""

transforms:
  extraneous_transform_with_a_really_long_name_1:
      type: remap
      inputs:
        - input_events
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_2:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_1
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_3:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_2
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_4:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_3
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_5:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_4
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_6:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_5
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_7:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_6
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_8:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_7
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_9:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_8
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_10:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_9
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_11:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_10
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_12:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_11
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_13:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_12
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_14:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_13
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_15:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_14
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_16:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_15
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_17:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_16
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_18:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_17
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_19:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_18
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_20:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_19
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_21:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_20
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_22:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_21
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_23:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_22
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_24:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_23
      source: |-
        # passthru
  extraneous_transform_with_a_really_long_name_25:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_24
      source: |-
        # passthru
  final_events:
      type: remap
      inputs:
        - extraneous_transform_with_a_really_long_name_25
      source: |-
        # passthru

sinks:
  blackhole:
    inputs:
      - final_events
    type: blackhole

I am running this with:

yes | vector --config dummy.yaml

And then running vector top in a different terminal.

So far, I haven't been able to repro but it hasn't reached nearly the number of events processed by our real vector deployment yet. I will leave it running for a while and see what happens.

jblang commented 1 year ago

Still no luck in reproducing locally, but I did get a co-worker to confirm he hits the same issue from Linux. I've removed WSL2 from the subject to reflect that it does not appear to be specific to the client OS used. Also found that it comes up almost immediately if I set the --interval to 1 instead of the default 500 ms.

So reiterating my findings:

jblang commented 1 year ago

The more I think about it, the more I am convinced it has something to do with what Vector is trying to display on screen. Possibly during initialization, it's trying to calculate the optimal column size for some rapidly fluctuating values and it keeps oscillating until it finds equilibrium. The 1ms polling interval would allow the oscillations to stabilize more quickly before it settles on a size for each column. When I tried to reproduce locally, I was feeding it a steady stream of events from the yes command so it wouldn't have been oscillating.

jblang commented 1 year ago

@bruceg bump. Just wanted to make sure you saw my last comment on this.

jblang commented 9 months ago

Bumping this again... any possibility I'm right about my theory in my last comment?

bruceg commented 9 months ago

It seems likely that it has something to do with what it is trying to display. I am doubtful about it having anything to do with fluctuating values and trying to settle on a size, though, given what I had seen in the code previously. It could be, though there isn't anything explicitly looping to settle on a value. We haven't had time to look at this, but doing a quick look now, I wonder if there might be a channel size issue. What happens if you change the channel sizes in src/top/*.rs from 20 to something much larger? Also, is the CPU load high while looping? It might be hard to tell if Vector is busy, I'm just wondering if there is some quadratic or worse behavior in the updater function or some other initialization.