vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
18.27k stars 1.61k forks source link

Vector doesn't start after upgrading from v0.33.1-1 to v.0.34.1-1 #19333

Open LinoVieira13 opened 11 months ago

LinoVieira13 commented 11 months ago

A note for the community

Problem

Hello, I'm opening this bug at the request of jszwedko via discord. So, I'm using an Ubuntu 22.04 machine, and I have vector v0.33.1-1 already installed and working, however, when I try to upgrade the package via Ubuntu APT to version v.0.34.1-1 it fails to start after 1m30s (timeout). I don't have a /etc/vector/vector.yaml file, because I'm using environment variables to point the VECTOR_CONFIG_DIR to this folder: /etc/vector/conf.d. This is the error when it tries to start the service after the upgrade:

$ sudo systemctl status vector

● vector.service - Vector
     Loaded: loaded (/lib/systemd/system/vector.service; enabled; vendor preset: enabled)
    Drop-In: /run/systemd/system/service.d
             └─zzz-lxc-service.conf
     Active: deactivating (stop-sigterm) (Result: timeout) since Thu 2023-12-07 12:31:58 UTC; 3min 27s ago
       Docs: https://vector.dev
Cntrl PID: 610422 (vector)
      Tasks: 58 (limit: 154100)
     Memory: 17.9M
     CGroup: /system.slice/vector.service
             └─610422 /usr/bin/vector validate

Dec 07 12:32:46 vector-01 vector[610422]: 2023-12-07T12:32:46.453286Z DEBUG vector::topology::builder: Building new source. component=XXXXXX
Dec 07 12:32:46 vector-01 vector[610422]: 2023-12-07T12:32:46.456647Z DEBUG vector::topology::builder: Building new transform. component=XXXXXX
Dec 07 12:32:46 vector-01 vector[610422]: 2023-12-07T12:32:46.463062Z DEBUG vector::topology::builder: Building new transform. component=XXXXXX
Dec 07 12:32:46 vector-01 vector[610422]: 2023-12-07T12:32:46.474103Z DEBUG vector::topology::builder: Building new sink. component=XXXXXX
Dec 07 12:32:46 vector-01 vector[610422]: √ Component configuration
Dec 07 12:32:46 vector-01 vector[610422]: 2023-12-07T12:32:46.479222Z  INFO vector::topology::builder: Healthcheck passed.
Dec 07 12:32:46 vector-01 vector[610422]: √ Health check "XXXXXX"
Dec 07 12:34:16 vector-01 systemd[1]: vector.service: start-pre operation timed out. Terminating.
Dec 07 12:34:16 vector-01 vector[610422]: 2023-12-07T12:34:16.484450Z  INFO vector::signal: Signal received. signal="SIGTERM"
Dec 07 12:35:00 vector-01 systemd[1]: vector.service: Unit cannot be reloaded because it is inactive.

Configuration

/lib/systemd/system/vector.service

[Unit]
Description=Vector
Documentation=https://vector.dev
After=network-online.target
Requires=network-online.target

[Service]
User=vector
Group=vector
ExecStartPre=/usr/bin/vector validate
ExecStart=/usr/bin/vector
ExecReload=/usr/bin/vector validate
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
AmbientCapabilities=CAP_NET_BIND_SERVICE
EnvironmentFile=-/etc/default/vector
# Since systemd 229, should be in [Unit] but in order to support systemd <229,
# it is also supported to have it here.
StartLimitInterval=10
StartLimitBurst=5
[Install]
WantedBy=multi-user.target

-------------------------------
/etc/default/vector

VECTOR_CONFIG_DIR=/etc/vector/conf.d
VECTOR_LOG=debug

-------------------------------
ll /etc/vector/
total 20
drwxr-xr-x  5 root root 4096 Dec  7 12:35 ./
drwxr-xr-x 99 root root 4096 Dec  7 12:35 ../
drwxr-xr-x  2 root root 4096 Nov 17 09:59 conf.d/
drwxr-xr-x  2 root root 4096 Dec  7 12:35 examples/
drwxr-xr-x  2 root root 4096 Nov 17 10:01 vrl/

Version

vector 0.33.1 (x86_64-unknown-linux-gnu 3cc27b9 2023-10-30 16:50:49.747931844)

Debug Output

2023-12-07T12:43:26.802362Z DEBUG vector::app: Internal log rate limit configured. internal_log_rate_secs=10
2023-12-07T12:43:26.802410Z  INFO vector::app: Log level is enabled. level="vector=trace,codec=trace,vrl=trace,file_source=trace,tower_limit=trace,rdkafka=trace,buffers=trace,lapin=trace,kube=trace"
2023-12-07T12:43:26.802515Z DEBUG vector::app: messaged="Building runtime." worker_threads=48
2023-1207T12:43:26.806507Z  WARN vector::config::loading: The "/etc/vector/vector.toml" config path does not exist. Vector will attempt to use new default "/etc/vector/vector.yaml" instead.
2023-12-07T12:43:26.806579Z ERROR vector::config::loading: Config file not found in path. path="/etc/vector/vector.yaml"

Example Data

No response

Additional Context

Vector machine is an LXC host.

References

No response

pront commented 11 months ago

Hi @LinoVieira13,

For vector 0.34.1 I cannot reproduce what you describe:

hoot:~$ which vector
/usr/bin/vector

host:~$ vector --version
vector 0.34.1 (x86_64-unknown-linux-gnu 86f1c22 2023-11-16 14:59:10.486846964)

host:~$ VECTOR_CONFIG_DIR=~/vector-config-dir/ vector
2023-12-08T10:21:15.402659Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-12-08T10:21:15.404428Z  INFO vector::app: Loading configs. paths=["/home/bits/vector-config-dir"]
2023-12-08T10:21:15.407402Z  INFO vector::topology::running: Running healthchecks.
2023-12-08T10:21:15.407471Z  INFO vector: Vector has started. debug="false" version="0.34.1" arch="x86_64" revision="86f1c22 2023-11-16 14:59:10.486846964"
2023-12-08T10:21:15.407487Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2023-12-08T10:21:15.407646Z  INFO vector::topology::builder: Healthcheck passed.
{"appname":"AnthraX","facility":"local2","hostname":"up.cn","message":"Take a breath, let it go, walk away","msgid":"ID52","procid":1840,"severity":"crit","timestamp":"2023-12-08T10:21:15.408Z","version":1}
{"appname":"b0rnc0nfused","facility":"alert","hostname":"some.xn--fzys8d69uvgm","message":"Great Scott! We're never gonna reach 88 mph with the flux capacitor in its current state!","msgid":"ID717","procid":5058,"severity":"crit","timestamp":"2023-12-08T10:21:16.408Z","version":2}
{"appname":"KarimMove","facility":"user","hostname":"names.ls","message":"You're not gonna believe what just happened","msgid":"ID94","procid":5582,"severity":"debug","timestamp":"2023-12-08T10:21:17.408Z","version":1}
^C2023-12-08T10:21:17.796252Z  INFO vector::signal: Signal received. signal="SIGINT"
2023-12-08T10:21:17.796317Z  INFO vector: Vector has stopped.
2023-12-08T10:21:17.797442Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="print, parse_logs, dummy_logs" time_remaining="59 seconds left"
{"appname":"jesseddy","facility":"local0","hostname":"we.associates","message":"Maybe we just shouldn't use computers","msgid":"ID559","procid":8294,"severity":"err","timestamp":"2023-12-08T10:21:18.408Z","version":2}

host:~$ VECTOR_CONFIG_DIR=~/vector-config-dir/ vector validate
√ Loaded ["/home/bits/vector-config-dir"]
x Could not create subdirectory "validate_tmp" inside of data dir "/var/lib/vector/": Permission denied (os error 13)

host:~$ sudo VECTOR_CONFIG_DIR=~/vector-config-dir/ vector validate
√ Loaded ["/home/bits/vector-config-dir"]
√ Component configuration
√ Health check "print"
-----------------------------------------
                                Validated

Is the Debug Output you shared above coming from vector 0.33.1 or vector 0.34.1?

LinoVieira13 commented 11 months ago

Hello @pront, The Debug output I shared is coming from vector 0.34.1. However, if I try the same in vector 0.33.1 the only difference in the debug output is the line 2023-1207T12:43:26.806507Z WARN vector::config::loading: The "/etc/vector/vector.toml" config path does not exist. Vector will attempt to use new default "/etc/vector/vector.yaml" instead. the rest stays the same.

fisheax commented 11 months ago

I had met the similar situation. The same point is that vector validate ... does not quit if without --no-environment in my case. But it doesn't appear all the time. When it occurs, I can see that the health checks are all passed but stuck without output the last line --- \n Validated. Also, I noticed that the kafka connections still being kept (I'm using the kafka as both source and sink, maybe using kafka-sink would cause the problem more often). Btw, I cannot use ctrl+c to quit the process. The same configurations are worked in the v0.32.1

pront commented 11 months ago

@LinoVieira13 in order to help us debug this:

r3code commented 7 months ago

I met the same bug. I tried to upgrade from 0.24.1 to 0.35.0 (aslo for latest 0.37) and it failes to start at PreStart stage. I tried with one vector.toml and one source + one sink + one file with 5-7 trfansforms and it succeeds. When I add more files and more sinks (about 50) validate stucks after √ Component configuration ~ Health checks are disabled

Started it with VECTOR_LOG=DEBUG with strace and it also stucks with my huge config, Ctrl+C can't stop it but it shows log messages INFO vector::signal: Signal received. signal="SIGINT". I only can exit with Ctrl+Z and then kill the proc by kill -9 vector_validate_pid.

For same set of transforms run 'vector validate -C config_dir'

| Tried version | Result |
| 0.32.0-1        | OK      |
| 0.33.1-1        | OK      |
| 0.34.0-1        | FAIL    |
| 0.34.1-1        | FAIL    |
| 0.34.2-1        | FAIL    |
| 0.35.0-1        | FAIL    |
| 0.35.1-1        | FAIL    |
| 0.36.0-1        | FAIL    |
| 0.36.1-1        | FAIL    |
| 0.37.0-1        | FAIL    |
r3code commented 7 months ago

What I see with strace of vector validate without --no-environment:

openat(AT_FDCWD, "/usr/lib/ssl/certs/ca-certificates.crt", O_RDONLY) = 33
fstat(33, {st_mode=S_IFREG|0644, st_size=208567, ...}) = 0
read(33, "-----BEGIN CERTIFICATE-----\nMIIH"..., 4096) = 4096
...<excluded>
read(33, "", 4096)                      = 0
close(33)                               = 0
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
futex(0x7f05649fbd98, FUTEX_WAKE_PRIVATE, 1) = 1
write(1, "\33[32m\342\210\232\33[0m Component configura"..., 37) = 37
write(1, "\33[33m~\33[0m Health checks are dis"..., 38) = 38
futex(0x7f05695ebc50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f05695ebc00, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f0569460028, FUTEX_WAIT_BITSET_PRIVATE, 4294967295, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) 
// ^^^^ it stucks here
//
// -- NOW 2 attempts to stop with Ctrl+C, then terminate by Ctrl-Z
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(7, "\1", 1)                       = 1
rt_sigreturn({mask=[]})                 = 202
futex(0x7f0569460028, FUTEX_WAIT_BITSET_PRIVATE, 4294967295, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(7, "\1", 1)                       = 1
rt_sigreturn({mask=[]})                 = 202
futex(0x7f0569460028, FUTEX_WAIT_BITSET_PRIVATE, 4294967295, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=200774, si_uid=1027} ---
futex(0x7f0569460028, FUTEX_WAIT_BITSET_PRIVATE, 4294967295, NULL, FUTEX_BITSET_MATCH_ANY) = ?
+++ killed by SIGKILL +++
jszwedko commented 7 months ago

Can someone experiencing this failure share the Vector configuration they are trying to load and exact command they are running?

r3code commented 7 months ago

Forgot to mention - I run vector under Ubuntu 20.04.04 LTS x64. Vector installed by apt, versions are x86_64-unknown-linux-gnu

r3code commented 7 months ago

Can't share my config files right now. I'll try to split and test it, i think It might be an issue due to the files count, or a topolgy issue

Skunnyk commented 7 months ago

Anybody with the problem on stuck validate, can you try to add a delay on the network interface following https://github.com/vectordotdev/vector/issues/20367#issuecomment-2075069444 and run validate again ? :exploding_head: