Closed: leandrojmp closed this issue 1 year ago
Community Note
* Please vote on this issue by adding a 👍 [reaction](https://blog.github.com/2016-03-10-add-reactions-to-pull-requests-issues-and-comments/) to the original issue to help the community and maintainers prioritize this request
* If you are interested in working on this issue or have submitted a pull request, please leave a comment
Vector Version
vector 0.19.0 (x86_64-unknown-linux-gnu da60b55 2021-12-28)
Vector Configuration File
[sources.http_REDACTED]
data_dir = "/var/opt/vector/dbs"
type = "file"
include = ["/var/opt/logs-http-cf/json/REDACTED/*/*/*.json"]
read_from = "beginning"
ignore_older_secs = 43200
remove_after_secs = 1
oldest_first = true
ignore_checkpoints = true
fingerprint.strategy = "device_and_inode"

[sinks.kafka_REDACTED]
type = "kafka"
inputs = [ "http_REDACTED" ]
bootstrap_servers = "kafka-brokers-list"
message_timeout_ms = 300_000
socket_timeout_ms = 60_000
topic = "REDACTED"
compression = "gzip"
encoding.codec = "text"
buffer.max_events = 1000
Debug Output
Expected Behavior
Vector periodically releasing memory
Actual Behavior
Vector memory usage increases over time
Additional Context
Hello, we are migrating some simple data pipelines from Logstash to Vector. These pipelines just read JSON files and send the events in them to a Kafka topic; the files are Cloudflare HTTP request and firewall event logs, so we have somewhere around 15 to 20 pipelines.
At the moment we have migrated two of those pipelines using the configuration shared above, with one .toml file per source, and it is working as expected. The only issue we have found is that the memory usage of the vector process increases over time, and if we let the service run for a couple of days it will eventually consume all the memory on the server. The logs are collected by custom Python scripts that only download the JSON files and put them in a folder for Vector to consume; the scripts are called from crontab every minute, and Vector runs as a systemd service.
Vector is the only service (besides the system services) running on this server, and the load and CPU usage are pretty low. The server runs on GCP and has 8 vCPUs and 8 GB of RAM (7.63 GB usable). With Vector stopped the memory usage is around 900 MB; when we start Vector the memory starts increasing and is only released if we restart the Vector service. We tried a reload with kill -1 PID, but it didn't have any effect.
References
Did you try running Vector with the -qqq CLI argument? I have a strong feeling this problem is in the internal log level.
I'm running it as a service with the following configuration to suppress the INFO logs and only get WARN or higher:
ExecStart=/usr/bin/vector -c /etc/vector/pipelines/*.toml -q
I will change it to -qq to only get ERROR logs and see if something changes. Would -qqq suppress everything, or just leave the FATAL level?
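For reference, a sketch of how the quiet flag is commonly understood to behave, stepping the internal log level down one level per repetition from the INFO default (worth confirming against vector --help for the installed version):

/usr/bin/vector -c /etc/vector/pipelines/*.toml -q    # WARN and above
/usr/bin/vector -c /etc/vector/pipelines/*.toml -qq   # ERROR and above
/usr/bin/vector -c /etc/vector/pipelines/*.toml -qqq  # effectively silences internal logging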
Hi @leandrojmp !
I'll try to reproduce this. Could you provide some more details about the input? The number of files that typically match the glob pattern, the average line length in those files, and an example line?
You mentioned running multiple pipelines. For the memory growth you are seeing, is the above config the only thing a given Vector instance is running? Or do you have a single Vector running multiple of these pipelines? If the latter, could you provide a full sample config?
I just want to try, as closely as possible, to match your environment.
Also, would you be open to running Vector under valgrind if we provided a debug build? That would probably help narrow down where the allocations are going pretty quickly. Are you running Vector natively on the host, or in one of the Docker images?
Hello @jszwedko !
Let me try to give more context about this data.
The source of this data is logs of Cloudflare HTTP requests, which Cloudflare sends to buckets in a cloud service. Those files are downloaded to the server, a CentOS 8 VM, by a custom Python script scheduled in the crontab; the glob mirrors the structure of the buckets.
I have to collect logs from multiple companies, and each company can have multiple domains; the structure of the paths used in the glob looks like the following:
/var/opt/logs-http-cf/json/companyA/domainA/YYYYMMdd/*.json
/var/opt/logs-http-cf/json/companyA/domainB/YYYYMMdd/*.json
/var/opt/logs-http-cf/json/companyA/domainC/YYYYMMdd/*.json
/var/opt/logs-http-cf/json/companyB/domainX/YYYYMMdd/*.json
/var/opt/logs-http-cf/json/companyB/domainY/YYYYMMdd/*.json
Since I'm coming from a Logstash background, I tried to replicate the config file organization I had.
In Logstash I used the pipelines.yml to configure one pipeline for each company. In this example, the glob in the configuration for companyA would be /var/opt/logs-http-cf/json/companyA/*/*/*.json and for companyB it would be /var/opt/logs-http-cf/json/companyB/*/*/*.json. I'm following the same approach with the Vector configurations.
And since Vector has nothing like the pipelines.yml used by Logstash, to keep one configuration per company I've created separate .toml files, one for each company, changing only the source and sink names and the glob, as in the example below:
companyA.toml
[sources.http_companyA]
data_dir = "/var/opt/vector/dbs"
type = "file"
include = ["/var/opt/logs-http-cf/json/companyA/*/*/*.json"]
read_from = "beginning"
ignore_older_secs = 43200
remove_after_secs = 1
oldest_first = true
ignore_checkpoints = true
fingerprint.strategy = "device_and_inode"
[sinks.kafka_companyA]
type = "kafka"
inputs = [ "http_companyA" ]
bootstrap_servers = "kafka-brokers-list"
message_timeout_ms = 300_000
socket_timeout_ms = 60_000
topic = "http-companyA"
compression = "gzip"
encoding.codec = "text"
buffer.max_events = 1000
The companyB.toml would be the same, just changing every companyA to companyB.
I'm running just one Vector instance, as a systemd service, with the following ExecStart:
ExecStart=/usr/bin/vector -c /etc/vector/pipelines/*.toml -qq
Inside /etc/vector/pipelines/ I have companyA.toml and companyB.toml. Vector is also running the dummy configuration in /etc/vector/vector.toml; I tried to remove it, but Vector does not start without it.
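For context, a minimal sketch of a unit file matching that ExecStart (everything besides the ExecStart line is an assumption, not the actual unit shipped in the RPM):

[Unit]
Description=Vector
After=network-online.target

[Service]
User=vector
ExecStart=/usr/bin/vector -c /etc/vector/pipelines/*.toml -qq
ExecReload=/bin/kill -HUP $MAINPID
Restart=no

[Install]
WantedBy=multi-user.target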
The Python script runs every minute and downloads an average of ~500 files with a combined size of around ~200 MB; the files have one JSON document per line. The average line size is 2.5 KB, and this is an example of what the lines look like:
{"BotScore":12,"BotScoreSrc":"Machine Learning","BotTags":["api"],"CacheCacheStatus":"dynamic","CacheResponseBytes":5439,"CacheResponseStatus":200,"CacheTieredFill":false,"ClientASN":"REDACTED","ClientCountry":"br","ClientDeviceType":"tablet","ClientIP":"REDACTED","ClientIPClass":"noRecord","ClientMTLSAuthCertFingerprint":"","ClientMTLSAuthStatus":"unknown","ClientRequestBytes":5753,"ClientRequestHost":"REDACTED","ClientRequestMethod":"GET","ClientRequestPath":"REDACTED","ClientRequestProtocol":"HTTP/2","ClientRequestReferer":"","ClientRequestScheme":"https","ClientRequestSource":"eyeball","ClientRequestURI":"REDACTED","ClientRequestUserAgent":"REDACTED","ClientSSLCipher":"AEAD-AES128-GCM-SHA256","ClientSSLProtocol":"TLSv1.3","ClientSrcPort":39008,"ClientTCPRTTMs":24,"ClientXRequestedWith":"","EdgeCFConnectingO2O":false,"EdgeColoCode":"SSA","EdgeColoID":416,"EdgeEndTimestamp":"2022-01-26T23:59:42Z","EdgePathingOp":"wl","EdgePathingSrc":"macro","EdgePathingStatus":"nr","EdgeRateLimitAction":"","EdgeRateLimitID":0,"EdgeRequestHost":"REDACTED","EdgeResponseBodyBytes":805,"EdgeResponseBytes":1854,"EdgeResponseCompressionRatio":4.63,"EdgeResponseContentType":"application/json","EdgeResponseStatus":200,"EdgeServerIP":"REDACTED","EdgeStartTimestamp":"2022-01-26T23:59:42Z","EdgeTimeToFirstByteMs":210,"FirewallMatchesActions":[],"FirewallMatchesRuleIDs":[],"FirewallMatchesSources":[],"OriginDNSResponseTimeMs":0,"OriginIP":"REDACTED","OriginRequestHeaderSendDurationMs":0,"OriginResponseBytes":0,"OriginResponseDurationMs":201,"OriginResponseHTTPExpires":"","OriginResponseHTTPLastModified":"","OriginResponseHeaderReceiveDurationMs":201,"OriginResponseStatus":200,"OriginResponseTime":201000000,"OriginSSLProtocol":"TLSv1.2","OriginTCPHandshakeDurationMs":0,"OriginTLSHandshakeDurationMs":0,"ParentRayID":"00","RayID":"REDACTED","SecurityLevel":"med","SmartRouteColoID":0,"UpperTierColoID":0,"WAFAction":"unknown","WAFFlags":"0","WAFMatchedVar":"","WAFProfile":"unknown","WAFRuleID":"","WAFRuleMessage":"","WorkerCPUTime":0,"WorkerStatus":"unknown","WorkerSubrequest":false,"WorkerSubrequestCount":0,"ZoneID":"REDACTED","ZoneName":"REDACTEDDOMAIN"}
If you need more information about the format of the documents, it can be found in the cloudflare documentation.
I'm not doing any parsing with Vector, just reading the lines and sending them to Kafka; the parsing is still done by Logstash consuming from Kafka.
At the moment I have a total of 40 different pipelines: 20 for HTTP requests and 20 for firewall events, all with a similar document format and size. Just 2 of those pipelines are running on Vector; the rest are still in Logstash.
I'm planning to keep the same organization I have with Logstash, one configuration file per pipeline, so there would be at least 40 .toml files in the config path after I migrate everything. (Does this make any difference to Vector?)
After Vector reads the files and sends the content to Kafka, it deletes the files from disk.
The processing of the files is really fast. When I migrated those first two pipelines from Logstash to Vector, I had a backlog of around 50k files matching the globs, and Vector had no issue processing them and no sudden memory increase; the only issue at the moment is that the memory keeps increasing over time.
To run Vector under valgrind I would need to check with the rest of the security team whether we can do this test. Also, I don't know if it would help, but I could generate strace output for the Vector process.
I'm running Vector directly on the host; it was installed from the x86_64 RPM package, and I just updated to version 0.19.1.
I could also migrate more of my pipelines from Logstash to Vector to see if that has any influence on the memory increase.
Hope the explanation helps!
Hi @leandrojmp !
Thanks for the additional details! I talked with a couple of team members, who suggested gathering more precise memory measurements and seeing how Vector behaves when its memory is capped with cgroups.
Would these two screenshots help?
[screenshot: top output; times in UTC]
[screenshot: memory monitoring graph; times in UTC-3]
The drop in the graph is when I restarted Vector after updating to 0.19.1. As you can see, both lines slowly increase over time.
I could test running with cgroups later; I think I will let it run for the next couple of days to see what happens when it approaches the memory limit.
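For anyone wanting to try the cgroup route, a minimal sketch using a systemd drop-in (the file path and the 4G limit are assumptions; on older cgroup-v1 systemd versions the knob is MemoryLimit= rather than MemoryMax=):

# /etc/systemd/system/vector.service.d/memory.conf
[Service]
MemoryMax=4G

followed by systemctl daemon-reload and systemctl restart vector to apply it.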
Just an update: I'm letting Vector run to see what happens. The memory still shows the same behaviour, increasing in small steps; right now it is at 70%.
Yeah, it looks like a memory leak; the vector service got killed today. It seems that vector killed itself rather than the kernel killing it, but I may be wrong, @jszwedko. Below are the system logs:
Feb 11 22:19:35 REDACTED-SERVER kernel: vector-worker invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 11 22:19:35 REDACTED-SERVER kernel: vector-worker cpuset=/ mems_allowed=0
Feb 11 22:19:35 REDACTED-SERVER kernel: CPU: 2 PID: 29484 Comm: vector-worker Tainted: G ------------ T 3.10.0-1160.49.1.el7.x86_64 #1
Feb 11 22:19:35 REDACTED-SERVER kernel: Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Feb 11 22:19:35 REDACTED-SERVER kernel: Call Trace:
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa5383539>] dump_stack+0x19/0x1b
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa537e5d8>] dump_header+0x90/0x229
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4f0f1db>] ? cred_has_capability+0x6b/0x120
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4dc254d>] oom_kill_process+0x2cd/0x490
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4f0f2be>] ? selinux_capable+0x2e/0x40
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4dc2c3a>] out_of_memory+0x31a/0x500
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4dc9854>] __alloc_pages_nodemask+0xad4/0xbe0
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4e193b8>] alloc_pages_current+0x98/0x110
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4dbe007>] __page_cache_alloc+0x97/0xb0
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4dc0fa0>] filemap_fault+0x270/0x420
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffc035591e>] __xfs_filemap_fault+0x7e/0x1d0 [xfs]
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffc0355b1c>] xfs_filemap_fault+0x2c/0x30 [xfs]
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4dee7aa>] __do_fault.isra.61+0x8a/0x100
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4deed5c>] do_read_fault.isra.63+0x4c/0x1b0
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4df65a0>] handle_mm_fault+0xa20/0xfb0
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa5390653>] __do_page_fault+0x213/0x500
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa5390975>] do_page_fault+0x35/0x90
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa538c778>] page_fault+0x28/0x30
Feb 11 22:19:35 REDACTED-SERVER kernel: Mem-Info:
Feb 11 22:19:35 REDACTED-SERVER kernel: active_anon:1843735 inactive_anon:35065 isolated_anon:0#012 active_file:3224 inactive_file:6913 isolated_file:0#012 unevictable:0 dirty:0 writeback:0 unstable:0#012 slab_reclaimable:6052 slab_unreclaimable:10522#012 mapped:3590 shmem:102725 pagetables:5837 bounce:0#012 free:26329 free_pcp:2235 free_cma:0
Feb 11 22:19:35 REDACTED-SERVER kernel: Node 0 DMA free:15836kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15920kB managed:15836kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb 11 22:19:35 REDACTED-SERVER kernel: lowmem_reserve[]: 0 2827 7801 7801
Feb 11 22:19:35 REDACTED-SERVER kernel: Node 0 DMA32 free:44340kB min:24456kB low:30568kB high:36684kB active_anon:2659560kB inactive_anon:55296kB active_file:7580kB inactive_file:19244kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3126080kB managed:2895640kB mlocked:0kB dirty:0kB writeback:0kB mapped:2648kB shmem:166796kB slab_reclaimable:7148kB slab_unreclaimable:13436kB kernel_stack:800kB pagetables:6384kB unstable:0kB bounce:0kB free_pcp:4220kB local_pcp:616kB free_cma:0kB writeback_tmp:0kB pages_scanned:7 all_unreclaimable? no
Feb 11 22:19:36 REDACTED-SERVER kernel: lowmem_reserve[]: 0 0 4973 4973
Feb 11 22:19:36 REDACTED-SERVER kernel: Node 0 Normal free:46752kB min:42992kB low:53740kB high:64488kB active_anon:4715380kB inactive_anon:84964kB active_file:5292kB inactive_file:7684kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5242880kB managed:5093352kB mlocked:0kB dirty:0kB writeback:0kB mapped:11644kB shmem:244104kB slab_reclaimable:17060kB slab_unreclaimable:28652kB kernel_stack:3184kB pagetables:16964kB unstable:0kB bounce:0kB free_pcp:4236kB local_pcp:668kB free_cma:0kB writeback_tmp:0kB pages_scanned:20763 all_unreclaimable? yes
Feb 11 22:19:36 REDACTED-SERVER kernel: lowmem_reserve[]: 0 0 0 0
Feb 11 22:19:36 REDACTED-SERVER kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 2*32kB (U) 2*64kB (U) 2*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15836kB
Feb 11 22:19:36 REDACTED-SERVER kernel: Node 0 DMA32: 176*4kB (UEM) 319*8kB (UEM) 681*16kB (UE) 271*32kB (UEM) 181*64kB (UEM) 70*128kB (UEM) 5*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44648kB
Feb 11 22:19:36 REDACTED-SERVER kernel: Node 0 Normal: 1475*4kB (UEM) 624*8kB (UEM) 600*16kB (UEM) 440*32kB (UEM) 123*64kB (UEM) 33*128kB (UEM) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46924kB
Feb 11 22:19:36 REDACTED-SERVER kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Feb 11 22:19:36 REDACTED-SERVER kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Feb 11 22:19:36 REDACTED-SERVER kernel: 112873 total pagecache pages
Feb 11 22:19:36 REDACTED-SERVER kernel: 0 pages in swap cache
Feb 11 22:19:36 REDACTED-SERVER kernel: Swap cache stats: add 0, delete 0, find 0/0
Feb 11 22:19:36 REDACTED-SERVER kernel: Free swap = 0kB
Feb 11 22:19:36 REDACTED-SERVER kernel: Total swap = 0kB
Feb 11 22:19:36 REDACTED-SERVER kernel: 2096220 pages RAM
Feb 11 22:19:36 REDACTED-SERVER kernel: 0 pages HighMem/MovableOnly
Feb 11 22:19:36 REDACTED-SERVER kernel: 95013 pages reserved
Feb 11 22:19:36 REDACTED-SERVER kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 354] 0 354 13861 4277 32 0 0 systemd-journal
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 381] 0 381 12058 471 25 0 -1000 systemd-udevd
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 476] 0 476 13883 112 27 0 -1000 auditd
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 508] 81 508 16612 182 33 0 -900 dbus-daemon
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 513] 0 513 1097 35 8 0 0 acpid
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 515] 999 515 153251 2553 64 0 0 polkitd
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 526] 998 526 24067 77 18 0 0 chronyd
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 533] 0 533 27551 32 13 0 0 agetty
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 535] 0 535 27551 33 10 0 0 agetty
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 567] 0 567 89683 5616 98 0 0 firewalld
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 570] 0 570 137007 532 84 0 0 NetworkManager
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 695] 0 695 25747 533 55 0 0 dhclient
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 923] 0 923 143570 2829 99 0 0 tuned
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 925] 0 925 28953 3462 17 0 -999 google_guest_ag
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 926] 0 926 185420 168565 342 0 0 google_osconfig
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 927] 0 927 141929 3633 158 0 0 rsyslogd
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 1133] 0 1133 28246 257 56 0 -1000 sshd
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 1176] 0 1176 6596 84 18 0 0 systemd-logind
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 1228] 0 1228 31605 171 19 0 0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 1344] 0 1344 22447 259 43 0 0 master
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 1364] 89 1364 22517 267 45 0 0 qmgr
Feb 11 22:19:36 REDACTED-SERVER kernel: [16769] 0 16769 226034 287 27 0 0 newrelic-infra-
Feb 11 22:19:36 REDACTED-SERVER kernel: [16777] 0 16777 361476 4475 63 0 0 newrelic-infra
Feb 11 22:19:36 REDACTED-SERVER kernel: [29482] 997 29482 2708265 1526708 3936 0 0 vector
Feb 11 22:19:36 REDACTED-SERVER kernel: [21550] 89 21550 22473 252 45 0 0 pickup
Feb 11 22:19:36 REDACTED-SERVER kernel: [28110] 0 28110 44081 233 42 0 0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [28112] 0 28112 28320 54 13 0 0 run-parts
Feb 11 22:19:36 REDACTED-SERVER kernel: [28124] 0 28124 412099 55771 249 0 0 yum-cron
Feb 11 22:19:36 REDACTED-SERVER kernel: [28125] 0 28125 28410 37 13 0 0 awk
Feb 11 22:19:36 REDACTED-SERVER kernel: [29095] 0 29095 32665 175 20 0 0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29096] 0 29096 32665 175 20 0 0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29097] 0 29097 32665 175 20 0 0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29102] 0 29102 32665 178 20 0 0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29103] 0 29103 32665 178 20 0 0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29105] 0 29105 32665 175 20 0 0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29109] 0 29109 32122 179 18 0 0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29110] 0 29110 32122 179 18 0 0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: Out of memory: Kill process 29482 (vector) score 764 or sacrifice child
Feb 11 22:19:36 REDACTED-SERVER kernel: Killed process 29482 (vector), UID 997, total-vm:10833060kB, anon-rss:6106832kB, file-rss:0kB, shmem-rss:0kB
Feb 11 22:19:36 REDACTED-SERVER dhclient[695]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER dhclient[695]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER systemd: vector.service: main process exited, code=killed, status=9/KILL
Feb 11 22:19:37 REDACTED-SERVER dbus[508]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Feb 11 22:19:37 REDACTED-SERVER systemd: Unit vector.service entered failed state.
Feb 11 22:19:37 REDACTED-SERVER dhclient[695]: bound to REDACTED -- renewal in 1045 seconds.
Feb 11 22:19:37 REDACTED-SERVER systemd: vector.service failed.
I have now updated to version 0.20 and will monitor whether it shows the same behavior.
Version 0.20.0 has the same behavior: the memory keeps increasing in small amounts until it is full and Vector is killed with an out-of-memory error. The workaround at the moment is to restart the vector service daily with a cron job.
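A sketch of that workaround as a crontab entry (the 05:00 restart time is arbitrary):

# restart vector daily to release the leaked memory
0 5 * * * /usr/bin/systemctl restart vector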
@jszwedko Is there any other kind of information I could gather to help track this issue?
Hi @leandrojmp ,
I think this might actually be the same issue as #11995. Do you observe internal_metrics_cardinality_total monotonically increasing?
Hello,
How do I check that? I'm not monitoring the vector process. Is there a command to get stats from a running Vector process?
Hey! To collect that metric, you will need to add the internal_metrics source and then feed it to a metrics sink like prometheus_exporter (which you could then curl to get the metrics).
That is an interesting idea, though: allowing that to be interrogated without wiring up the internal_metrics source. We could enable it through the GraphQL API that already exists.
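For reference, a minimal sketch of that wiring (the component names and the listen address are assumptions):

[sources.vector_metrics]
type = "internal_metrics"

[sinks.prometheus]
type = "prometheus_exporter"
inputs = [ "vector_metrics" ]
address = "127.0.0.1:9598"

after which something like curl -s http://127.0.0.1:9598/metrics | grep cardinality should show the counter.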
Hello,
I created the following pipeline to get the metrics:
[sources.vector_metrics]
type = "internal_metrics"
scrape_interval_secs = 10
[sinks.kafka_metrics]
type = "kafka"
inputs = [ "vector_metrics" ]
bootstrap_servers = "kafka-brokers"
message_timeout_ms = 300_000
socket_timeout_ms = 60_000
topic = "vector-metrics"
compression = "gzip"
encoding.codec = "text"
buffer.max_events = 1000
Consuming the data from Kafka, I saw that the internal_metrics_cardinality_total value differs per source and is always increasing.
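For anyone reproducing this, a hedged one-liner to watch that counter from the topic above (assumes the stock Kafka console consumer is available; the broker and topic names match the config):

kafka-console-consumer --bootstrap-server kafka-brokers --topic vector-metrics | grep internal_metrics_cardinality_total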
Hi @leandrojmp ,
Gotcha, this does sound like it is the same issue as https://github.com/vectordotdev/vector/issues/11995 then, thanks for verifying! We plan to address that issue in the coming quarter so it should resolve this too.
I'll close this issue since we are tracking in https://github.com/vectordotdev/vector/issues/11995 . Please follow along there.
Reopening based on https://github.com/vectordotdev/vector/issues/14789#issuecomment-1292347011
We believe this to be closed by https://github.com/vectordotdev/vector/pull/18634
@jszwedko the next release, 0.33, right?
I've just updated to 0.32 but still haven't checked the memory; we have a restart script in cron, but I will wait for 0.33 to confirm it.
Ah, yes, that is correct. The expected fix will be included in v0.33.0 which is slated for this week.