vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
17.63k stars 1.55k forks source link

Possible Memory Leak with File source and Kafka sink #11025

Closed leandrojmp closed 1 year ago

leandrojmp commented 2 years ago

Community Note

Vector Version

vector 0.19.0 (x86_64-unknown-linux-gnu da60b55 2021-12-28)

Vector Configuration File

[sources.http_REDACTED]
data_dir = "/var/opt/vector/dbs"
type = "file"
include = ["/var/opt/logs-http-cf/json/REDACTED/*/*/*.json"]
read_from = "beginning"
ignore_older_secs = 43200
remove_after_secs = 1
oldest_first = true
ignore_checkpoints = true
fingerprint.strategy = "device_and_inode"

[sinks.kafka_REDACTED]
type = "kafka"
inputs = [ "http_REDACTED" ]
bootstrap_servers = "kafka-brokers-list"
message_timeout_ms = 300_000
socket_timeout_ms = 60_000
topic = "REDACTED"
compression = "gzip"
encoding.codec = "text"
buffer.max_events = 1000

Debug Output

Expected Behavior

Vector periodically releasing memory

Actual Behavior

Vector memory usage increases overtime

Additional Context

Hello, we are migrating some simple data pipelines from logstash to vector, these pipelines just reads json files and send the events in the json to a kafka topic, the files are cloudflare http requests and firewall events logs, so we have something around 15 to 20 pipelines.

At the moment we migrated two of those pipelines using the configuration shared before, we are using one .toml for each source, and it is working as expected, the only issue we find is that the memory usage of the vector process is increasing with the time and if we let the service running for a couple of days it will eventually consume all the memory on the server.

The logs are collected using custom python scripts that only download the json files and put them on a folder for vector to consume, the python scripts are called using crontab each minute, vector runs as a systemd process.

Vector is the only service (besides the systems services) running on this server, the load and cpu usage is pretty low, the server runs on GCP and have 8 vCPUs and 8 GB (7.63 GB), with vector stopped the memory usage is around 900 MB, when we start vector the memory starts increasing and will only be released if we restart the vector service, we tried reload with kill -1 PID, but it didn't have any effect.

vector-leak

References

3JIou-home commented 2 years ago

Community Note

* Please vote on this issue by adding a πŸ‘ [reaction](https://blog.github.com/2016-03-10-add-reactions-to-pull-requests-issues-and-comments/) to the original issue to help the community and maintainers prioritize this request

* If you are interested in working on this issue or have submitted a pull request, please leave a comment

Vector Version

vector 0.19.0 (x86_64-unknown-linux-gnu da60b55 2021-12-28)

Vector Configuration File

[sources.http_REDACTED]
data_dir = "/var/opt/vector/dbs"
type = "file"
include = ["/var/opt/logs-http-cf/json/REDACTED/*/*/*.json"]
read_from = "beginning"
ignore_older_secs = 43200
remove_after_secs = 1
oldest_first = true
ignore_checkpoints = true
fingerprint.strategy = "device_and_inode"

[sinks.kafka_REDACTED]
type = "kafka"
inputs = [ "http_REDACTED" ]
bootstrap_servers = "kafka-brokers-list"
message_timeout_ms = 300_000
socket_timeout_ms = 60_000
topic = "REDACTED"
compression = "gzip"
encoding.codec = "text"
buffer.max_events = 1000

Debug Output

Expected Behavior

Vector periodically releasing memory

Actual Behavior

Vector memory usage increases overtime

Additional Context

Hello, we are migrating some simple data pipelines from logstash to vector, these pipelines just reads json files and send the events in the json to a kafka topic, the files are cloudflare http requests and firewall events logs, so we have something around 15 to 20 pipelines.

At the moment we migrated two of those pipelines using the configuration shared before, we are using one .toml for each source, and it is working as expected, the only issue we find is that the memory usage of the vector process is increasing with the time and if we let the service running for a couple of days it will eventually consume all the memory on the server.

The logs are collected using custom python scripts that only download the json files and put them on a folder for vector to consume, the python scripts are called using crontab each minute, vector runs as a systemd process.

Vector is the only service (besides the systems services) running on this server, the load and cpu usage is pretty low, the server runs on GCP and have 8 vCPUs and 8 GB (7.63 GB), with vector stopped the memory usage is around 900 MB, when we start vector the memory starts increasing and will only be released if we restart the vector service, we tried reload with kill -1 PID, but it didn't have any effect.

vector-leak

References

did u try run vector with -qqq cli arguments? i have a strong feeling that this problem into internal log level

leandrojmp commented 2 years ago

did u try run vector with -qqq cli arguments? i have a strong feeling that this problem into internal log level

I'm running as a service with the following configuration to suppress the INFO logs and only get from WARN or higher

ExecStart=/usr/bin/vector -c /etc/vector/pipelines/*.toml -q

I will change to -qq to only get ERRORS to see if something changes.

Would -qqq suppress everything or just log the FATAL level?

jszwedko commented 2 years ago

Hi @leandrojmp !

I'll try to reproduce this. Could you provide some more details about input? Number of files that typically match the glob pattern, average line length in those files, example line content?

You mentioned running multiple pipelines. For the memory growth you are seeing, is the above config the only thing a given Vector instance is running? Or do you have a single Vector running multiple of these pipelines? If the latter, could you provide a full sample config?

I just want to try, as closely as possible, to match your environment.

Also, would you be open to running Vector under valgrind if we provided a debug build? That would probably help narrow down where the allocations are going pretty quickly. Are you just running Vector natively on the host? Or one of the docker images?

leandrojmp commented 2 years ago

Hello @jszwedko !

Let me try to give more context about this data.

data

The source of this data are logs from Cloudflare HTTP requests that are sent by Cloudflare to buckets in a cloud service, those files are downloaded to the server, a CentOS 8 VM, using a custom python script that is scheduled in the crontab, the glob mirrors the structure of the buckets.

I have to collect logs from multiple companies and each company can have multiple domains, the structure of the path used in the glob is like the following:

/var/opt/logs-http-cf/json/companyA/domainA/YYYYMMdd/*.json
/var/opt/logs-http-cf/json/companyA/domainB/YYYYMMdd/*.json
/var/opt/logs-http-cf/json/companyA/domainC/YYYYMMdd/*.json

/var/opt/logs-http-cf/json/companyB/domainX/YYYYMMdd/*.json
/var/opt/logs-http-cf/json/companyB/domainY/YYYYMMdd/*.json

configuration

Since I'm coming from a Logstash background, I tried to replicate the config file organization I had.

In logstash I used the pipelines.yml to configure one pipeline for each company, in this example the glob in the configuration for companyA would be /var/opt/logs-http-cf/json/companyA/*/*/*.json and for companyB it would be /var/opt/logs-http-cf/json/companyB/*/*/*.json, I'm following the same approach with vector configurations.

And since vector does not have nothing like the pipelines.yml used by logstash, to keep using one configuration per company I've created different .toml files, one for each company just changing the input and sink names and the glob, as the example below:

companyA.toml

[sources.http_companyA]
data_dir = "/var/opt/vector/dbs"
type = "file"
include = ["/var/opt/logs-http-cf/json/companyA/*/*/*.json"]
read_from = "beginning"
ignore_older_secs = 43200
remove_after_secs = 1
oldest_first = true
ignore_checkpoints = true
fingerprint.strategy = "device_and_inode"

[sinks.kafka_companyA]
type = "kafka"
inputs = [ "http_companyA" ]
bootstrap_servers = "kafka-brokers-list"
message_timeout_ms = 300_000
socket_timeout_ms = 60_000
topic = "http-companyA"
compression = "gzip"
encoding.codec = "text"
buffer.max_events = 1000

The companyB.toml would be the same, just changing every companyA to companyB.

I'm running just one vector instance, as a systemd service with the following configuration in the ExecStart.

ExecStart=/usr/bin/vector -c /etc/vector/pipelines/*.toml -qq

Inside /etc/vector/pipelines/ I have companyA.toml and companyB.toml, it is also running the dummy configuration in /etc/vector/vector.toml, I tried to remove it, but vector does not start without it.

files and document

The python script runs every minute and it downloads an average of ~ 500 files that will have an combined size around ~ 200 MB, the files have one json document per line.

The average line size is 2.5 KB and this is an example of how the lines looks like

{"BotScore":12,"BotScoreSrc":"Machine Learning","BotTags":["api"],"CacheCacheStatus":"dynamic","CacheResponseBytes":5439,"CacheResponseStatus":200,"CacheTieredFill":false,"ClientASN":"REDACTED","ClientCountry":"br","ClientDeviceType":"tablet","ClientIP":"REDACTED","ClientIPClass":"noRecord","ClientMTLSAuthCertFingerprint":"","ClientMTLSAuthStatus":"unknown","ClientRequestBytes":5753,"ClientRequestHost":"REDACTED","ClientRequestMethod":"GET","ClientRequestPath":"REDACTED","ClientRequestProtocol":"HTTP/2","ClientRequestReferer":"","ClientRequestScheme":"https","ClientRequestSource":"eyeball","ClientRequestURI":"REDACTED","ClientRequestUserAgent":"REDACTED","ClientSSLCipher":"AEAD-AES128-GCM-SHA256","ClientSSLProtocol":"TLSv1.3","ClientSrcPort":39008,"ClientTCPRTTMs":24,"ClientXRequestedWith":"","EdgeCFConnectingO2O":false,"EdgeColoCode":"SSA","EdgeColoID":416,"EdgeEndTimestamp":"2022-01-26T23:59:42Z","EdgePathingOp":"wl","EdgePathingSrc":"macro","EdgePathingStatus":"nr","EdgeRateLimitAction":"","EdgeRateLimitID":0,"EdgeRequestHost":"REDACTED","EdgeResponseBodyBytes":805,"EdgeResponseBytes":1854,"EdgeResponseCompressionRatio":4.63,"EdgeResponseContentType":"application/json","EdgeResponseStatus":200,"EdgeServerIP":"REDACTED","EdgeStartTimestamp":"2022-01-26T23:59:42Z","EdgeTimeToFirstByteMs":210,"FirewallMatchesActions":[],"FirewallMatchesRuleIDs":[],"FirewallMatchesSources":[],"OriginDNSResponseTimeMs":0,"OriginIP":"REDACTED","OriginRequestHeaderSendDurationMs":0,"OriginResponseBytes":0,"OriginResponseDurationMs":201,"OriginResponseHTTPExpires":"","OriginResponseHTTPLastModified":"","OriginResponseHeaderReceiveDurationMs":201,"OriginResponseStatus":200,"OriginResponseTime":201000000,"OriginSSLProtocol":"TLSv1.2","OriginTCPHandshakeDurationMs":0,"OriginTLSHandshakeDurationMs":0,"ParentRayID":"00","RayID":"REDACTED","SecurityLevel":"med","SmartRouteColoID":0,"UpperTierColoID":0,"WAFAction":"unknown","WAFFlags":"0","WAFMatchedVar":"","WAFProfile":"unknown","WAFRuleID":"","WAFRuleMessage":"","WorkerCPUTime":0,"WorkerStatus":"unknown","WorkerSubrequest":false,"WorkerSubrequestCount":0,"ZoneID":"REDACTED","ZoneName":"REDACTEDDOMAIN"}

If you need more information about the format of the documents, it can be found in the cloudflare documentation.

I'm not doing any parsing with vector, just reading the lines and sending to kafka, the parsing is still being done by Logstash consuming from kafka.

At the moment I have a total of 40 different pipelines, 20 for HTTP Requests, and 20 for Firewall Events, that have a similar document/size, just 2 of those pipelines are running on vector and the rest are still in logstash.

I'm planning in keep the same organization I have with logstash, one configuration file for each pipeline, so it would be at least 40 .toml files in the config path after I migrated everything. (Does this make any difference for vector?)

After vector reads the files and send the content to kafka, it deletes the files from disk

The processing of the files is really fast, when I migrated those 2 first pipelines from Logstash to Vector, I had a backlog of files matching the globs, something around 50k files, vector had no issue processing them, no sudden memory increase, the only issue at the moment is that the memory keeps increasing with the time.

debug

To run vector with valgrind I would need to check with the rest of the security team if we can do this test.

Also, I don't know if it would help, but I could generate a strace output for the vector process.

I'm running vector direct in the host, it was installed using the x86_64 rpm package and I just updated to version 19.1.

I also could migrate more of my pipelines from Logstash to Vector to see if this has any influence in the memory increase.

Hope the explanation helps!

jszwedko commented 2 years ago

Hi @leandrojmp !

Thanks for the additional details! I talked with a couple of team members who mentioned:

leandrojmp commented 2 years ago

Would this two screenshots help?

top command

top-vector

the time in this image is in UTC

monitoring graph

graph-vector

the time in this image is in UTC-3

The fall in the graph is when I restarted vector after I updated to 19.1.

As you can see both lines slowly increase with time.

I could test running with cgroups later, I think that I will let it running for the next couple of days to see what will happen when it starts approaching the memory limit.

leandrojmp commented 2 years ago

Just an update, I'm letting vector runs to see what will happen, the memory still has the same behaviour, it is increasing in small steps, right now it is at 70%.

leandrojmp commented 2 years ago

Yeah, it looks like a memory leak, the vector service just got killed today.

It seems that vector killed itself instead of the kernel killing it, but I may be wrong, @jszwedko.

Below is the system logs:

Feb 11 22:19:35 REDACTED-SERVER kernel: vector-worker invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 11 22:19:35 REDACTED-SERVER kernel: vector-worker cpuset=/ mems_allowed=0
Feb 11 22:19:35 REDACTED-SERVER kernel: CPU: 2 PID: 29484 Comm: vector-worker Tainted: G               ------------ T 3.10.0-1160.49.1.el7.x86_64 #1
Feb 11 22:19:35 REDACTED-SERVER kernel: Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Feb 11 22:19:35 REDACTED-SERVER kernel: Call Trace:
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa5383539>] dump_stack+0x19/0x1b
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa537e5d8>] dump_header+0x90/0x229
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4f0f1db>] ? cred_has_capability+0x6b/0x120
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4dc254d>] oom_kill_process+0x2cd/0x490
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4f0f2be>] ? selinux_capable+0x2e/0x40
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4dc2c3a>] out_of_memory+0x31a/0x500
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4dc9854>] __alloc_pages_nodemask+0xad4/0xbe0
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4e193b8>] alloc_pages_current+0x98/0x110
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4dbe007>] __page_cache_alloc+0x97/0xb0
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4dc0fa0>] filemap_fault+0x270/0x420
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffc035591e>] __xfs_filemap_fault+0x7e/0x1d0 [xfs]
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffc0355b1c>] xfs_filemap_fault+0x2c/0x30 [xfs]
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4dee7aa>] __do_fault.isra.61+0x8a/0x100
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4deed5c>] do_read_fault.isra.63+0x4c/0x1b0
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa4df65a0>] handle_mm_fault+0xa20/0xfb0
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa5390653>] __do_page_fault+0x213/0x500
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa5390975>] do_page_fault+0x35/0x90
Feb 11 22:19:35 REDACTED-SERVER kernel: [<ffffffffa538c778>] page_fault+0x28/0x30
Feb 11 22:19:35 REDACTED-SERVER kernel: Mem-Info:
Feb 11 22:19:35 REDACTED-SERVER kernel: active_anon:1843735 inactive_anon:35065 isolated_anon:0#012 active_file:3224 inactive_file:6913 isolated_file:0#012 unevictable:0 dirty:0 writeback:0 unstable:0#012 slab_reclaimable:6052 slab_unreclaimable:10522#012 mapped:3590 shmem:102725 pagetables:5837 bounce:0#012 free:26329 free_pcp:2235 free_cma:0
Feb 11 22:19:35 REDACTED-SERVER kernel: Node 0 DMA free:15836kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15920kB managed:15836kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb 11 22:19:35 REDACTED-SERVER kernel: lowmem_reserve[]: 0 2827 7801 7801
Feb 11 22:19:35 REDACTED-SERVER kernel: Node 0 DMA32 free:44340kB min:24456kB low:30568kB high:36684kB active_anon:2659560kB inactive_anon:55296kB active_file:7580kB inactive_file:19244kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3126080kB managed:2895640kB mlocked:0kB dirty:0kB writeback:0kB mapped:2648kB shmem:166796kB slab_reclaimable:7148kB slab_unreclaimable:13436kB kernel_stack:800kB pagetables:6384kB unstable:0kB bounce:0kB free_pcp:4220kB local_pcp:616kB free_cma:0kB writeback_tmp:0kB pages_scanned:7 all_unreclaimable? no
Feb 11 22:19:36 REDACTED-SERVER kernel: lowmem_reserve[]: 0 0 4973 4973
Feb 11 22:19:36 REDACTED-SERVER kernel: Node 0 Normal free:46752kB min:42992kB low:53740kB high:64488kB active_anon:4715380kB inactive_anon:84964kB active_file:5292kB inactive_file:7684kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5242880kB managed:5093352kB mlocked:0kB dirty:0kB writeback:0kB mapped:11644kB shmem:244104kB slab_reclaimable:17060kB slab_unreclaimable:28652kB kernel_stack:3184kB pagetables:16964kB unstable:0kB bounce:0kB free_pcp:4236kB local_pcp:668kB free_cma:0kB writeback_tmp:0kB pages_scanned:20763 all_unreclaimable? yes
Feb 11 22:19:36 REDACTED-SERVER kernel: lowmem_reserve[]: 0 0 0 0
Feb 11 22:19:36 REDACTED-SERVER kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 2*32kB (U) 2*64kB (U) 2*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15836kB
Feb 11 22:19:36 REDACTED-SERVER kernel: Node 0 DMA32: 176*4kB (UEM) 319*8kB (UEM) 681*16kB (UE) 271*32kB (UEM) 181*64kB (UEM) 70*128kB (UEM) 5*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44648kB
Feb 11 22:19:36 REDACTED-SERVER kernel: Node 0 Normal: 1475*4kB (UEM) 624*8kB (UEM) 600*16kB (UEM) 440*32kB (UEM) 123*64kB (UEM) 33*128kB (UEM) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46924kB
Feb 11 22:19:36 REDACTED-SERVER kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Feb 11 22:19:36 REDACTED-SERVER kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Feb 11 22:19:36 REDACTED-SERVER kernel: 112873 total pagecache pages
Feb 11 22:19:36 REDACTED-SERVER kernel: 0 pages in swap cache
Feb 11 22:19:36 REDACTED-SERVER kernel: Swap cache stats: add 0, delete 0, find 0/0
Feb 11 22:19:36 REDACTED-SERVER kernel: Free swap  = 0kB
Feb 11 22:19:36 REDACTED-SERVER kernel: Total swap = 0kB
Feb 11 22:19:36 REDACTED-SERVER kernel: 2096220 pages RAM
Feb 11 22:19:36 REDACTED-SERVER kernel: 0 pages HighMem/MovableOnly
Feb 11 22:19:36 REDACTED-SERVER kernel: 95013 pages reserved
Feb 11 22:19:36 REDACTED-SERVER kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Feb 11 22:19:36 REDACTED-SERVER kernel: [  354]     0   354    13861     4277      32        0             0 systemd-journal
Feb 11 22:19:36 REDACTED-SERVER kernel: [  381]     0   381    12058      471      25        0         -1000 systemd-udevd
Feb 11 22:19:36 REDACTED-SERVER kernel: [  476]     0   476    13883      112      27        0         -1000 auditd
Feb 11 22:19:36 REDACTED-SERVER kernel: [  508]    81   508    16612      182      33        0          -900 dbus-daemon
Feb 11 22:19:36 REDACTED-SERVER kernel: [  513]     0   513     1097       35       8        0             0 acpid
Feb 11 22:19:36 REDACTED-SERVER kernel: [  515]   999   515   153251     2553      64        0             0 polkitd
Feb 11 22:19:36 REDACTED-SERVER kernel: [  526]   998   526    24067       77      18        0             0 chronyd
Feb 11 22:19:36 REDACTED-SERVER kernel: [  533]     0   533    27551       32      13        0             0 agetty
Feb 11 22:19:36 REDACTED-SERVER kernel: [  535]     0   535    27551       33      10        0             0 agetty
Feb 11 22:19:36 REDACTED-SERVER kernel: [  567]     0   567    89683     5616      98        0             0 firewalld
Feb 11 22:19:36 REDACTED-SERVER kernel: [  570]     0   570   137007      532      84        0             0 NetworkManager
Feb 11 22:19:36 REDACTED-SERVER kernel: [  695]     0   695    25747      533      55        0             0 dhclient
Feb 11 22:19:36 REDACTED-SERVER kernel: [  923]     0   923   143570     2829      99        0             0 tuned
Feb 11 22:19:36 REDACTED-SERVER kernel: [  925]     0   925    28953     3462      17        0          -999 google_guest_ag
Feb 11 22:19:36 REDACTED-SERVER kernel: [  926]     0   926   185420   168565     342        0             0 google_osconfig
Feb 11 22:19:36 REDACTED-SERVER kernel: [  927]     0   927   141929     3633     158        0             0 rsyslogd
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 1133]     0  1133    28246      257      56        0         -1000 sshd
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 1176]     0  1176     6596       84      18        0             0 systemd-logind
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 1228]     0  1228    31605      171      19        0             0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 1344]     0  1344    22447      259      43        0             0 master
Feb 11 22:19:36 REDACTED-SERVER kernel: [ 1364]    89  1364    22517      267      45        0             0 qmgr
Feb 11 22:19:36 REDACTED-SERVER kernel: [16769]     0 16769   226034      287      27        0             0 newrelic-infra-
Feb 11 22:19:36 REDACTED-SERVER kernel: [16777]     0 16777   361476     4475      63        0             0 newrelic-infra
Feb 11 22:19:36 REDACTED-SERVER kernel: [29482]   997 29482  2708265  1526708    3936        0             0 vector
Feb 11 22:19:36 REDACTED-SERVER kernel: [21550]    89 21550    22473      252      45        0             0 pickup
Feb 11 22:19:36 REDACTED-SERVER kernel: [28110]     0 28110    44081      233      42        0             0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [28112]     0 28112    28320       54      13        0             0 run-parts
Feb 11 22:19:36 REDACTED-SERVER kernel: [28124]     0 28124   412099    55771     249        0             0 yum-cron
Feb 11 22:19:36 REDACTED-SERVER kernel: [28125]     0 28125    28410       37      13        0             0 awk
Feb 11 22:19:36 REDACTED-SERVER kernel: [29095]     0 29095    32665      175      20        0             0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29096]     0 29096    32665      175      20        0             0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29097]     0 29097    32665      175      20        0             0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29102]     0 29102    32665      178      20        0             0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29103]     0 29103    32665      178      20        0             0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29105]     0 29105    32665      175      20        0             0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29109]     0 29109    32122      179      18        0             0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: [29110]     0 29110    32122      179      18        0             0 crond
Feb 11 22:19:36 REDACTED-SERVER kernel: Out of memory: Kill process 29482 (vector) score 764 or sacrifice child
Feb 11 22:19:36 REDACTED-SERVER kernel: Killed process 29482 (vector), UID 997, total-vm:10833060kB, anon-rss:6106832kB, file-rss:0kB, shmem-rss:0kB
Feb 11 22:19:36 REDACTED-SERVER dhclient[695]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER dhclient[695]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER NetworkManager[570]: REDACTED
Feb 11 22:19:36 REDACTED-SERVER systemd: vector.service: main process exited, code=killed, status=9/KILL
Feb 11 22:19:37 REDACTED-SERVER dbus[508]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Feb 11 22:19:37 REDACTED-SERVER systemd: Unit vector.service entered failed state.
Feb 11 22:19:37 REDACTED-SERVER dhclient[695]: bound to REDACTED -- renewal in 1045 seconds.
Feb 11 22:19:37 REDACTED-SERVER systemd: vector.service failed.

I updated now to version 0.20 and will monitor if it has the same behavior.

leandrojmp commented 2 years ago

The version 0.20.0 has the same behavior, the memory keeps increasing in small amounts until it is full and vector is killed with an out of memory error.

The workaround at the momento is to restart the vector server daily with a cron job.

@jszwedko Is there any other kind of information I could gather to help track this issue?

jszwedko commented 2 years ago

Hi @leandrojmp ,

I think this might actually be the same issue as #11995 . Do you observe the internal_metrics_cardinality_total monotonically increasing?

leandrojmp commented 2 years ago

Hello,

How do I check that? I'm not monitoring the vector process. Is there a command to get stats from a running vector process?

jszwedko commented 2 years ago

Hello,

How do I check that? I'm not monitoring the vector process. Is there a command to get stats from a running vector process?

Hey! To collect that metric, you will need to add the internal_metrics source and then feed that to a metrics output like prometheus_exporter (which you could then curl to get the metrics).

That is an interesting idea though, to allow interrogating that without wiring up the internal_metrics source. We could enable that through the GraphQL API that already exists.

leandrojmp commented 2 years ago

Hello,

I created the following pipeline to get the metrics.

[sources.vector_metrics]
type = "internal_metrics"
scrape_interval_secs = 10

[sinks.kafka_metrics]
type = "kafka"
inputs = [ "vector_metrics" ]
bootstrap_servers = "kafka-brokers"
message_timeout_ms = 300_000
socket_timeout_ms = 60_000
topic = "vector-metrics"
compression = "gzip"
encoding.codec = "text"
buffer.max_events = 1000

Consuming the data from kafka I saw that the internal_metrics_cardinality_total value is different according to the source and is always increasing.

jszwedko commented 2 years ago

Hi @leandrojmp ,

Gotcha, this does sound like it is the same issue as https://github.com/vectordotdev/vector/issues/11995 then, thanks for verifying! We plan to address that issue in the coming quarter so it should resolve this too.

jszwedko commented 2 years ago

I'll close this issue since we are tracking in https://github.com/vectordotdev/vector/issues/11995 . Please follow along there.

jszwedko commented 1 year ago

Reopening based on https://github.com/vectordotdev/vector/issues/14789#issuecomment-1292347011

jszwedko commented 1 year ago

We believe this to be closed by https://github.com/vectordotdev/vector/pull/18634

leandrojmp commented 1 year ago

@jszwedko the next release, 0.33, right?

I've just updated to 0.32, but still didn't check the memory, we have a restart script on cron, but will wait for 0.33 to confirm it.

jszwedko commented 1 year ago

@jszwedko the next release, 0.33, right?

I've just updated to 0.32, but still didn't check the memory, we have a restart script on cron, but will wait for 0.33 to confirm it.

Ah, yes, that is correct. The expected fix will be included in v0.33.0 which is slated for this week.