influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0
28.79k stars 3.55k forks

OOM with Influx 2.0.7 running inside a Docker container with memory constraints #21765

Open opsxcq opened 3 years ago

opsxcq commented 3 years ago

I'm facing an issue with Influx 2.0.7 running in a Docker container with memory constraints. I'm aware of the memory tradeoffs for databases, and I'm OK with it being slower due to limited resources. I tried to limit every memory/buffer parameter I could to keep usage under control, but influxd keeps dying from OOM while importing big historical datasets (1.5 million data points) into 5 measurements with about 5 tags each.
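For context, the import is essentially one big line-protocol write. A minimal sketch of the shape of it (sample data; the bucket name, file names, and batch size are placeholders, and the real dataset is far larger):

```shell
# Placeholder sketch of the import: split a line-protocol export into
# batches and write each one separately (dry run: echo shows the command).
printf 'prices,symbol=AAA close=1 1\nprices,symbol=AAA close=2 2\nprices,symbol=AAA close=3 3\n' > history.lp
split -l 2 history.lp batch_
for f in batch_*; do
  echo influx write --bucket market --file "$f"
done
```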

Below is my deployment code (Ansible):

  - name: "Monitoring | Influxdb"
    docker_container:
      name: influx
      image: influxdb:2.0.7
      restart_policy: unless-stopped
      memory: 4g
      #memory_swap: 5
      #memory_swappiness: 5
      env:
        INFLUXDB_UDP_ENABLED: "true"
        INFLUXDB_REPORTING_DISABLED: "true"
        INFLUXDB_DATA_CACHE_MAX_MEMORY_SIZE: "2G"
        INFLUXDB_DATA_INDEX_VERSION: "tsi1"
        INFLUXDB_CONFIG_PATH: "/etc/influxdb2/influxdb.config"
      command:
        - "--storage-max-concurrent-compactions=1"
        - "--storage-series-file-max-concurrent-snapshot-compactions=1"
        - "--storage-compact-full-write-cold-duration=2h"
        - "--storage-cache-snapshot-write-cold-duration=5m"
        - "--query-max-memory-bytes=107374182"
        - "--http-read-timeout=0"
        - "--http-write-timeout=0"
        - "--bolt-path=/data/bolt"
        - "--engine-path=/data/engine"
        #- "--storage-series-id-set-cache-size=1024"
        - "--storage-retention-check-interval=48h"
        - "--query-queue-size=102400"
        - "--query-concurrency=16"
        - "--reporting-disabled=true"
      ports:
        - "8086:8086"
        - "8089:8089/udp"
      volumes:
        - "/data/backups/:/backups"
        - "influx-data:/data"
        - "/config/influx-configs:/etc/influxdb2/influx-configs"
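For scale, even with these flags the query engine alone is allowed a fair amount of memory: query-concurrency × query-max-memory-bytes works out to roughly

```shell
# worst-case memory the query engine may use with the flags above:
# query-concurrency (16) * query-max-memory-bytes (107374182)
echo "$(( 16 * 107374182 )) bytes"   # prints 1717986912 bytes (~1.6 GiB)
```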

Steps to reproduce:

  1. Start influx with the above command
  2. Try to load some historical datasets
  3. After a while, the kernel OOM killer fires:
[605511.957732] influxd invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
[605511.957735] influxd cpuset=ebfab5d905bd35fddc9a6931e6b784d213f51c0a180292e17c1b4594ed15dffc mems_allowed=0
[605511.957746] CPU: 2 PID: 17527 Comm: influxd Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[605511.957748] Hardware name: Dell Inc. OptiPlex 3050/0JP3NX, BIOS 1.15.1 12/22/2020
[605511.957749] Call Trace:
[605511.957763]  dump_stack+0x66/0x81
[605511.957768]  dump_header+0x6b/0x283
[605511.957774]  oom_kill_process.cold.30+0xb/0x1cf
[605511.957782]  out_of_memory+0x1a5/0x450
[605511.957788]  mem_cgroup_out_of_memory+0xbe/0xd0
[605511.957794]  try_charge+0x63a/0x780
[605511.957801]  mem_cgroup_try_charge+0x86/0x190
[605511.957807]  ? pagecache_get_page+0x30/0x2c0
[605511.957812]  mem_cgroup_try_charge_delay+0x1c/0x40
[605511.957817]  do_swap_page+0x224/0x8d0
[605511.957822]  __handle_mm_fault+0x87c/0x11f0
[605511.957827]  ? __switch_to_asm+0x41/0x70
[605511.957833]  handle_mm_fault+0xd6/0x200
[605511.957838]  __do_page_fault+0x249/0x4f0
[605511.957844]  ? page_fault+0x8/0x30
[605511.957849]  page_fault+0x1e/0x30
[605511.957854] RIP: 0033:0x4272e6
[605511.957858] Code: ff c6 0f b6 3b 49 89 cb 89 f1 41 89 fc d3 ef 49 83 fb 08 74 0a 0f ba e7 04 0f 83 c3 00 00 00 41 0f a3 cc 90 73 af 4b 8d 3c 0b <48> 8b 3f 48 85 ff 74 a3 49 89 fc 4c 29 cf 48 39 d7 72 98 48 89 5c
[605511.957861] RSP: 002b:00007f65058bf850 EFLAGS: 00010247
[605511.957864] RAX: 0000000000203016 RBX: 00007f64fa460dc8 RCX: 0000000000000001
[605511.957867] RDX: 0000000000001300 RSI: 0000000000000001 RDI: 000000c05941b908
[605511.957869] RBP: 00007f65058bf8d0 R08: 00007f64fa5bffff R09: 000000c05941b900
[605511.957871] R10: 000000c000062698 R11: 0000000000000008 R12: 00000000000000da
[605511.957872] R13: 00000000058448e0 R14: 0000000000000000 R15: 0000000000000000
[605511.957876] Task in /docker/ebfab5d905bd35fddc9a6931e6b784d213f51c0a180292e17c1b4594ed15dffc killed as a result of limit of /docker/ebfab5d905bd35fddc9a6931e6b784d213f51c0a180292e17c1b4594ed15dffc
[605511.957885] memory: usage 3983680kB, limit 4194304kB, failcnt 46943
[605511.957888] memory+swap: usage 8388608kB, limit 8388608kB, failcnt 2457
[605511.957890] kmem: usage 37268kB, limit 9007199254740988kB, failcnt 0
[605511.957891] Memory cgroup stats for /docker/ebfab5d905bd35fddc9a6931e6b784d213f51c0a180292e17c1b4594ed15dffc: cache:28KB rss:3946172KB rss_huge:1671168KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:4405216KB inactive_anon:657844KB active_anon:3288512KB inactive_file:52KB active_file:4KB unevictable:0KB
[605511.957906] Tasks state (memory values in pages):
[605511.957907] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[605511.958069] [  16803]  1000 16803  2572919   997119 18964480  1103566             0 influxd
[605511.958075] Memory cgroup out of memory: Kill process 16803 (influxd) score 1003 or sacrifice child
[605511.958169] Killed process 16803 (influxd) total-vm:10291676kB, anon-rss:3935536kB, file-rss:52940kB, shmem-rss:0kB
[605513.118538] oom_reaper: reaped process 16803 (influxd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Expected behavior: Guidance on how to run InfluxDB with limited memory without suffering such problems.

Actual behavior: Influx dies and gets stuck in a restart loop due to OOM.

Environment info:

Config: Configuration is set via command-line arguments in the snippet above.

Logs: The logs are quite big, with a lot of very similar entries like those below:

ts=2021-06-30T06:13:52.067412Z lvl=info msg="Opened shard" log_id=0V2tdsJG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/data/engine/data/591fc0b0c200d622/autogen/2205 duration=38.368ms
ts=2021-06-30T06:13:52.070047Z lvl=info msg="Reading file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=cacheloader path=/data/engine/wal/591fc0b0c200d622/autogen/2150/_00001.wal size=7009272
ts=2021-06-30T06:13:52.100446Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:13:52.103596Z lvl=info msg="Opened file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=filestore path=/data/engine/data/591fc0b0c200d622/autogen/2207/000000008-000000001.tsm id=1 duration=0.427ms
ts=2021-06-30T06:13:52.105162Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:13:52.105155Z lvl=info msg="Opened file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=filestore path=/data/engine/data/591fc0b0c200d622/autogen/2207/000000006-000000002.tsm id=0 duration=1.140ms
ts=2021-06-30T06:13:52.106545Z lvl=info msg="Opened shard" log_id=0V2tdsJG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/data/engine/data/591fc0b0c200d622/autogen/2207 duration=38.987ms
ts=2021-06-30T06:13:52.129169Z lvl=info msg="Opened file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=filestore path=/data/engine/data/591fc0b0c200d622/autogen/2206/000000006-000000002.tsm id=0 duration=15.406ms
ts=2021-06-30T06:13:52.129476Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:13:52.129518Z lvl=info msg="Opened shard" log_id=0V2tdsJG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/data/engine/data/591fc0b0c200d622/autogen/2206 duration=78.383ms
ts=2021-06-30T06:13:52.157937Z lvl=info msg="Opened file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=filestore path=/data/engine/data/591fc0b0c200d622/autogen/2208/000000006-000000002.tsm id=0 duration=11.988ms
ts=2021-06-30T06:13:52.158635Z lvl=info msg="Opened shard" log_id=0V2tdsJG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/data/engine/data/591fc0b0c200d622/autogen/2208 duration=51.819ms
ts=2021-06-30T06:13:52.215262Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:13:52.226000Z lvl=info msg="Reading file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=cacheloader path=/data/engine/wal/591fc0b0c200d622/autogen/2151/_00001.wal size=7610704
ts=2021-06-30T06:13:52.253481Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:13:52.262175Z lvl=info msg="Reading file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=cacheloader path=/data/engine/wal/591fc0b0c200d622/autogen/2152/_00001.wal size=6853353
ts=2021-06-30T06:14:05.797081Z lvl=info msg="Opened shard" log_id=0V2tdsJG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/data/engine/data/591fc0b0c200d622/autogen/2149 duration=13868.619ms
ts=2021-06-30T06:14:05.870651Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:14:05.882055Z lvl=info msg="Opened file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=filestore path=/data/engine/data/591fc0b0c200d622/autogen/2209/000000007-000000001.tsm id=1 duration=0.894ms
ts=2021-06-30T06:14:05.886611Z lvl=info msg="Opened file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=filestore path=/data/engine/data/591fc0b0c200d622/autogen/2209/000000005-000000002.tsm id=0 duration=3.016ms
ts=2021-06-30T06:14:05.888572Z lvl=info msg="Opened shard" log_id=0V2tdsJG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/data/engine/data/591fc0b0c200d622/autogen/2209 duration=89.604ms
ts=2021-06-30T06:14:06.038251Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:14:06.047253Z lvl=info msg="Reading file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=cacheloader path=/data/engine/wal/591fc0b0c200d622/autogen/2153/_00001.wal size=6063149
ts=2021-06-30T06:14:06.194766Z lvl=info msg="Opened shard" log_id=0V2tdsJG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/data/engine/data/591fc0b0c200d622/autogen/2150 duration=14194.929ms
ts=2021-06-30T06:14:06.252163Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:14:06.255755Z lvl=info msg="Reading file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=cacheloader path=/data/engine/wal/591fc0b0c200d622/autogen/221/_00001.wal size=58935
ts=2021-06-30T06:14:06.334527Z lvl=info msg="Opened shard" log_id=0V2tdsJG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/data/engine/data/591fc0b0c200d622/autogen/221 duration=139.432ms
ts=2021-06-30T06:14:06.445857Z lvl=info msg="Opened shard" log_id=0V2tdsJG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/data/engine/data/591fc0b0c200d622/autogen/2152 duration=14286.931ms
ts=2021-06-30T06:14:06.495199Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:14:06.499168Z lvl=info msg="Opened shard" log_id=0V2tdsJG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/data/engine/data/591fc0b0c200d622/autogen/2151 duration=14369.603ms
ts=2021-06-30T06:14:06.499803Z lvl=info msg="Reading file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=cacheloader path=/data/engine/wal/591fc0b0c200d622/autogen/2154/_00001.wal size=6955768
ts=2021-06-30T06:14:06.532590Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:14:06.556118Z lvl=info msg="Opened file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=filestore path=/data/engine/data/591fc0b0c200d622/autogen/2210/000000006-000000002.tsm id=0 duration=2.915ms
ts=2021-06-30T06:14:06.556807Z lvl=info msg="Opened shard" log_id=0V2tdsJG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/data/engine/data/591fc0b0c200d622/autogen/2210 duration=109.049ms
ts=2021-06-30T06:14:06.619907Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:14:06.629603Z lvl=info msg="Opened file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=filestore path=/data/engine/data/591fc0b0c200d622/autogen/2211/000000007-000000001.tsm id=1 duration=0.832ms
ts=2021-06-30T06:14:06.632960Z lvl=info msg="Opened file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=filestore path=/data/engine/data/591fc0b0c200d622/autogen/2211/000000005-000000002.tsm id=0 duration=3.081ms
ts=2021-06-30T06:14:06.633474Z lvl=info msg="Opened shard" log_id=0V2tdsJG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/data/engine/data/591fc0b0c200d622/autogen/2211 duration=76.374ms
ts=2021-06-30T06:14:06.672583Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:14:06.678932Z lvl=info msg="Reading file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=cacheloader path=/data/engine/wal/591fc0b0c200d622/autogen/2155/_00001.wal size=7561101
ts=2021-06-30T06:14:06.787156Z lvl=info msg="index opened with 8 partitions" log_id=0V2tdsJG000 service=storage-engine index=tsi
ts=2021-06-30T06:14:06.801331Z lvl=info msg="Opened file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=filestore path=/data/engine/data/591fc0b0c200d622/autogen/2156/000000002-000000002.tsm id=0 duration=11.179ms
ts=2021-06-30T06:14:06.801960Z lvl=info msg="Reading file" log_id=0V2tdsJG000 service=storage-engine engine=tsm1 service=cacheloader path=/data/engine/wal/591fc0b0c200d622/autogen/2156/_00005.wal size=9032745

Performance: Because influxd gets stuck in a restart loop and never brings up the HTTP interface, it's hard to run the command to collect profiler information. I'm awaiting further instructions on how to do that.

# Commands should be run when the bug is actively happening.
# Note: This command will run for ~30 seconds.
curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=30s"
iostat -xd 1 30 > iostat.txt
# Attach the `profiles.tar.gz` and `iostat.txt` output files.
angelademarco commented 3 years ago

Same issue in my company. We were considering migrating from Timescale to Influx, but because of this we can't proceed with testing. Apparently Timescale is much better at memory management.

opsxcq commented 3 years ago

@danxmoran let me know if you need any additional information regarding this bug

angelademarco commented 3 years ago

I'm still having the same problem. I simulated the same scenario on TimescaleDB with 1 GB of RAM, and it could keep up with the metrics pretty well. I'm looking forward to getting at least the same performance with InfluxDB 2 on 8 GB of RAM.

williamhbaker commented 3 years ago

@opsxcq would you be able to tell us anything more about your historical dataset? You mention the number of series and tags in the original issue which is great information - I'm also wondering about how the data is distributed over time, etc. If you have a sample of the data that you could provide that would be great as well. Also how are you trying to import the data - using the influx CLI or some other way?

@angelademarco similar questions for you... From your description in #21766, I've been running some tests on an EC2 instance with 4GB of RAM with influxd running in Docker, and I'm able to OOM it with certain kinds of data (particularly when the data points are spread out over a large/randomized timeframe) but not with others. I'm synthetically generating my data, of course; if you have more information you could share about the data you're seeing these crashes with, that would be very useful! It sounds like you're able to simulate the data/crashes, so anything you could provide about how you're doing that would be helpful too!

opsxcq commented 3 years ago

@wbaker85 yes, it is historical market data since 1950 at daily intervals (the range goes from 100 to around 8000 symbols; it changes over time), plus recent data with more granularity (hourly; 1-, 5-, and 15-minute) but only for the last few months. I still didn't try to import all my ticker data.

I split them into two measurements: one is the tickers, and the other is "daily", which contains the daily price data.

tennox commented 2 years ago

For reference, here's a related recent issue in the forums: https://community.influxdata.com/t/execution-of-heavy-queries-result-in-a-crash/22637/3

I have also encountered crashes on bigger queries and suppose it's a similar issue, but I haven't been able to investigate further yet.

trylaarsdam commented 2 years ago

Just wanted to mention this is still an issue in v2.2 - I can run the same query multiple times, but there appears to be some sort of memory leak: the influx instance eventually uses all available system memory, which causes the entire server to freeze until a hard restart. Granted, I'm not running in Docker like the OP, but I'm still hitting that out-of-memory error:

[screenshots: system memory usage graphs]

Sometimes the server can recover after an hour or so; sometimes it just stays frozen until I notice and restart it. (The gap in the memory graph is Google's Ops Agent being killed on the server due to lack of resources, which stops it from reporting metrics.)

riosje commented 2 years ago

I think this is happening to everyone using InfluxDB 2.x. It doesn't happen with version 1.8; the memory management on Influx 1.8 looks like this: [screenshot: memory usage graph]

But in Influx 2.3 it just dies after a while because it runs out of memory.

opsxcq commented 1 year ago

@riosje I can confirm that the same issue happens with the influxdb:2.6.0 docker image.

polarnik commented 5 months ago

Hello! I can confirm that the same issue happens with the influxdb:2.7.6 docker image.

I have a big database with many shards. The result of the simple command docker run --user=influxdb -d -p 8086:8086 --name influxdb --env-file env.list -v /home/influxdb:/var/lib/influxdb2 influxdb:2.7.6-alpine

is OOM, because the server allocates memory for every file (WAL or indexes) without GC. The server allocates 20 GiByte of memory and then stops with an error. The instance limit is 20 GiByte.

The workaround:

docker rm influxdb
# remove indexes
find /home/influxdb/engine/data/ -type d -name _series -exec rm -r {} +
find /home/influxdb/engine/data/ -type d -name index -exec rm -r {} +
# start
docker run  --user=influxdb -d -p 8086:8086  --name influxdb --env-file env.list -v /home/influxdb:/var/lib/influxdb2 influxdb:2.7.6-alpine

The server will recreate the indexes, allocating only 7 GiByte of memory, and then works well.

The full workaround solution:

[Unit]
Description=InfluxDB Service
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
Restart=always
ExecStartPre=-/usr/bin/docker stop %n
ExecStartPre=-/usr/bin/docker rm %n
ExecStartPre=/usr/bin/docker pull influxdb:2.7.6-alpine
ExecStartPre=find /home/influxdb/engine/data/ -type d -name _series -exec rm -r {} +
ExecStartPre=find /home/influxdb/engine/data/ -type d -name index -exec rm -r {} +
ExecStart=docker run --rm --user=influxdb -d -p 8086:8086 -m 16g --name %n --env-file env.list -v /home/influxdb:/var/lib/influxdb2 influxdb:2.7.6-alpine
ExecStop=/usr/bin/docker stop %n

[Install]
WantedBy=default.target

Cons:

Update. I'm thinking about a new entrypoint:

docker run --user=influxdb --restart=on-failure --restart unless-stopped --entrypoint '/bin/bash' -d -p 8086:8086 -m 12g --name influx --env-file env.list -v /home/influxdb:/var/lib/influxdb2 influxdb:2.7.6-alpine --verbose -c "find /var/lib/influxdb2/engine/data/ -type d -name _series -exec rm -r {} + && find /var/lib/influxdb2/engine/data/ -type d -name index -exec rm -r {} + && /entrypoint.sh influxd"

It includes:

find /var/lib/influxdb2/engine/data/ -type d -name _series -exec rm -r {} +
&&
find /var/lib/influxdb2/engine/data/ -type d -name index -exec rm -r {} +
&&
/entrypoint.sh influxd
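To preview which directories the two find commands will remove, the same pattern can be run in list mode first (a mock engine layout is created here purely for illustration; in practice point the find at the real mounted engine directory):

```shell
# Mock engine layout for illustration only; the real path is
# /var/lib/influxdb2/engine/data inside the container.
mkdir -p engine/data/db1/autogen/1/index engine/data/db1/_series
find engine/data -type d \( -name index -o -name _series \)
```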
riosje commented 5 months ago

This is wild. Influx requires way too much RAM to barely operate, and with the release of Influx v3 they will not fix this, because they need people to buy the new license. I just moved to VictoriaMetrics; its performance is really good, and its architecture is way better than Influx's.

polarnik commented 5 months ago

offtop_mode_on

I just moved out to victoriaMetrics

My teammates would like a storage backend without strict cardinality limits, and they would like to use some complex scripts for getting metrics. They are familiar with Flux, but not with MetricsQL.

I used an NGINX proxy for data replication operations: https://gist.github.com/polarnik/cb6f22751e8d1590342198609243c529

So teammates had similar data in both VictoriaMetrics and InfluxDB. That was an old solution.

Now we use InfluxDB for raw data and complex Flux queries, and VictoriaMetrics only for aggregates and alerts. VictoriaMetrics has some limits too; we use it only for clean data.

offtop_mode_off

polarnik commented 5 months ago

My current workaround is docker run --user=influxdb --restart=on-failure --restart unless-stopped --entrypoint '/bin/bash' -d -p 8086:8086 --log-driver=syslog --name influx --env-file env.list -v /home/influxdb:/var/lib/influxdb2 influxdb:2.7.6-alpine --verbose -c "find /var/lib/influxdb2/engine/data/ -type d -name _series -exec rm -r {} + && find /var/lib/influxdb2/engine/data/ -type d -name index -exec rm -r {} + && /entrypoint.sh influxd"

The log (short version):

The cost of reindexing is: 139 seconds

For databases with size ~= 6 GiByte:

du -d 1 -b /home/influxdb/engine/
33619605     /home/influxdb/engine/wal
6299686756   /home/influxdb/engine/data
4096         /home/influxdb/engine/replicationq
6333314553   /home/influxdb/engine/

The initial memory allocation (RAM) is 8.6 GiByte.

The databases contain metrics from sitespeed.io tests: versions, browsers, etc. There are low-cardinality tags and a lot of metrics.

polarnik commented 5 months ago

I still have the OOM problem.

But I have logs. The root cause of the OOM problem is the "TSI log compaction" operation.

The first "TSI log compaction (start)" message:

May 14 11:09:23 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:23.317877Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=8 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start

The last "TSI log compaction (start)" message came only about two minutes later: May 14 11:11:53 influxdb f7d421060a18[742]: ts=2024-05-14T11:11:00.371732Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=8 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
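The compaction events can be filtered out of the syslog stream with a grep like this (one of the log lines above is recreated as sample data; in practice grep the real syslog file):

```shell
# Recreate a sample syslog line, then count TSI compaction start events.
cat > influx.log <<'EOF'
May 14 11:09:23 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:23.317877Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 op_name=tsi1_compact_log_file op_event=start
EOF
grep 'op_name=tsi1_compact_log_file' influx.log | grep -c 'op_event=start'   # prints 1
```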

Then the OOM happened:

Details

```
May 14 11:11:54 influxdb f7d421060a18[742]: memory allocation of 1056 bytes failed May 14 11:11:54 influxdb f7d421060a18[742]: SIGABRT: abort May 14 11:11:54 influxdb f7d421060a18[742]: PC=0x7f443269c792 m=149 sigcode=18446744073709551610 May 14 11:11:54 influxdb f7d421060a18[742]: signal arrived during cgo execution May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 677421 [syscall]: May 14 11:11:54 influxdb f7d421060a18[742]: runtime.cgocall(0x7f443235d3e0, 0xc218f518a8) May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/cgocall.go:157 +0x4b fp=0xc218f51880 sp=0xc218f51848 pc=0x7f44305bab8b May 14 11:11:54 influxdb f7d421060a18[742]: github.com/influxdata/flux/libflux/go/libflux._Cfunc_flux_analyze(0x7f42973f04d0, 0x7f380360bb20, 0xc0189b1310) May 14 11:11:54 influxdb f7d421060a18[742]: #011_cgo_gotypes.go:122 +0x50 fp=0xc218f518a8 sp=0xc218f51880 pc=0x7f4430b661f0 May 14 11:11:54 influxdb f7d421060a18[742]: github.com/influxdata/flux/libflux/go/libflux.AnalyzeWithOptions.func3(0xc218f51948?, 0x2?, 0x2?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/pkg/mod/github.com/influxdata/flux@v0.194.5/libflux/go/libflux/analyze.go:142 +0x7d fp=0xc218f518f0 sp=0xc218f518a8 pc=0x7f4430b680bd May 14 11:11:54 influxdb f7d421060a18[742]: github.com/influxdata/flux/libflux/go/libflux.AnalyzeWithOptions(0xc027fecc88, {{0x0?, 0x7f44339026c0?, 0x7f44335f69a0?}}) May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/pkg/mod/github.com/influxdata/flux@v0.194.5/libflux/go/libflux/analyze.go:142 +0x169 fp=0xc218f519f8 sp=0xc218f518f0 pc=0x7f4430b67d09 May 14 11:11:54 influxdb f7d421060a18[742]: github.com/influxdata/flux/runtime.AnalyzePackage({0x7f4433a0a9e8?, 0xc1b0121620?}, {0x7f4433a05c30?, 0xc027fecc88}) May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/pkg/mod/github.com/influxdata/flux@v0.194.5/runtime/analyze_libflux.go:23 +0xb2 fp=0xc218f51a78 sp=0xc218f519f8 pc=0x7f4430b723f2 May 14 11:11:54 influxdb f7d421060a18[742]: github.com/influxdata/flux/runtime.(*runtime).Eval(0x7f44363628a0, {0x7f4433a0a9e8, 0xc1b0121620}, {0x7f4433a05c30?, 0xc027fecc88?}, {0x7f44339fc378, 0x7f443639ff20}, {0xc140df0580, 0x2, 0x2}) May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/pkg/mod/github.com/influxdata/flux@v0.194.5/runtime/runtime.go:102 +0x85 fp=0xc218f51af8 sp=0xc218f51a78 pc=0x7f4430b742a5 May 14 11:11:54 influxdb f7d421060a18[742]: github.com/influxdata/flux/lang.(*AstProgram).getSpec(0xc1732f34a0, {0x7f4433a0a9e8, 0xc1b01215f0}, {0x7f44339e35f0?, 0x7f4433475460?}) May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/pkg/mod/github.com/influxdata/flux@v0.194.5/lang/compiler.go:446 +0x2e3 fp=0xc218f51c88 sp=0xc218f51af8 pc=0x7f44314b3123 May 14 11:11:54 influxdb f7d421060a18[742]: github.com/influxdata/flux/lang.(*AstProgram).Start(0xc1732f34a0, {0x7f4433a0a9e8, 0xc1b01213e0}, {0x7f4433a0c310, 0xc16ff75f90}) May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/pkg/mod/github.com/influxdata/flux@v0.194.5/lang/compiler.go:484 +0x1c9 fp=0xc218f51e98 sp=0xc218f51c88 pc=0x7f44314b3ca9 
May 14 11:11:54 influxdb f7d421060a18[742]: github.com/influxdata/influxdb/v2/query/control.(*Controller).executeQuery(0xc218f51fa8?, 0xc1d4e4a1a0) May 14 11:11:54 influxdb f7d421060a18[742]: #011/root/project/query/control/controller.go:489 +0x219 fp=0xc218f51f48 sp=0xc218f51e98 pc=0x7f4432188df9 May 14 11:11:54 influxdb f7d421060a18[742]: github.com/influxdata/influxdb/v2/query/control.(*Controller).processQueryQueue(...) May 14 11:11:54 influxdb f7d421060a18[742]: #011/root/project/query/control/controller.go:447 May 14 11:11:54 influxdb f7d421060a18[742]: github.com/influxdata/influxdb/v2/query/control.New.func1() May 14 11:11:54 influxdb f7d421060a18[742]: #011/root/project/query/control/controller.go:232 +0x76 fp=0xc218f51fe0 sp=0xc218f51f48 pc=0x7f44321871b6 May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit() May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc218f51fe8 sp=0xc218f51fe0 pc=0x7f4430625421 May 14 11:11:54 influxdb f7d421060a18[742]: created by github.com/influxdata/influxdb/v2/query/control.New in goroutine 1 May 14 11:11:54 influxdb f7d421060a18[742]: #011/root/project/query/control/controller.go:230 +0x9ec May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 1 [chan receive, 120 minutes]: May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gopark(0x4b939adcbb1a5?, 0x40000000?, 0x0?, 0x0?, 0x8bb2c97000?) May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:398 +0xce fp=0xc19a77bbc8 sp=0xc19a77bba8 pc=0x7f44305f1d0e May 14 11:11:54 influxdb f7d421060a18[742]: runtime.chanrecv(0xc000637440, 0x0, 0x1) May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/chan.go:583 +0x3cd fp=0xc19a77bc40 sp=0xc19a77bbc8 pc=0x7f44305bd1ad May 14 11:11:54 influxdb f7d421060a18[742]: runtime.chanrecv1(0xc000e942a0?, 0x7f4433a0aa20?) 
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/chan.go:442 +0x12 fp=0xc19a77bc68 sp=0xc19a77bc40 pc=0x7f44305bcdb2 May 14 11:11:54 influxdb f7d421060a18[742]: github.com/influxdata/influxdb/v2/cmd/influxd/launcher.NewInfluxdCommand.cmdRunE.func1() May 14 11:11:54 influxdb f7d421060a18[742]: #011/root/project/cmd/influxd/launcher/cmd.go:127 +0x156 fp=0xc19a77bd00 sp=0xc19a77bc68 pc=0x7f4432256416 May 14 11:11:54 influxdb f7d421060a18[742]: github.com/influxdata/influxdb/v2/kit/cli.NewCommand.func1(0xc000c58900?, {0x7f443639ff20?, 0x4?, 0x7f44326a9e5b?}) May 14 11:11:54 influxdb f7d421060a18[742]: #011/root/project/kit/cli/viper.go:54 +0x16 fp=0xc19a77bd10 sp=0xc19a77bd00 pc=0x7f4431059a96 May 14 11:11:54 influxdb f7d421060a18[742]: github.com/spf13/cobra.(*Command).execute(0xc000c65b80, {0xc00011e0b0, 0x0, 0x0}) May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:842 +0x694 fp=0xc19a77bdf8 sp=0xc19a77bd10 pc=0x7f4430fdc654 May 14 11:11:54 influxdb f7d421060a18[742]: github.com/spf13/cobra.(*Command).ExecuteC(0xc000c65b80) May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950 +0x389 fp=0xc19a77beb0 sp=0xc19a77bdf8 pc=0x7f4430fdcc09 May 14 11:11:54 influxdb f7d421060a18[742]: github.com/spf13/cobra.(*Command).Execute(...) 
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887
May 14 11:11:54 influxdb f7d421060a18[742]: main.main()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/root/project/cmd/influxd/main.go:61 +0x50a fp=0xc19a77bf40 sp=0xc19a77beb0 pc=0x7f443228c84a
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.main()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:267 +0x2d2 fp=0xc19a77bfe0 sp=0xc19a77bf40 pc=0x7f44305f1892
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc19a77bfe8 sp=0xc19a77bfe0 pc=0x7f4430625421
May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 2 [force gc (idle), 122 minutes]:
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:398 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x7f44305f1d0e
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goparkunlock(...)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:404
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.forcegchelper()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:322 +0xb8 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x7f44305f1b78
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x7f4430625421
May 14 11:11:54 influxdb f7d421060a18[742]: created by runtime.init.6 in goroutine 1
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:310 +0x1a
May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 3 [runnable]:
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goschedIfBusy()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:361 +0x28 fp=0xc000085778 sp=0xc000085760 pc=0x7f44305f1c28
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.bgsweep(0x0?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgcsweep.go:305 +0x151 fp=0xc0000857c8 sp=0xc000085778 pc=0x7f44305dbdd1
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gcenable.func1()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:200 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x7f44305d0e65
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x7f4430625421
May 14 11:11:54 influxdb f7d421060a18[742]: created by runtime.gcenable in goroutine 1
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:200 +0x66
May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 4 [GC scavenge wait]:
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gopark(0x975898e?, 0x8286c0?, 0x0?, 0x0?, 0x0?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:398 +0xce fp=0xc000085f70 sp=0xc000085f50 pc=0x7f44305f1d0e
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goparkunlock(...)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:404
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.(*scavengerState).park(0x7f4436368800)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa0 sp=0xc000085f70 pc=0x7f44305d95c9
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.bgscavenge(0x0?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa0 pc=0x7f44305d9b79
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gcenable.func2()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:201 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x7f44305d0e05
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x7f4430625421
May 14 11:11:54 influxdb f7d421060a18[742]: created by runtime.gcenable in goroutine 1
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:201 +0xa5
May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 18 [finalizer wait]:
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gopark(0x0?, 0x7f44339dd3d0?, 0x40?, 0xf?, 0x1000000010?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:398 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x7f44305f1d0e
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.runfinq()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mfinal.go:193 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x7f44305cfe87
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x7f4430625421
May 14 11:11:54 influxdb f7d421060a18[742]: created by runtime.createfing in goroutine 1
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mfinal.go:163 +0x3d
May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 19 [GC worker (idle), 2 minutes]:
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gopark(0x7f44363a2c40?, 0x3?, 0xcc?, 0x31?, 0x0?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:398 +0xce fp=0xc000080750 sp=0xc000080730 pc=0x7f44305f1d0e
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gcBgMarkWorker()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1295 +0xe5 fp=0xc0000807e0 sp=0xc000080750 pc=0x7f44305d2a25
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x7f4430625421
May 14 11:11:54 influxdb f7d421060a18[742]: created by runtime.gcBgMarkStartWorkers in goroutine 1
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1219 +0x1c
May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 34 [GC worker (idle), 2 minutes]:
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gopark(0x7f44363a2c40?, 0x3?, 0x2a?, 0xf?, 0x0?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:398 +0xce fp=0xc000486750 sp=0xc000486730 pc=0x7f44305f1d0e
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gcBgMarkWorker()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1295 +0xe5 fp=0xc0004867e0 sp=0xc000486750 pc=0x7f44305d2a25
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0004867e8 sp=0xc0004867e0 pc=0x7f4430625421
May 14 11:11:54 influxdb f7d421060a18[742]: created by runtime.gcBgMarkStartWorkers in goroutine 1
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1219 +0x1c
May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 5 [GC worker (idle)]:
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gopark(0x22c370d133d90?, 0x1?, 0x7?, 0x79?, 0x0?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:398 +0xce fp=0xc000086750 sp=0xc000086730 pc=0x7f44305f1d0e
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gcBgMarkWorker()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1295 +0xe5 fp=0xc0000867e0 sp=0xc000086750 pc=0x7f44305d2a25
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x7f4430625421
May 14 11:11:54 influxdb f7d421060a18[742]: created by runtime.gcBgMarkStartWorkers in goroutine 1
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1219 +0x1c
May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 20 [GC worker (idle), 2 minutes]:
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gopark(0x22c1e446c85de?, 0x3?, 0x84?, 0xcb?, 0x0?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:398 +0xce fp=0xc000080f50 sp=0xc000080f30 pc=0x7f44305f1d0e
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gcBgMarkWorker()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1295 +0xe5 fp=0xc000080fe0 sp=0xc000080f50 pc=0x7f44305d2a25
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x7f4430625421
May 14 11:11:54 influxdb f7d421060a18[742]: created by runtime.gcBgMarkStartWorkers in goroutine 1
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1219 +0x1c
May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 35 [GC worker (idle)]:
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gopark(0x22c3de1809b1a?, 0x1?, 0x65?, 0x7b?, 0x0?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:398 +0xce fp=0xc000486f50 sp=0xc000486f30 pc=0x7f44305f1d0e
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gcBgMarkWorker()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1295 +0xe5 fp=0xc000486fe0 sp=0xc000486f50 pc=0x7f44305d2a25
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000486fe8 sp=0xc000486fe0 pc=0x7f4430625421
May 14 11:11:54 influxdb f7d421060a18[742]: created by runtime.gcBgMarkStartWorkers in goroutine 1
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1219 +0x1c
May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 6 [GC worker (idle)]:
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gopark(0x7f44339d89b0?, 0xc0001540a0?, 0x1a?, 0x14?, 0x0?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:398 +0xce fp=0xc000086f50 sp=0xc000086f30 pc=0x7f44305f1d0e
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gcBgMarkWorker()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1295 +0xe5 fp=0xc000086fe0 sp=0xc000086f50 pc=0x7f44305d2a25
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000086fe8 sp=0xc000086fe0 pc=0x7f4430625421
May 14 11:11:54 influxdb f7d421060a18[742]: created by runtime.gcBgMarkStartWorkers in goroutine 1
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1219 +0x1c
May 14 11:11:54 influxdb f7d421060a18[742]: goroutine 21 [GC worker (idle)]:
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gopark(0x22c3de1807bf7?, 0x3?, 0x2c?, 0xa1?, 0x0?)
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/proc.go:398 +0xce fp=0xc000081750 sp=0xc000081730 pc=0x7f44305f1d0e
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.gcBgMarkWorker()
May 14 11:11:54 influxdb f7d421060a18[742]: #011/go/src/runtime/mgc.go:1295 +0xe5 fp=0xc0000817e0 sp=0xc000081750 pc=0x7f44305d2a25
May 14 11:11:54 influxdb f7d421060a18[742]: runtime.goexit()
```

My settings

```
"Env": [
    "INFLUXD_REPORTING_DISABLED=true",
    "INFLUXD_STORAGE_CACHE_SNAPSHOT_WRITE_COLD_DURATION=10m0s",
    "INFLUXD_STORAGE_COMPACT_FULL_WRITE_COLD_DURATION=1h0m0s",
    "INFLUXD_STORAGE_COMPACT_THROUGHPUT_BURST=80388608",
    "INFLUXD_STORAGE_MAX_CONCURRENT_COMPACTIONS=2",
    "INFLUXD_STORAGE_SERIES_FILE_MAX_CONCURRENT_SNAPSHOT_COMPACTIONS=2",
    "INFLUXDB_DATA_INDEX_VERSION=\"tsi1\"",
    "INFLUXDB_DATA_CACHE_SNAPSHOT_MEMORY_SIZE=\"200m\"",
    "INFLUXDB_DATA_MAX_INDEX_LOG_FILE_SIZE=10485760",
    "INFLUXDB_DATA_SERIES_ID_SET_CACHE_SIZE=100",
    "INFLUXD_QUERY_MEMORY_BYTES=304857600",
    "INFLUXD_QUERY_INITIAL_MEMORY_BYTES=10485760",
    "INFLUXD_QUERY_CONCURRENCY=5",
    "INFLUXD_STORAGE_CACHE_MAX_MEMORY_SIZE=1073741824",
    "INFLUXD_STORAGE_CACHE_SNAPSHOT_MEMORY_SIZE=262144000",
    "INFLUXD_QUERY_QUEUE_SIZE=100",
    "INFLUXD_FLUX_LOG_ENABLED=true",
    "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
    "INFLUXDB_VERSION=2.7.6",
    "INFLUX_CLI_VERSION=2.7.3",
    "INFLUX_CONFIGS_PATH=/etc/influxdb2/influx-configs",
    "INFLUXD_INIT_PORT=9999",
    "INFLUXD_INIT_PING_ATTEMPTS=600",
    "DOCKER_INFLUXDB_INIT_CLI_CONFIG_NAME=default"
],
```

polarnik commented 5 months ago

I set some of the compaction settings to a limit of 2:

My settings

| Setting | Default | My value | Description |
|---|---|---|---|
| `INFLUXD_STORAGE_COMPACT_FULL_WRITE_COLD_DURATION` | `4h0m0s` | `1h0m0s` | Duration at which the storage engine will compact all TSM files in a shard if it hasn't received writes or deletes. |
| `INFLUXD_STORAGE_COMPACT_THROUGHPUT_BURST` | `50331648` | `80388608` | Rate limit (in bytes per second) that TSM compactions can write to disk. |
| `INFLUXD_STORAGE_MAX_CONCURRENT_COMPACTIONS` | `0` | `2` | Maximum number of full and level compactions that can run concurrently. A value of 0 results in 50% of `runtime.GOMAXPROCS(0)` used at runtime. This setting does not apply to cache snapshotting. |
| `INFLUXD_STORAGE_SERIES_FILE_MAX_CONCURRENT_SNAPSHOT_COMPACTIONS` | `0` | `2` | Maximum number of snapshot compactions that can run concurrently across all series partitions in a database. |

I see duplicated compaction entries in the logs:

logs

```
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.257996Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=4 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.260277Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=6 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.260525Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=5 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.262446Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=7 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.264009Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=3 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.265600Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=6 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.267097Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=7 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.269065Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=1 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.260268Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=8 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.269662Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=8 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.270772Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=2 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.274354Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=4 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.283160Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=1 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.289673Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=5 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.291732Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=3 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.291732Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=2 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.298355Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=6 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
May 14 11:09:25 influxdb f7d421060a18[742]: ts=2024-05-14T11:09:25.305572Z lvl=info msg="TSI log compaction (start)" log_id=0p9aevYl000 service=storage-engine index=tsi tsi1_partition=4 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
```

Maybe the server has a race condition, because two threads work in parallel on the same file. I'm going to try:

```
INFLUXD_STORAGE_MAX_CONCURRENT_COMPACTIONS=1
INFLUXD_STORAGE_SERIES_FILE_MAX_CONCURRENT_SNAPSHOT_COMPACTIONS=1
```

I'm also going to effectively disable compaction by setting very long cold durations:

```
INFLUXD_STORAGE_CACHE_SNAPSHOT_WRITE_COLD_DURATION=1000d
INFLUXD_STORAGE_COMPACT_FULL_WRITE_COLD_DURATION=1000d
```
philjb commented 5 months ago

@polarnik - Setting the env var GOMEMLIMIT might help you. It sets a soft memory limit that makes the GC more aggressive as usage nears the limit. I don't expect it to be a cure-all, though. It became available in Go runtimes starting with 1.19, and InfluxDB 2.7+ is built with at least go1.20.
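As a sketch of how GOMEMLIMIT is commonly sized (the 80% ratio is a rule of thumb for leaving headroom for non-Go memory such as mmap'd files, not an InfluxDB recommendation):

```shell
# Derive a GOMEMLIMIT value from a container memory limit,
# leaving ~20% headroom. Assumes a hypothetical 4 GiB container.
container_limit_bytes=$((4 * 1024 * 1024 * 1024))
gomemlimit_bytes=$((container_limit_bytes * 80 / 100))
gomemlimit_mib=$((gomemlimit_bytes / 1024 / 1024))
echo "GOMEMLIMIT=${gomemlimit_mib}MiB"
# Then pass it to the container, e.g.:
#   docker run -m 4g -e GOMEMLIMIT=${gomemlimit_mib}MiB influxdb:2.7.6
```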

polarnik commented 5 months ago

@philjb

I will use GOMEMLIMIT and GC settings, and I have effectively disabled full compaction via INFLUXD_STORAGE_COMPACT_FULL_WRITE_COLD_DURATION:

env.list

```
GOMEMLIMIT=20GiB
GOGC=10
INFLUXD_REPORTING_DISABLED=true
INFLUXD_STORAGE_CACHE_SNAPSHOT_WRITE_COLD_DURATION=1000d
INFLUXD_STORAGE_COMPACT_FULL_WRITE_COLD_DURATION=1000d
INFLUXD_STORAGE_COMPACT_THROUGHPUT_BURST=80388608
INFLUXD_STORAGE_MAX_CONCURRENT_COMPACTIONS=1
INFLUXD_STORAGE_SERIES_FILE_MAX_CONCURRENT_SNAPSHOT_COMPACTIONS=1
INFLUXDB_DATA_INDEX_VERSION="tsi1"
INFLUXDB_DATA_CACHE_SNAPSHOT_MEMORY_SIZE="200m"
INFLUXDB_DATA_MAX_INDEX_LOG_FILE_SIZE=10485760
INFLUXDB_DATA_SERIES_ID_SET_CACHE_SIZE=100
INFLUXD_QUERY_MEMORY_BYTES=304857600
INFLUXD_QUERY_INITIAL_MEMORY_BYTES=10485760
INFLUXD_QUERY_CONCURRENCY=5
INFLUXD_STORAGE_CACHE_MAX_MEMORY_SIZE=1073741824
INFLUXD_STORAGE_CACHE_SNAPSHOT_MEMORY_SIZE=262144000
INFLUXD_QUERY_QUEUE_SIZE=100
INFLUXD_FLUX_LOG_ENABLED=false
```

I still use the hack with reindexing instead of compaction:

```
find /var/lib/influxdb2/engine/data/ -type d -name _series -exec rm -r {} + && \
find /var/lib/influxdb2/engine/data/ -type d -name index -exec rm -r {} +
```
the entire command line

`docker run --shm-size 2g -m 25GiB --user=influxdb --restart=on-failure --restart unless-stopped --entrypoint '/bin/bash' -d -p 8086:8086 --log-driver=syslog --name influx --env-file env.list -v /home/influxdb:/var/lib/influxdb2 influxdb:2.7.6-alpine --verbose -c "find /var/lib/influxdb2/engine/data/ -type d -name _series -exec rm -r {} + && find /var/lib/influxdb2/engine/data/ -type d -name index -exec rm -r {} + && /entrypoint.sh influxd"`

It works well.
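The reindexing hack above can be made a little safer by previewing what will be removed before deleting. This is a sketch exercised against a throwaway directory that mimics the engine layout (the `_series`/`index` names are from the commands above; the shard directory `0` is a hypothetical placeholder):

```shell
# Build a throwaway directory mimicking the engine data layout.
ENGINE=$(mktemp -d)/engine/data
mkdir -p "$ENGINE/db1/_series" "$ENGINE/db1/index" "$ENGINE/db1/0"

# 1. Dry run: print what would be removed, without deleting anything.
find "$ENGINE" -type d \( -name _series -o -name index \) -print

# 2. Remove series file and TSI index directories; per this thread,
#    influxd rebuilds them on the next startup.
find "$ENGINE" -type d -name _series -exec rm -r {} +
find "$ENGINE" -type d -name index -exec rm -r {} +
```

On a real host, replace `$ENGINE` with `/var/lib/influxdb2/engine/data` and run the dry-run step first while influxd is stopped.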

I saw a memory allocation error before the memory limit was reached:

memory

My memory limits are:

- 30 GiB on the host
- 25 GiB for the Docker container
- 20 GiB via [GOMEMLIMIT](https://pkg.go.dev/runtime)

The current memory allocation is 5-6 GiB. I never saw an allocation near 20 GiB; the most I observed was about 10 GiB, yet the container still ended up in a restart state.

I have too many influxdb threads (71):


I have calculated how many files they use:

```
lsof > /tmp/lsof.info2.txt
cat /tmp/lsof.info2.txt | grep influx | awk '{ print $11 }' > /tmp/files.txt
cat /tmp/files.txt | sort > /tmp/files.sorted.txt
cat /tmp/files.sorted.txt | uniq -c > /tmp/counts.txt
```

The 71 threads use about 73,000 file descriptors; there are about 5,183,000 file descriptors in total.
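The four-step counting above can be collapsed into one pipeline. The sketch below runs it against a synthetic lsof-style capture (field 11 holds the path, as in the commands above; the file names are made up); on a live host, feed it `lsof | grep influx` instead:

```shell
# Synthetic stand-in for `lsof | grep influx` output.
lsof_sample=$(mktemp)
cat > "$lsof_sample" <<'EOF'
influxd 742 influxdb 10r REG 8,1 4096 123 f9 f10 /data/engine/a.tsm
influxd 742 influxdb 11r REG 8,1 4096 124 f9 f10 /data/engine/a.tsm
influxd 742 influxdb 12r REG 8,1 4096 125 f9 f10 /data/engine/b.tsm
EOF

# Count how many descriptors point at each file, most-referenced first.
counts=$(awk '{ print $11 }' "$lsof_sample" | sort | uniq -c | sort -rn)
echo "$counts"
```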

The influxdb process and all its threads (green threads) could be hitting some limit, but it might not be the memory limit; it could be a virtual memory limit or a file descriptor limit.

What do you think? Is there an environment variable that controls the number of threads? The default seems to be about 70. Is it possible to reduce it?

philjb commented 5 months ago

I only skimmed through your response. You can set GOMAXPROCS to limit the number of OS threads, but I believe the entries showing in htop are golang's green threads (goroutines), and you can't limit those; see https://pkg.go.dev/runtime. I don't think the number of green threads should be an issue, and I'm not aware of a Linux limit on them.

Influxdb can use a lot of file descriptors - you can raise it with ulimit (as you probably know).

polarnik commented 5 months ago

My current recipe

  1. Remove indexes
  2. Disable index compactions

The custom command

```
docker run --shm-size 2g --user=influxdb --restart=on-failure --restart unless-stopped --entrypoint '/bin/bash' -d -p 8086:8086 --log-driver=syslog --name influx --env-file env.list -v /home/influxdb:/var/lib/influxdb2 influxdb:2.7.6-alpine --verbose -c "find /var/lib/influxdb2/engine/data/ -type d -name index -exec rm -r {} + && /entrypoint.sh influxd"
```

and the custom config

```
INFLUXD_STORAGE_COMPACT_FULL_WRITE_COLD_DURATION=48h
INFLUXD_STORAGE_SERIES_ID_SET_CACHE_SIZE=0
```

work well. The Docker container now restarts only every 48 hours. I started it at night, so it restarts at night again 48 hours later, which is a convenient time for restarts. A restart takes 1-2 minutes.

The option INFLUXD_STORAGE_COMPACT_FULL_WRITE_COLD_DURATION=1000d didn't work well: it behaved as if it were set to 3h.

env.list

```
GOMEMLIMIT=25GiB
GOGC=10
INFLUXD_REPORTING_DISABLED=true
INFLUXD_STORAGE_CACHE_SNAPSHOT_WRITE_COLD_DURATION=10m
INFLUXD_STORAGE_COMPACT_FULL_WRITE_COLD_DURATION=48h
INFLUXD_STORAGE_COMPACT_THROUGHPUT_BURST=80388608
INFLUXD_STORAGE_MAX_CONCURRENT_COMPACTIONS=1
INFLUXD_STORAGE_SERIES_FILE_MAX_CONCURRENT_SNAPSHOT_COMPACTIONS=1
INFLUXD_QUERY_MEMORY_BYTES=304857600
INFLUXD_QUERY_INITIAL_MEMORY_BYTES=10485760
INFLUXD_QUERY_CONCURRENCY=5
INFLUXD_STORAGE_CACHE_MAX_MEMORY_SIZE=1073741824
INFLUXD_STORAGE_CACHE_SNAPSHOT_MEMORY_SIZE=26214400
INFLUXD_STORAGE_WAL_MAX_WRITE_DELAY=10m
INFLUXD_STORAGE_WRITE_TIMEOUT=10s
INFLUXD_STORAGE_WAL_MAX_CONCURRENT_WRITES=6
INFLUXD_STORAGE_SERIES_ID_SET_CACHE_SIZE=0
INFLUXD_QUERY_QUEUE_SIZE=100
INFLUXD_FLUX_LOG_ENABLED=false
```

```
docker run --shm-size 2g --user=influxdb --restart=on-failure --restart unless-stopped -d -p 8086:8086 --log-driver=syslog --name influx --env-file env.list -v /home/influxdb:/var/lib/influxdb2 influxdb:2.7.6-alpine
```

With the simple command I get a memory allocation error. It occurs when only about 25% of memory is allocated (6 GiB of 25-30 GiB). The server and the Docker container still have available RAM, but the "Open TSI" operations hit the allocation error.

There is a solution: https://github.com/influxdata/influxdb/issues/23246

sysctl -w vm.max_map_count=262144
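A sketch of a pre-flight check for that limit. influxd memory-maps many TSM and index files, so a low `vm.max_map_count` can surface as an allocation error even when RAM is free; the 262144 target and the `/etc/sysctl.d` persistence path below are assumptions taken from the linked issue and common sysctl practice:

```shell
# Check vm.max_map_count against a target before starting influxd.
target=262144
current=$(cat /proc/sys/vm/max_map_count 2>/dev/null || echo 0)
if [ "$current" -lt "$target" ]; then
    echo "vm.max_map_count=$current is below $target; raise it with:"
    echo "  sysctl -w vm.max_map_count=$target"
    echo "  # persist across reboots:"
    echo "  echo 'vm.max_map_count=$target' > /etc/sysctl.d/90-influxdb.conf"
else
    echo "vm.max_map_count=$current is OK"
fi
```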