influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

influxdb refuse connections - out of memory #24224

Open torsj opened 1 year ago

torsj commented 1 year ago

InfluxDB started refusing connections after running on a raspi 4 for more than two years. When Grafana, my MQTT bridge, or the local influx command tried to connect to port 8086, the connection was refused with the same error message, "Failed to connect to localhost:8086" etc. For a few days before influx crashed, my bridge program got timeouts when writing to influx, and Grafana was slow.

I then started influxdb from the command line like this to get the full log output:

    sudo killall influxd
    sudo /usr/bin/influxd -config /etc/influxdb/influxdb.conf 2>/tmp/startup.txt

In the log file I found this:

    ts=2023-05-04T06:46:22.353691Z lvl=info msg="Reading file" log_id=0haO1Kjl000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/nexahack_test/autogen/548/_10000.wal size=438
    ts=2023-05-04T06:46:22.353845Z lvl=info msg="Reading file" log_id=0haO1Kjl000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/nexahack_test/autogen/548/_100000.wal size=5826
    runtime: out of memory: cannot allocate 2116255744-byte block (91488256 in use)
    fatal error: out of memory

    runtime stack:
    runtime.throw(0x8ca717, 0xd)
        /usr/lib/go-1.11/src/runtime/panic.go:608 +0x5c
    runtime.largeAlloc(0x7e237278, 0x20380101, 0xb6d3e000)
        /usr/lib/go-1.11/src/runtime/malloc.go:1021 +0x120
    runtime.mallocgc.func1()
        /usr/lib/go-1.11/src/runtime/malloc.go:914 +0x38
    runtime.systemstack(0x152)
        /usr/lib/go-1.11/src/runtime/asm_arm.s:354 +0x84
    runtime.mstart()
        /usr/lib/go-1.11/src/runtime/proc.go:1229

startup.txt

Note that the allocation request is for 2.1 gigabytes, while only about 91 megabytes were in use.
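To put the numbers from the log in perspective, here is a minimal Python sketch that pulls the two byte counts out of the Go runtime's out-of-memory line and converts them. The helper name `parse_go_oom` is made up for illustration. The comparison against a 32-bit address space is an assumption based on the stack trace referencing `asm_arm.s`, which suggests a 32-bit ARM build of influxd; the thread itself does not state this.

```python
import re

# Matches the Go runtime message seen in the log above.
OOM_RE = re.compile(
    r"out of memory: cannot allocate (\d+)-byte block \((\d+) in use\)"
)


def parse_go_oom(line):
    """Return (requested_bytes, in_use_bytes), or None if the line does not match."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = "runtime: out of memory: cannot allocate 2116255744-byte block (91488256 in use)"
requested, in_use = parse_go_oom(line)

GIB = 2**30
MIB = 2**20
print(f"requested: {requested / GIB:.2f} GiB")
print(f"in use:    {in_use / MIB:.0f} MiB")
# Assumption: a 32-bit process can address at most 2**32 bytes in total.
print(f"share of a 32-bit address space: {requested / 2**32:.0%}")
```

The single request is roughly 1.97 GiB, about half of everything a 32-bit process could address even in the best case, and it comes right after reading a WAL file of only 5826 bytes. That mismatch is what makes a bad length value in the file a plausible suspect, though the log alone does not prove it.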

The complete 2.5 gigabyte database is available.

The Linux kernel version is 5.10.103-v7l+ (buster) on a raspi 4. InfluxDB is version 1.6.4.

There are several other issues on GitHub around this theme, and most of them involve a raspi. The SD card on the raspi is vulnerable to corruption on loss of power. Is it possible that the cause of this problem is that a data file has been corrupted? The SD card on my computer passes an fsck.
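One way to test the corruption theory without starting influxd is to walk the entry headers of a WAL segment and check whether any entry claims to be longer than the file that contains it. The sketch below ASSUMES that each entry in an InfluxDB 1.x WAL segment starts with a 1-byte entry type followed by a 4-byte big-endian length, then the payload. That layout is not confirmed anywhere in this thread, so check it against the source of your InfluxDB version before trusting the result. `find_bad_wal_entry` is a hypothetical helper, not part of any InfluxDB tool.

```python
import struct

# ASSUMED entry header: 1-byte entry type, then 4-byte big-endian payload length.
HEADER = struct.Struct(">BI")


def find_bad_wal_entry(data):
    """Walk entry headers in the bytes of one WAL segment.

    Returns None if every declared length fits inside the data.
    Returns (offset, declared_length) for the first entry whose payload
    would run past the end, or (offset, None) for a truncated header.
    """
    offset = 0
    while offset < len(data):
        if len(data) - offset < HEADER.size:
            return offset, None  # not enough bytes left for a full header
        _entry_type, length = HEADER.unpack_from(data, offset)
        body_start = offset + HEADER.size
        if length > len(data) - body_start:
            return offset, length  # declared payload overruns the file
        offset = body_start + length
    return None


# Usage, with the path taken from the log above:
#   with open("/var/lib/influxdb/wal/nexahack_test/autogen/548/_100000.wal", "rb") as f:
#       print(find_bad_wal_entry(f.read()))
```

This only checks the length fields. It does not decompress or validate payloads, so a clean result does not prove the file is intact, but a declared length in the gigabyte range inside a file of a few kilobytes would point strongly at a damaged segment.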

I 'solved' the problem by starting over with an empty database, losing all my data. The upside is that this made me move influx and grafana to a more powerful computer, as the raspi4 is rather slow for this purpose.

This problem is probably reproducible by using my database on another computer with influx. I have not tried this.

The fix would be for influx to report the actual problem, i.e. out of memory or possibly a corrupted file. systemctl status only reported that influx had restarted too many times.

pkkrusty commented 1 year ago

I'm in the same boat. InfluxDB had been running for around two years and now suddenly can't start, failing with an OOM error. Not a huge database, 700 series. Already running the tsi index rather than inmem. Rather frustrating. Let me know if you have any solutions other than starting from scratch. Wish I could trust InfluxDB, but behavior like this is why I need to keep a second copy of important data in a MySQL database.

pkkrusty commented 1 year ago

My databases totaled around 4 GB, with 2.6 GB in autogen. I deleted 1.5 GB of old shards and InfluxDB started right up.
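For anyone wanting to try the same workaround, a rough Python sketch for seeing which shards take up the most disk space follows. It assumes the InfluxDB 1.x data directory is laid out as `<data_dir>/<database>/<retention_policy>/<shard_id>/` and skips entries starting with an underscore (such as `_series`), on the assumption that those are not retention policies. The `/var/lib/influxdb/data` path in the usage comment is a guess based on the WAL path in the log above and depends on your influxdb.conf. `shard_sizes` is a made-up helper name.

```python
import os


def shard_sizes(data_dir):
    """Return [(size_bytes, database, retention_policy, shard_id), ...], largest first.

    ASSUMES the layout data_dir/<database>/<retention_policy>/<shard_id>/.
    """
    results = []
    for db in os.listdir(data_dir):
        db_path = os.path.join(data_dir, db)
        if not os.path.isdir(db_path):
            continue
        for rp in os.listdir(db_path):
            rp_path = os.path.join(db_path, rp)
            # Assumption: underscore-prefixed entries are internal, not retention policies.
            if rp.startswith("_") or not os.path.isdir(rp_path):
                continue
            for shard in os.listdir(rp_path):
                shard_path = os.path.join(rp_path, shard)
                if not os.path.isdir(shard_path):
                    continue
                total = 0
                for root, _dirs, files in os.walk(shard_path):
                    for name in files:
                        total += os.path.getsize(os.path.join(root, name))
                results.append((total, db, rp, shard))
    results.sort(reverse=True)
    return results


# Usage (path is an assumption, check the [data] section of influxdb.conf):
#   for size, db, rp, shard in shard_sizes("/var/lib/influxdb/data"):
#       print(f"{size / 2**20:8.1f} MiB  {db}/{rp}/{shard}")
```

The script only reads sizes and deletes nothing. If influxd can be started, removing a shard through InfluxQL with `DROP SHARD <shard_id>` is likely safer than deleting directories by hand. The comment above does not say which method was used.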