influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

2.0.6 suddenly stopped responding because of too many open files #24057

Open barrown opened 1 year ago

barrown commented 1 year ago

I have had InfluxDB 2.0.6 running quite happily on a Raspberry Pi for nearly 2 years, but this morning I noticed it had stopped responding. Even after a reboot it is not happy. I can't connect via the web interface (it just shows the swirling circle), the HTTP API, or the influx CLI (e.g. influx bucket list just hangs).

Can anyone advise on further steps to try? influxd seems to start up fine without error.

Environment info:

Linux 5.10.17-v8+ aarch64
InfluxDB 2.0.6 (git: 4db98b4c9a) build_date: 2021-04-29T16:48:12Z
Database files are all on an SSD drive

Config: /usr/local/bin/influxd --bolt-path=/ssd/influx/influxd.bolt --engine-path=/ssd/influx/engine --reporting-disabled --storage-retention-check-interval=24h --log-level=debug

Logs:

Jan 23 09:36:01 hass systemd[1]: Started influxdb.
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.784207Z lvl=info msg="Welcome to InfluxDB" log_id=0fZVcsml000 version=2.0.6 commit=4db98b4c9a build_date=2021-04-29T16:48:12Z
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.788758Z lvl=info msg="Resources opened" log_id=0fZVcsml000 service=bolt path=/ssd/influx/influxd.bolt
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.811082Z lvl=debug msg="buckets find" log_id=0fZVcsml000 store=new took=0.371ms
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.811169Z lvl=info msg="Checking InfluxDB metadata for prior version." log_id=0fZVcsml000 bolt_path=/ssd/influx/influxd.bolt
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.811754Z lvl=info msg="Using data dir" log_id=0fZVcsml000 service=storage-engine service=store path=/ssd/influx/engine/data
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.811825Z lvl=info msg="Compaction settings" log_id=0fZVcsml000 service=storage-engine service=store max_concurrent_compactions=2 throughput_bytes_per_second=50331648 throug
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.811855Z lvl=info msg="Open store (start)" log_id=0fZVcsml000 service=storage-engine service=store op_name=tsdb_open op_event=start
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.883222Z lvl=info msg="index opened with 8 partitions" log_id=0fZVcsml000 service=storage-engine index=tsi
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.930022Z lvl=info msg="index opened with 8 partitions" log_id=0fZVcsml000 service=storage-engine index=tsi
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.932094Z lvl=info msg="index opened with 8 partitions" log_id=0fZVcsml000 service=storage-engine index=tsi
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.945803Z lvl=info msg="index opened with 8 partitions" log_id=0fZVcsml000 service=storage-engine index=tsi
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.956105Z lvl=info msg="Opened file" log_id=0fZVcsml000 service=storage-engine engine=tsm1 service=filestore path=/ssd/influx/engine/data/540a390190697254/autogen/1047/00000
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.963548Z lvl=info msg="Opened file" log_id=0fZVcsml000 service=storage-engine engine=tsm1 service=filestore path=/ssd/influx/engine/data/540a390190697254/autogen/111/000000
Jan 23 09:36:35 hass influxd[1204]: ts=2023-01-23T09:36:35.983334Z lvl=info msg="Opened file" log_id=0fZVcsml000 service=storage-engine engine=tsm1 service=filestore path=/ssd/influx/engine/data/540a390190697254/autogen/1081/00000
Jan 23 09:36:36 hass influxd[1204]: ts=2023-01-23T09:36:36.055497Z lvl=info msg="Opened shard" log_id=0fZVcsml000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/ssd/influx/engine/data/514933d6df66d
Jan 23 09:36:36 hass influxd[1204]: ts=2023-01-23T09:36:36.055515Z lvl=info msg="Opened shard" log_id=0fZVcsml000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/ssd/influx/engine/data/540a390190697
Jan 23 09:36:36 hass influxd[1204]: ts=2023-01-23T09:36:36.055532Z lvl=info msg="Opened shard" log_id=0fZVcsml000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/ssd/influx/engine/data/540a390190697
Jan 23 09:36:36 hass influxd[1204]: ts=2023-01-23T09:36:36.055500Z lvl=info msg="Opened shard" log_id=0fZVcsml000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/ssd/influx/engine/data/540a390190697
Jan 23 09:36:36 hass influxd[1204]: ts=2023-01-23T09:36:36.091027Z lvl=info msg="index opened with 8 partitions" log_id=0fZVcsml000 service=storage-engine index=tsi
Jan 23 09:36:36 hass influxd[1204]: ts=2023-01-23T09:36:36.102203Z lvl=info msg="index opened with 8 partitions" log_id=0fZVcsml000 service=storage-engine index=tsi

barrown commented 1 year ago

Jan 23 17:51:16 hass influxd[741]: ts=2023-01-23T17:51:16.668081Z lvl=info msg="http: Accept error: accept tcp [::]:8086: accept4: too many open files; retrying in 1s" log_id=0fZwjjd0000 service=http
Jan 23 17:51:17 hass influxd[741]: ts=2023-01-23T17:51:17.668457Z lvl=info msg="http: Accept error: accept tcp [::]:8086: accept4: too many open files; retrying in 1s" log_id=0fZwjjd0000 service=http
Jan 23 17:51:18 hass influxd[741]: ts=2023-01-23T17:51:18.669742Z lvl=info msg="http: Accept error: accept tcp [::]:8086: accept4: too many open files; retrying in 5ms" log_id=0fZwjjd0000 service=http
Jan 23 17:51:18 hass influxd[741]: ts=2023-01-23T17:51:18.673745Z lvl=debug msg="user find by ID" log_id=0fZwjjd0000 store=new took=0.239ms
Jan 23 17:51:18 hass influxd[741]: ts=2023-01-23T17:51:18.675510Z lvl=info msg="http: Accept error: accept tcp [::]:8086: accept4: too many open files; retrying in 10ms" log_id=0fZwjjd0000 service=http
Jan 23 17:51:18 hass influxd[741]: ts=2023-01-23T17:51:18.676462Z lvl=debug msg=Request log_id=0fZwjjd0000 service=http method=GET host=localhost:8086 path=/api/v2/backup/kv query= proto=HTTP/1.1 status_code=500 response_size=68 content_length=0 referrer= remote=[::1]:38928 user_agent=Go-http-client took=5.561ms error="internal error" error_code="internal error" body=
Jan 23 17:51:18 hass influxd[741]: ts=2023-01-23T17:51:18.687324Z lvl=info msg="http: Accept error: accept tcp [::]:8086: accept4: too many open files; retrying in 20ms" log_id=0fZwjjd0000 service=http
Jan 23 17:51:18 hass influxd[741]: ts=2023-01-23T17:51:18.690145Z lvl=debug msg="is onboarding" log_id=0fZwjjd0000 handler=onboard took=0.313ms
Jan 23 17:51:18 hass influxd[741]: ts=2023-01-23T17:51:18.690259Z lvl=debug msg="Onboarding eligibility check finished" log_id=0fZwjjd0000 result=false
Jan 23 17:51:18 hass influxd[741]: ts=2023-01-23T17:51:18.690853Z lvl=debug msg=Request log_id=0fZwjjd0000 service=http method=GET host=localhost:8086 path=/api/v2/setup query= proto=HTTP/1.1 status_code=200 response_size=21 content_length=0 referrer= remote=[::1]:38928 user_agent=influx took=1.325ms body=

Because of the "too many open files" error I increased the limit. Now I can get to the web interface and view some of the data, but not actually perform a backup.

Output from ulimit -a:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 6362
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1000000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 95
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 6362
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
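
Note that ulimit -a reports the limits of the current shell, which may differ from the limits applied to a service started by systemd. A minimal sketch, assuming a Linux /proc filesystem, for comparing the limit the running influxd process actually has with the number of file descriptors it currently holds. The function names are illustrative and not part of any InfluxDB tooling.

```python
import os
import sys


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: Max open files <soft> <hard> files
            soft, hard = line.split()[3:5]
            # Values are integers or the literal string "unlimited".
            convert = lambda v: int(v) if v.isdigit() else v
            return convert(soft), convert(hard)
    return None


def count_open_fds(pid):
    """Count entries in /proc/<pid>/fd (usually needs root for another user's process)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    # Usage: python3 check_fds.py <pid of influxd>
    if len(sys.argv) == 2 and sys.argv[1].isdigit():
        pid = int(sys.argv[1])
        with open(f"/proc/{pid}/limits") as fh:
            print("max open files (soft, hard):", parse_max_open_files(fh.read()))
        print("currently open:", count_open_fds(pid))
```

If the soft limit printed here is much lower than the value shown by ulimit -a, the raised limit has probably not reached the service itself.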

jeffreyssmith2nd commented 1 year ago

How many shards (tsm files) do you have? You can run influxd inspect report-tsm and it should give you a summary.

When you say you cannot perform a backup, what do you mean exactly?

barrown commented 1 year ago

Thanks for your reply!

By backup I mean running "influx backup" with my token, which results in:

2023-01-24T10:53:04.520822Z     info    Backing up KV store     {"log_id": "0f_rPOY0000", "path": "/ssd/influx/backups/backup_2023-01-24_10-53/20230124T105304Z.bolt"}
Error: Failed to download KV backup: An internal error has occurred.

"journalctl -u influxdb.service" reveals:

Jan 24 10:49:48 hass influxd[3191]: ts=2023-01-24T10:49:48.051374Z lvl=info msg="http: Accept error: accept tcp [::]:8086: accept4: too many open files; retrying in 5ms" log_id=0f_qxX80000 service=http
Jan 24 10:49:48 hass influxd[3191]: ts=2023-01-24T10:49:48.053074Z lvl=debug msg="user find by ID" log_id=0f_qxX80000 store=new took=0.134ms
Jan 24 10:49:48 hass influxd[3191]: ts=2023-01-24T10:49:48.054286Z lvl=debug msg=Request log_id=0f_qxX80000 service=http method=GET host=localhost:8086 path=/api/v2/backup/kv query= proto=HTTP/1.1 status_code=500 response_size=68 content_length=
Jan 24 10:49:48 hass influxd[3191]: ts=2023-01-24T10:49:48.056743Z lvl=info msg="http: Accept error: accept tcp [::]:8086: accept4: too many open files; retrying in 10ms" log_id=0f_qxX80000 service=http
Jan 24 10:49:48 hass influxd[3191]: ts=2023-01-24T10:49:48.057355Z lvl=debug msg="is onboarding" log_id=0f_qxX80000 handler=onboard took=0.280ms
Jan 24 10:49:48 hass influxd[3191]: ts=2023-01-24T10:49:48.057439Z lvl=debug msg="Onboarding eligibility check finished" log_id=0f_qxX80000 result=false
Jan 24 10:49:48 hass influxd[3191]: ts=2023-01-24T10:49:48.057970Z lvl=debug msg=Request log_id=0f_qxX80000 service=http method=GET host=localhost:8086 path=/api/v2/setup query= proto=HTTP/1.1 status_code=200 response_size=21 content_length=0 re
Jan 24 10:49:48 hass influxd[3191]: ts=2023-01-24T10:49:48.067161Z lvl=info msg="http: Accept error: accept tcp [::]:8086: accept4: too many open files; retrying in 20ms" log_id=0f_qxX80000 service=http

In the end I managed to run "influxd inspect export-lp" with a date range of one month, so at least I have my data safely out now.
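
For anyone wanting to export in one-month chunks like this, here is a rough sketch that only generates the month-sized time windows as RFC 3339 timestamps. The helper names are made up for illustration, and how the start and end values are passed to export-lp is not shown; check influxd inspect export-lp --help on your version.

```python
from datetime import datetime, timezone


def month_windows(start, end):
    """Yield (window_start, window_end) pairs covering [start, end) in calendar-month steps."""
    cur = start
    while cur < end:
        # Midnight on the first day of the month after `cur`.
        year, month = (cur.year + 1, 1) if cur.month == 12 else (cur.year, cur.month + 1)
        nxt = cur.replace(year=year, month=month, day=1,
                          hour=0, minute=0, second=0, microsecond=0)
        yield cur, min(nxt, end)
        cur = nxt


def rfc3339(dt):
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    # Example dates only; substitute the range your data actually covers.
    first = datetime(2022, 11, 15, tzinfo=timezone.utc)
    last = datetime(2023, 1, 24, tzinfo=timezone.utc)
    for w_start, w_end in month_windows(first, last):
        print(rfc3339(w_start), rfc3339(w_end))
```

Exporting one window at a time keeps each run small, which may help on a constrained machine such as a Raspberry Pi.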

"influxd inspect report-tsm" only became available in 2.1 and I'm on 2.0. I have been trying to upgrade for a while now but could never manage the backup-restore to a newer version; see #23212.
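
Without report-tsm, a rough count of TSM files per shard can be had by walking the data directory directly. This is only a sketch: it assumes the data/<bucket id>/autogen/<shard id>/ layout seen in the "Opened file" log lines above and that TSM files end in .tsm. The function name is illustrative.

```python
import sys
from collections import Counter
from pathlib import Path


def count_tsm_files(data_dir):
    """Map each directory that holds .tsm files (one per shard) to its file count."""
    counts = Counter()
    for tsm in Path(data_dir).rglob("*.tsm"):
        counts[str(tsm.parent)] += 1
    return counts


if __name__ == "__main__":
    # Usage: python3 count_tsm.py /ssd/influx/engine/data
    if len(sys.argv) == 2:
        counts = count_tsm_files(sys.argv[1])
        for shard_dir, n in sorted(counts.items()):
            print(f"{n:6d}  {shard_dir}")
        print(f"{len(counts)} shard directories, {sum(counts.values())} tsm files in total")
```

Each open TSM file consumes file descriptors, so a large total here would be consistent with the "too many open files" errors.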