influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Obscure error after influxdb upgrade 1.8 -> 2.7 #24723

Closed henningWoehr closed 3 months ago

henningWoehr commented 7 months ago

Hi, we are currently upgrading a production InfluxDB instance from 1.8.10 to 2.7.5. Before upgrading, I tested the whole process in a local test VM, and it all worked fine. The main difference from the production VM is that I didn't use all the databases, because copying them would have taken too long. I then used the following commands to upgrade InfluxDB:

  1. sudo systemctl stop influxdb
  2. sudo apt-get update
  3. sudo apt-get upgrade
  4. sudo apt-get install influxdb2
  5. sudo nano /etc/default/influxdb2 (Changed config env to '/var/lib/influxdb/.influxdbv2/config.toml')
  6. sudo mkdir /datadrive/influxdb2
  7. sudo chown influxdb:influxdb /datadrive/influxdb2
  8. sudo nano /etc/systemd/system/influxd.service (Changed 'LimitNOFILE' to a higher number)
  9. sudo -u influxdb influxd upgrade -e /datadrive/influxdb2/engine -m /datadrive/influxdb2/influxd.bolt
  10. sudo systemctl start influxdb

After upgrading, InfluxDB started up and all data was available as expected. After some time, InfluxDB crashed with a strange stack trace and no clear error message.

This is just a sample; the full log is available as a gist, with log lines about unauthorized access and queries removed.

Feb 29 19:48:32 biogasboard1eu influxd-systemd-start.sh[19059]: ts=2024-02-29T19:48:32.707739Z lvl=info msg="TSI log compaction (end)" log_id=0neLCLPW000 service=storage-engine index=tsi tsi1_partition=2 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=13857.850ms
Feb 29 19:48:32 biogasboard1eu influxd-systemd-start.sh[19059]: ts=2024-02-29T19:48:32.707772Z lvl=info msg="TSI log compaction (end)" log_id=0neLCLPW000 service=storage-engine index=tsi tsi1_partition=3 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=12298.524ms
Feb 29 19:49:02 biogasboard1eu systemd-journald[487]: Suppressed 1174405 messages from influxdb.service
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]:         /root/project/tsdb/index/tsi1/partition.go:960 +0x119 fp=0xc215c01fc8 sp=0xc215c01f30 pc=0x7fe5fa25b059
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]: github.com/influxdata/influxdb/v2/tsdb/index/tsi1.(*Partition).Open.func3()
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]:         /root/project/tsdb/index/tsi1/partition.go:254 +0x26 fp=0xc215c01fe0 sp=0xc215c01fc8 pc=0x7fe5fa2553c6
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]: runtime.goexit()
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]:         /go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc215c01fe8 sp=0xc215c01fe0 pc=0x7fe5f94ca7a1
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]: created by github.com/influxdata/influxdb/v2/tsdb/index/tsi1.(*Partition).Open
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]:         /root/project/tsdb/index/tsi1/partition.go:254 +0xbdc
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]: goroutine 741872 [select, 56 minutes]:
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]: runtime.gopark(0xc219064f90?, 0x2?, 0xc0?, 0x5a?, 0xc219064f6c?)
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]:         /go/src/runtime/proc.go:381 +0xd6 fp=0xc219064df0 sp=0xc219064dd0 pc=0x7fe5f9496e56
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]: runtime.selectgo(0xc219064f90, 0xc219064f68, 0xc21ffdd248?, 0x0, 0xc219064f68?, 0x1)
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]:         /go/src/runtime/select.go:327 +0x7be fp=0xc219064f30 sp=0xc219064df0 pc=0x7fe5f94a751e
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]: github.com/influxdata/influxdb/v2/tsdb/index/tsi1.(*Partition).runPeriodicCompaction(0xc2200e86c0)
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]:         /root/project/tsdb/index/tsi1/partition.go:960 +0x119 fp=0xc219064fc8 sp=0xc219064f30 pc=0x7fe5fa25b059
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]: github.com/influxdata/influxdb/v2/tsdb/index/tsi1.(*Partition).Open.func3()
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]:         /root/project/tsdb/index/tsi1/partition.go:254 +0x26 fp=0xc219064fe0 sp=0xc219064fc8 pc=0x7fe5fa2553c6
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]: runtime.goexit()

Steps to reproduce: Can't think of a way to reproduce. Currently trying to test the upgrade with full data in my test VM.

Expected behaviour: Run as normal

Actual behaviour: Error described above

Environment info:

Config: The only change in the config is that the data is stored in a different path than the default.

henningWoehr commented 7 months ago

I just tested the whole upgrade process with all the data in a test VM and got the same errors, but this time I noticed some log lines about memory allocation. When restarting and watching the memory in htop, I can see usage rise to about 8 GB and then drop, because InfluxDB has crashed.

Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.901514Z lvl=error msg="Cannot read corrupt tsm file, renaming" log_id=0nebSz9l000 service=storage-engine engine=tsm1 service=filestore path=/datadrive/influxdb2/engine/data/993a17731a4724e8/autogen/1887/000000008-000000002.tsm id=0 error="cannot allocate memory"
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.901570Z lvl=error msg="Failed to open shard" log_id=0nebSz9l000 service=storage-engine service=store op_name=tsdb_open db_shard_id=1887 error="[shard 1887] cannot read corrupt file /datadrive/influxdb2/engine/data/993a17731a4724e8/autogen/1887/000000008-000000002.tsm: cannot allocate memory"
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.904493Z lvl=info msg="TSI log compaction (end)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=6 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=134.631ms
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.904528Z lvl=info msg="TSI log compaction (end)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=5 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=138.782ms
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.904552Z lvl=info msg="TSI log compaction (end)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=4 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=117.670ms
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.905670Z lvl=info msg="TSI log compaction (end)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=3 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=132.550ms
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.906022Z lvl=info msg="Log file compacted" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=7 op_name=tsi1_compact_log_file tsi1_log_file_id=1 elapsed=119ms bytes=4320 kb_per_sec=35
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.907656Z lvl=error msg="Cannot open compacted index file" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=1 op_name=tsi1_compact_log_file tsi1_log_file_id=1 error="cannot allocate memory" path=/datadrive/influxdb2/engine/data/993a17731a4724e8/autogen/1999/index/0/L1-00000001.tsi
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.907668Z lvl=info msg="TSI log compaction (end)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=1 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=41.329ms
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.907693Z lvl=info msg="TSI log compaction (start)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=1 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.907724Z lvl=error msg="Cannot open compacted index file" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=6 op_name=tsi1_compact_log_file tsi1_log_file_id=1 error="cannot allocate memory" path=/datadrive/influxdb2/engine/data/993a17731a4724e8/autogen/1999/index/5/L1-00000001.tsi
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.907732Z lvl=info msg="TSI log compaction (end)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=6 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=41.310ms
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.907750Z lvl=info msg="TSI log compaction (start)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=6 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.912702Z lvl=info msg="TSI log compaction (end)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=7 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=126.318ms
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.912729Z lvl=info msg="TSI log compaction (end)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=5 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=140.578ms
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.907536Z lvl=error msg="Cannot open compacted index file" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=4 op_name=tsi1_compact_log_file tsi1_log_file_id=1 error="cannot allocate memory" path=/datadrive/influxdb2/engine/data/993a17731a4724e8/autogen/16570/index/3/L1-00000001.tsi
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.912748Z lvl=info msg="TSI log compaction (end)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=4 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=47.034ms
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.912767Z lvl=info msg="TSI log compaction (start)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=4 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=start
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.907535Z lvl=error msg="Cannot open compacted index file" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=5 op_name=tsi1_compact_log_file tsi1_log_file_id=1 error="cannot allocate memory" path=/datadrive/influxdb2/engine/data/993a17731a4724e8/autogen/1999/index/4/L1-00000001.tsi
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.914721Z lvl=info msg="TSI log compaction (end)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=3 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=144.798ms
Feb 29 19:53:32 biogasboard1eu influxd-systemd-start.sh[4864]: ts=2024-02-29T19:53:32.914726Z lvl=info msg="TSI log compaction (end)" log_id=0nebSz9l000 service=storage-engine index=tsi tsi1_partition=5 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=48.395ms
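
With this much compaction noise, the error lines are easy to miss. The following is a minimal sketch, not part of InfluxDB, that pulls the key=value fields out of journal lines like the ones above and counts the error messages. The function names `parse_logfmt` and `error_summary` are made up for illustration, and the parsing assumes every value is either a bare token or a double-quoted string, as in the excerpt.

```python
import re
import shlex
from collections import Counter

# The influxd fields start at "ts="; everything before it is the journald prefix.
FIELDS_START = re.compile(r"\bts=")


def parse_logfmt(line):
    """Return the key=value fields of one influxd log line as a dict.

    Lines without a "ts=" field (for example Go stack trace lines) yield {}.
    If a key appears twice (such as "service"), the last value wins.
    """
    match = FIELDS_START.search(line)
    if match is None:
        return {}
    try:
        tokens = shlex.split(line[match.start():])
    except ValueError:
        # Unbalanced quotes: treat the line as unparseable.
        return {}
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def error_summary(lines):
    """Count how often each msg occurs among lines with lvl=error."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("lvl") == "error":
            counts[fields.get("msg", "")] += 1
    return counts
```

Feeding the saved journal output through `error_summary` gives a quick count of how many "Cannot open compacted index file" and "Failed to open shard" errors occurred, and `parse_logfmt(line)["path"]` shows which files were affected.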

weshouck32 commented 7 months ago

@henningWoehr I'm facing the same issue. I think the databases are getting corrupted during migration. I tried adding more RAM, but that did not help either. Were you able to solve the issue?

henningWoehr commented 7 months ago

@weshouck32 Sadly not. I want to try the manual upgrade, where you import each database one by one, when I have time for it, but the upgrade is not our highest priority at the moment.

weshouck32 commented 7 months ago

I was testing the manual method, and it seems to export the database as CSV. When I exported a database, the export file was 10x larger than the database itself. I'm not sure that is a good option.

davidby-influx commented 7 months ago

The necessary clue to what is going on has been suppressed by your logging software:

Feb 29 19:48:32 biogasboard1eu influxd-systemd-start.sh[19059]: ts=2024-02-29T19:48:32.707772Z lvl=info msg="TSI log compaction (end)" log_id=0neLCLPW000 service=storage-engine index=tsi tsi1_partition=3 op_name=tsi1_compact_log_file tsi1_log_file_id=1 op_event=end op_elapsed=12298.524ms
Feb 29 19:49:02 biogasboard1eu systemd-journald[487]: Suppressed 1174405 messages from influxdb.service
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]:         /root/project/tsdb/index/tsi1/partition.go:960 +0x119 fp=0xc215c01fc8 sp=0xc215c01f30 pc=0x7fe5fa25b059
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]: github.com/influxdata/influxdb/v2/tsdb/index/tsi1.(*Partition).Open.func3()
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]:         /root/project/tsdb/index/tsi1/partition.go:254 +0x26 fp=0xc215c01fe0 sp=0xc215c01fc8 pc=0x7fe5fa2553c6
Feb 29 19:49:02 biogasboard1eu influxd-systemd-start.sh[19059]: runtime.goexit()
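
To see where the journal has holes like this, a small script can list every suppression notice and how many messages it dropped. This is only a sketch: `suppressed_gaps` is a hypothetical helper name, and the pattern assumes the notice is worded exactly as in the excerpt above. journald drops messages when a service exceeds its rate limit, which is governed by the `RateLimitIntervalSec` and `RateLimitBurst` settings in journald.conf, so the first lines of the crash output may only be recoverable after relaxing those limits and reproducing the crash.

```python
import re

# Matches notices such as:
#   systemd-journald[487]: Suppressed 1174405 messages from influxdb.service
SUPPRESSED = re.compile(r"Suppressed (\d+) messages from (\S+)")


def suppressed_gaps(lines):
    """Return (line_number, dropped_count, unit) for each suppression notice.

    Line numbers are 1-based positions within the given iterable of lines.
    """
    gaps = []
    for number, line in enumerate(lines, start=1):
        match = SUPPRESSED.search(line)
        if match:
            gaps.append((number, int(match.group(1)), match.group(2)))
    return gaps
```

Run against the excerpt above, this would report a single gap of 1174405 dropped messages from influxdb.service, right where the start of the stack trace should be.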

davidby-influx commented 7 months ago

Have you considered using data export to compressed files and re-importation? That often works better than standard backups for large data sets.

https://docs.influxdata.com/enterprise_influxdb/v1/administration/backup-and-restore/#exporting-and-importing-data

henningWoehr commented 7 months ago

> Have you considered using data export to compressed files and re-importation? That often works better than standard backups for large data sets.
>
> https://docs.influxdata.com/enterprise_influxdb/v1/administration/backup-and-restore/#exporting-and-importing-data

@davidby-influx That's the same as described in the manual upgrade, and that's what I want to try next.

weshouck32 commented 7 months ago

I cloned the production server again and ran the upgrade; this time it worked and the InfluxDB service started successfully. I'm still not sure why the other upgrade attempts failed.

henningWoehr commented 3 months ago

Hello again. After I found the issue https://github.com/influxdata/influxdb/issues/10939, I noticed that we had the same problem: we had about 90 databases containing, I think, around 150 shards. The shards were also very small, only a few MB. After consolidating the databases into far fewer, we now run InfluxDB v2 without any problems.
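
For anyone who wants to check whether their own setup has the same pattern of many small shards, here is a rough sketch that walks the engine data directory and reports shard sizes. It assumes the `data/<bucket>/<retention policy>/<shard id>` layout visible in the log paths above (for example `/datadrive/influxdb2/engine/data/993a17731a4724e8/autogen/1887`). The function names and the 10 MB cutoff are made up for illustration.

```python
from pathlib import Path


def shard_sizes(data_dir):
    """Map (bucket, retention_policy, shard_id) to the shard's size in bytes."""
    sizes = {}
    for shard in Path(data_dir).glob("*/*/*"):
        # Shard directories have numeric names.
        if not shard.is_dir() or not shard.name.isdigit():
            continue
        # Skip any _series directory: its partition subdirectories are also
        # numeric but are not shards.
        if shard.parent.name == "_series":
            continue
        size = sum(f.stat().st_size for f in shard.rglob("*") if f.is_file())
        sizes[(shard.parent.parent.name, shard.parent.name, shard.name)] = size
    return sizes


def small_shards(sizes, threshold_bytes=10 * 1024 * 1024):
    """Return the shards smaller than threshold_bytes (an arbitrary cutoff)."""
    return {key: size for key, size in sizes.items() if size < threshold_bytes}
```

Calling `small_shards(shard_sizes("/datadrive/influxdb2/engine/data"))` with the path from this thread would list every shard under 10 MB. A large share of tiny shards would match the situation described above.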