
influxdb suddenly stops compacting a shard #20627

Open · phill84 opened 3 years ago

phill84 commented 3 years ago

Steps to reproduce:

  1. restore a backup of the problematic shard

Expected behavior: This database is sharded by day and has about 200GB of data per day. There aren't more points/tags/measurements in this problematic shard than in the others, so I would expect it to be around 200GB as well.

Actual behavior: This shard stays at 580GB and doesn't shrink, even after being cold for days. Usually we have a bit more than 100 tsm files in a shard, but for this day there are more than 12k tsm files, most of which are at compaction level 1. The biggest tsm file group has stayed at level 2 (82 tsm files, 174GB in total) for days.
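
For reference, a rough way to get that per-level count from a shard directory. This is a sketch that assumes the usual tsm1 naming scheme, `<generation>-<sequence>.tsm`, where the sequence number reflects the compaction level (level 1 files come from cache snapshots, higher levels from compactions):

```go
// count_tsm_levels.go — count .tsm files per compaction level in a shard
// directory, assuming <generation>-<sequence>.tsm naming where the
// sequence number reflects the compaction level.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	shardDir := os.Args[1] // e.g. /data/influx/data/<db>/<rp>/679 (path is an example)
	entries, err := os.ReadDir(shardDir)
	if err != nil {
		panic(err)
	}
	counts := map[int]int{}
	for _, e := range entries {
		name := e.Name()
		if filepath.Ext(name) != ".tsm" {
			continue
		}
		// "000000123-000000002.tsm" -> sequence/level 2
		parts := strings.SplitN(strings.TrimSuffix(name, ".tsm"), "-", 2)
		if len(parts) != 2 {
			continue
		}
		level, err := strconv.Atoi(parts[1])
		if err != nil {
			continue
		}
		counts[level]++
	}
	for level, n := range counts {
		fmt.Printf("level %d: %d files\n", level, n)
	}
}
```

Run against the restored shard directory, this should show the ~12k level-1 files directly.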

Environment info:

Config:

[data]
  dir = "/data/influx/data"
  wal-dir = "/data/influx/wal"
  index-version = "tsi1"
  query-log-enabled = true
  # max size of a shard's in-memory cache before writes are rejected
  cache-max-memory-size = "5g"
  # at most 5 compactions may run at once, shared across all shards
  max-concurrent-compactions = 5
  # rate limit on compaction disk writes, with bursts up to 400m
  compact-throughput = "200m"
  compact-throughput-burst = "400m"

This problematic shard is from January 19th. The amount of data grew at a steady rate until 17:30, after which diskBytes skyrocketed. As I mentioned above, this database is sharded by day. Here is a chart of diskBytes for the Jan 18th shard (id 676) and the Jan 19th shard (id 679):

[chart: diskBytes for shards 676 and 679]
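
A chart like this can be pulled from the `_internal` monitoring database. A minimal sketch in Go, assuming a default 1.x setup where per-shard stats land in the `"monitor"."shard"` measurement with a `diskBytes` field and an `id` tag (the address, measurement, and field names are assumptions based on default monitoring; they may differ by version):

```go
// disk_bytes.go — sketch: pull per-shard diskBytes from _internal,
// assuming default 1.x monitoring is enabled.
package main

import (
	"fmt"

	client "github.com/influxdata/influxdb1-client/v2"
)

func main() {
	c, err := client.NewHTTPClient(client.HTTPConfig{
		Addr: "http://localhost:8086", // assumption: default address
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	// 10-minute resolution over the last day for shards 676 and 679.
	q := client.NewQuery(
		`SELECT max("diskBytes") FROM "monitor"."shard"
		 WHERE ("id" = '676' OR "id" = '679') AND time > now() - 1d
		 GROUP BY time(10m), "id"`,
		"_internal", "s")
	resp, err := c.Query(q)
	if err != nil {
		panic(err)
	}
	if resp.Error() != nil {
		panic(resp.Error())
	}
	for _, r := range resp.Results {
		for _, s := range r.Series {
			fmt.Printf("shard %s: %d points\n", s.Tags["id"], len(s.Values))
		}
	}
}
```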

After 17:30, no (new) Level 1 to Full compactions were happening on the Jan 19th shard:

[chart: compactions]

In the days after, I did see in the influxdb logs that it tried to compact this shard, but the biggest group never got past level 2. The whole influxdb server became really slow due to queries on this shard, so I have dropped it on the production database. Here are the contents of a restored backup of this day: https://gist.github.com/phill84/6b795531b91625fdeacf4be880833eb7

phill84 commented 3 years ago

So it looks like compaction "stopped" because I set max-concurrent-compactions = 5, and at that time all 5 slots were taken by some old shards, possibly because of backloading of data from previous weeks. However, it still seems strange to me that after restoring this problematic shard to a spare server, it still doesn't get compacted.
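
As a simplified model of the starvation (not InfluxDB's actual code): the compaction slots behave like a fixed-size semaphore shared across all shards, so if long-running compactions on old shards hold every slot, a non-blocking attempt to take a slot for the hot shard keeps failing, and its level-1 files pile up:

```go
// limiter_model.go — simplified model of a fixed-size compaction limiter
// shared across all shards; illustrates how max-concurrent-compactions = 5
// can starve one shard. Not InfluxDB's actual implementation.
package main

import "fmt"

// fixed is a counting semaphore with a hard capacity.
type fixed chan struct{}

func newFixed(limit int) fixed { return make(fixed, limit) }

// tryTake grabs a slot without blocking; returns false if none are free.
func (f fixed) tryTake() bool {
	select {
	case f <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a previously taken slot.
func (f fixed) release() { <-f }

func main() {
	slots := newFixed(5) // max-concurrent-compactions = 5

	// Backloaded old shards take every slot and hold them.
	for i := 0; i < 5; i++ {
		slots.tryTake()
	}

	// The hot Jan 19th shard never gets a slot, so its planned
	// compactions are skipped each cycle and level-1 files accumulate.
	if !slots.tryTake() {
		fmt.Println("shard 679: no compaction slot free, skipping")
	}

	// Once an old shard finishes, a slot frees up again.
	slots.release()
	if slots.tryTake() {
		fmt.Println("shard 679: got a slot, compaction can proceed")
	}
}
```

That would explain the production server, but not why the shard stays uncompacted after being restored to an otherwise idle spare server.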