influxdata / docs-v2

InfluxData Documentation that covers InfluxDB Cloud, InfluxDB OSS 2.x, InfluxDB OSS 1.x, InfluxDB Enterprise, Telegraf, Chronograf, Kapacitor, and Flux.
https://docs.influxdata.com
MIT License
72 stars 269 forks source link

Warning about large numbers of databases on Enterprise Clusters #3108

Open xe-nvdk opened 3 years ago

xe-nvdk commented 3 years ago

Large numbers of databases on a cluster can indeed cause performance issues. The primary driver of this behavior is total shard count although there are other contributing factors. As a general rule, it is best to stay below 1000 shards per node on a cluster. Let’s take a look at a few possible arrangements to see how database count affects shard count.

The number of databases/(RP) is a direct multiple for the number of shards

2 node, 5 databases, RF 2, 52 Week retention, 1 week shard duration = ~520 shards per node
6 node, 5 databases, RF 2, 52 Week retention, 1 week shard duration = ~520 shards per node

2 node, 50 databases, RF 2, 52 Week retention, 1 week shard duration = ~5200 shards per node
6 node, 50 databases, RF 2, 52 Week retention, 1 week shard duration = ~5200 shards per node

Longer shard group durations will help but generally not enough. They also bring higher costs to query the data, a larger memory footprint, higher disk IOPS/throughput usage, and more difficult compactions.

2 node, 5 databases, RF 2, 52 Week retention, 2 week shard duration = ~260 shards per node
2 node, 5 databases, RF 2, 52 Week retention, 4 week shard duration = ~130 shards per node

2 node, 50 databases, RF 2, 52 Week retention, 2 week shard duration = ~2600 shards per node
2 node, 50 databases, RF 2, 52 Week retention, 4 week shard duration = ~1300 shards per node

Greater replication will also hurt as the size of shards on each node will increase. Unbalanced replication (5 node cluster with RF=2) will cause a dramatic increase in shard counts. Large amounts of shards also cause a significant issue in the performance of TSI. At counts of greater than 1000, we should expect any performance gained by TSI to be outweighed by the overhead.

The effects of a high shard count can be mitigated by adding more resources but this will inevitably hit bounds imposed by either hardware availability, hosting costs, or license limitations.

In short - Lots of shards can increase the resource requirements needed to run the cluster.

one more thing to mention - hot shards are more impactful than cold shards. More databases = more hot shards at any given moment

from InfluxData Support

lwandzura commented 2 years ago

@kelseiv My recommendation is to create a separate best practices doc for enterprise so that all best practices are on one page - thoughts?