influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

[0.9.5-nightly] Query performance #4654

Closed pkkummermo closed 8 years ago

pkkummermo commented 8 years ago

Hi!

I've installed the nightly and I'm having huge performance issues which ultimately lead to a crash. I've upgraded the nightly three times (the latest being version 0.9.5-nightly-6682752), just to see if there was a known bug that had since been fixed.

I have the following schema for sales:

We're using version 2.0 of java-influxdb to report to the server, with maybe 1-2 inserts every 4 s, and we use Grafana to visualize the data by querying for the current day every 10 s.

At first everything went fine, but after a while (~24 h) I found that queries which had taken 18 ms now took 200 ms. A day later the same queries took 5 s(!). All of our data is approx. 300 MB in size. The server, a local blade with 64 GB RAM and 48 cores, was running at ~100% on every core. Storage consists of 6 SSDs in RAID.

Is there something obvious I'm doing wrong? Is the schema wrong? Is there a configuration option I've totally missed?

PKK

Configuration file:

### Welcome to the InfluxDB configuration file.

# Once every 24 hours InfluxDB will report anonymous data to m.influxdb.com
# The data includes raft id (random 8 bytes), os, arch, version, and metadata.
# We don't track ip addresses of servers reporting. This is only used
# to track the number of instances running and the versions, which
# is very helpful for us.
# Change this option to true to disable reporting.
reporting-disabled = false

###
### Enterprise registration control
###

[registration]
enabled = false
# url = "https://enterprise.influxdata.com" # The Enterprise server URL
# token = "" # Registration token for Enterprise server

###
### [meta]
###
### Controls the parameters for the Raft consensus group that stores metadata
### about the InfluxDB cluster.
###

[meta]
  dir = "/var/opt/influxdb/meta"
  hostname = "localhost"
  bind-address = ":8088"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"

###
### [data]
###
### Controls where the actual shard data for InfluxDB lives and how it is
### flushed from the WAL. "dir" may need to be changed to a suitable place
### for your system, but the WAL settings are an advanced configuration. The
### defaults should work for most systems.
###

[data]
  dir = "/opt/influxdb_data"

  # Controls the engine type for new shards. Options are b1, bz1, or tsm1.
  # b1 is the 0.9.2 storage engine, bz1 is the 0.9.3 and 0.9.4 engine.
  # tsm1 is the 0.9.5 engine
  engine = "tsm1"

  # The following WAL settings are for the b1 storage engine used in 0.9.2. They won't
  # apply to any new shards created after upgrading to a version > 0.9.3.
  max-wal-size = 104857600 # Maximum size the WAL can reach before a flush. Defaults to 100MB.
  wal-flush-interval = "10m" # Maximum time data can sit in WAL before a flush.
  wal-partition-flush-delay = "2s" # The delay time between each WAL partition being flushed.

  # These are the WAL settings for the storage engine >= 0.9.3
  wal-dir = "/var/opt/influxdb/wal"
  wal-enable-logging = true

  # When a series in the WAL in-memory cache reaches this size in bytes it is marked as ready to
  # flush to the index
  # wal-ready-series-size = 25600

  # Flush and compact a partition once this ratio of series are over the ready size
  # wal-compaction-threshold = 0.6

  # Force a flush and compaction if any series in a partition gets above this size in bytes
  # wal-max-series-size = 2097152

  # Force a flush of all series and full compaction if there have been no writes in this
  # amount of time. This is useful for ensuring that shards that are cold for writes don't
  # keep a bunch of data cached in memory and in the WAL.
  # wal-flush-cold-interval = "10m"

  # Force a partition to flush its largest series if it reaches this approximate size in
  # bytes. Remember there are 5 partitions so you'll need at least 5x this amount of memory.
  # The more memory you have, the bigger this can be.
  # wal-partition-size-threshold = 20971520

  # Whether queries should be logged before execution. Very useful for troubleshooting, but will
  # log any sensitive data contained within a query.
  # query-log-enabled = true

###
### [hinted-handoff]
###
### Controls the hinted handoff feature, which allows nodes to temporarily
### store queued data when one node of a cluster is down for a short period
### of time.
###

[hinted-handoff]
  enabled = true
  dir = "/var/opt/influxdb/hh"
  max-size = 1073741824
  max-age = "168h"
  retry-rate-limit = 0

  # Hinted handoff will start retrying writes to down nodes at a rate of once per second.
  # If any error occurs, it will backoff in an exponential manner, until the interval
  # reaches retry-max-interval. Once writes to all nodes are successfully completed the
  # interval will reset to retry-interval.
  retry-interval = "1s"
  retry-max-interval = "1m"

  # Interval between running checks for data that should be purged. Data is purged from
  # hinted-handoff queues for two reasons: 1) the data is older than the max age, or
  # 2) the target node has been dropped from the cluster. However, data is never purged
  # before it reaches max-age, whether the node was dropped or not.
  purge-interval = "1h"

###
### [cluster]
###
### Controls non-Raft cluster behavior, which generally includes how data is
### shared across shards.
###

[cluster]
  shard-writer-timeout = "10s" # The time within which a shard must respond to write.
  write-timeout = "5s" # The time within which a write operation must complete on the cluster.

###
### [retention]
###
### Controls the enforcement of retention policies for evicting old data.
###

[retention]
  enabled = true
  check-interval = "30m"

###
### Controls the system self-monitoring, statistics and diagnostics.
###
### The internal database for monitoring data is created automatically if
### it does not already exist. The target retention policy within this database
### is called 'monitor' and is also created with a retention period of 7 days
### and a replication factor of 1, if it does not exist. In all cases
### this retention policy is configured as the default for the database.

[monitor]
  store-enabled = true # Whether to record statistics internally.
  store-database = "_internal" # The destination database for recorded statistics
  store-interval = "10s" # The interval at which to record statistics

###
### [admin]
###
### Controls the availability of the built-in, web-based admin interface. If HTTPS is
### enabled for the admin interface, HTTPS must also be enabled on the [http] service.
###

[admin]
  enabled = true
  bind-address = ":8083"
  https-enabled = false
  https-certificate = "/etc/ssl/influxdb.pem"

###
### [http]
###
### Controls how the HTTP endpoints are configured. These are the primary
### mechanism for getting data into and out of InfluxDB.
###

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  log-enabled = true
  write-tracing = false
  pprof-enabled = false
  https-enabled = false
  https-certificate = "/etc/ssl/influxdb.pem"

###
### [[graphite]]
###
### Controls one or many listeners for Graphite data.
###

[[graphite]]
  enabled = false
  # database = "graphite"
  # bind-address = ":2003"
  # protocol = "tcp"
  # consistency-level = "one"
  # name-separator = "."

  # These next lines control how batching works. You should have this enabled
  # otherwise you could get dropped metrics or poor performance. Batching
  # will buffer points in memory if you have many coming in.

  # batch-size = 1000 # will flush if this many points get buffered
  # batch-pending = 5 # number of batches that may be pending in memory
  # batch-timeout = "1s" # will flush at least this often even if we haven't hit buffer limit

  ## "name-schema" configures tag names for parsing the metric name from graphite protocol;
  ## separated by `name-separator`.
  ## The "measurement" tag is special and the corresponding field will become
  ## the name of the metric.
  ## e.g. "type.host.measurement.device" will parse "server.localhost.cpu.cpu0" as
  ## {
  ##     measurement: "cpu",
  ##     tags: {
  ##         "type": "server",
  ##         "host": "localhost",
  ##         "device": "cpu0"
  ##     }
  ## }
  # name-schema = "type.host.measurement.device"

  ## If set to true, when the input metric name has more fields than `name-schema` specified,
  ## the extra fields will be ignored.
  ## Otherwise an error will be logged and the metric rejected.
  # ignore-unnamed = true

###
### [collectd]
###
### Controls the listener for collectd data.
###

[collectd]
  enabled = false
  # bind-address = ""
  # database = ""
  # typesdb = ""

  # These next lines control how batching works. You should have this enabled
  # otherwise you could get dropped metrics or poor performance. Batching
  # will buffer points in memory if you have many coming in.

  # batch-size = 1000 # will flush if this many points get buffered
  # batch-pending = 5 # number of batches that may be pending in memory
  # batch-timeout = "1s" # will flush at least this often even if we haven't hit buffer limit

###
### [opentsdb]
###
### Controls the listener for OpenTSDB data.
###

[opentsdb]
  enabled = false
  # bind-address = ":4242"
  # database = "opentsdb"
  # retention-policy = ""
  # consistency-level = "one"
  # tls-enabled = false
  # certificate= ""

  # These next lines control how batching works. You should have this enabled
  # otherwise you could get dropped metrics or poor performance. Only points
  # received over the telnet protocol undergo batching.

  # batch-size = 1000 # will flush if this many points get buffered
  # batch-pending = 5 # number of batches that may be pending in memory
  # batch-timeout = "1s" # will flush at least this often even if we haven't hit buffer limit

###
### [[udp]]
###
### Controls the listeners for InfluxDB line protocol data via UDP.
###

[[udp]]
  enabled = false
  # bind-address = ""
  # database = "udp"
  # retention-policy = ""

  # These next lines control how batching works. You should have this enabled
  # otherwise you could get dropped metrics or poor performance. Batching
  # will buffer points in memory if you have many coming in.

  # batch-size = 1000 # will flush if this many points get buffered
  # batch-pending = 5 # number of batches that may be pending in memory
  # batch-timeout = "1s" # will flush at least this often even if we haven't hit buffer limit

###
### [shard-precreation]
###
### Controls the precreation of shards, so they are created before data arrives.
### Only shards that will exist in the future, at time of creation, are precreated.
###

[shard-precreation]
  enabled = true
  check-interval = "10m"
  advance-period = "30m"

###
### [continuous_queries]
###
### Controls how continuous queries are run within InfluxDB.
###

[continuous_queries]
  log-enabled = true
  enabled = true
  recompute-previous-n = 2
  recompute-no-older-than = "10m"
  compute-runs-per-interval = 10
  compute-no-more-than = "2m"
otoolep commented 8 years ago

You have neglected to tell us the actual queries you are running. Please include details of those.

otoolep commented 8 years ago

You should also be aware that we are actively working on the query-engine performance and improvements can be expected over the next few weeks.

jackzampolin commented 8 years ago

@pkkummermo If sales is a measurement, then that's very high series cardinality. Some back-of-the-envelope math (30 * 10,000 * 160 * 10) suggests around 480,000,000 possible series in that measurement. I would suggest lowering the number of possible tags and separating the data over two or more measurements.

Our testing indicates that the best performance for your hardware setup is around 100k series per measurement. Getting up towards 1M series begins to seriously slow down both queries and writes.
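The back-of-the-envelope arithmetic above can be sketched as follows. The tag-value counts (30, 10,000, 160, 10) are the figures quoted in the comment, not measured values:

```python
# Worst-case series cardinality is the product of the number of
# distinct values per tag key, assuming every combination occurs.
tag_value_counts = [30, 10_000, 160, 10]  # counts quoted in the comment above

cardinality = 1
for n in tag_value_counts:
    cardinality *= n

print(f"{cardinality:,}")  # 480,000,000 possible series
```

In practice only combinations that are actually written create series, but a schema that *allows* this many combinations can still grow the index very quickly.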

pkkummermo commented 8 years ago

Hi! The queries are simple sum(v1) where time {{today}} group by time(1h). I suspected the tags were the problem, but shouldn't 10k values be a piece of cake to index? I think there's a line in the docs saying tags shouldn't exceed 100k values, so I thought 10k was well within the "expected" limits.

otoolep commented 8 years ago

Can you show us the exact queries?

jackzampolin commented 8 years ago

@pkkummermo The problem isn't the one tag with 10k possible values. In combination with the other tags it creates a very large tag set and as a result a large number of series. So instead of indexing 10k values, InfluxDB is indexing 480 million values.
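The "permutations" point can be made concrete with a small sketch. The tag keys and values below are hypothetical, chosen only to illustrate how combinations multiply:

```python
from itertools import product

# Hypothetical tag keys, each with a handful of distinct values.
tags = {
    "client": ["1999", "2000", "2001"],   # 3 values
    "region": ["no", "se"],               # 2 values
    "product": ["a", "b", "c", "d"],      # 4 values
}

# InfluxDB indexes one series per distinct tag-set combination, so the
# series count is the product of the value counts: 3 * 2 * 4 = 24.
series = list(product(*tags.values()))
print(len(series))  # 24
```

Scale the same multiplication up to tag keys with 30, 10,000, 160, and 10 values and you arrive at the 480 million figure mentioned above.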

pkkummermo commented 8 years ago

Ah, so it's permutations. That would explain the behaviour. Thanks for the tip. I can provide exact queries tomorrow if they are of interest.

beckettsean commented 8 years ago

@pkkummermo please do share the queries, and if you can, the log statements showing the execution time. For example:

[query] 2015/11/04 13:07:55 SELECT mean(value) FROM "telegraf"."default".cpu_usage_guest WHERE time > now() - 1d GROUP BY time(1h)
[http] 2015/11/04 13:07:55 ::1 - - [04/Nov/2015:13:07:55 -0800] GET /query?db=telegraf&q=SELECT+MEAN%28value%29+FROM+cpu_usage_guest+WHERE+time+%3E+now%28%29+-+1d+GROUP+BY+time%281h%29 HTTP/1.1 200 197 - InfluxDBShell/0.9.4.1 1e45aab6-8338-11e5-8148-000000000000 1.737655ms
pkkummermo commented 8 years ago

The queries just before the crash are as follows:

[tsm1wal] 2015/11/05 13:02:34 /opt/influxdb_data/incomestats/default/26 flush to index took 815.24874ms
[http] 2015/11/05 13:02:35 10.248.149.169 - root [05/Nov/2015:13:02:28 +0100] GET /query?db=incomestats&epoch=ms&q=SELECT+sum%28%22value%22%29+FROM+%22income_initial_price_10m%22+WHERE+%22client%22+%3D+%271999%27+AND+time+%3E+1446678000s+and+time+%3C+now%28%29 HTTP/1.1 200 122 http://heimdall:3000/dashboard/db/incomedash Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/45.0.2454.101 Chrome/45.0.2454.101 Safari/537.36 160a3c6b-83b5-11e5-8a6b-000000000000 6.468341299s
[query] 2015/11/05 13:02:35 SELECT initialPrice, monthlyPrice FROM "incomestats"."default".sales WHERE time > now() - 30s
[http] 2015/11/05 13:02:36 10.248.149.169 - root [05/Nov/2015:13:02:28 +0100] GET /query?db=incomestats&epoch=ms&q=SELECT+count%28%22value%22%29+FROM+%22income_initial_price_10m%22+WHERE+time+%3E+1446678000s+and+time+%3C+now%28%29 HTTP/1.1 200 122 http://heimdall:3000/dashboard/db/incomedash Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/45.0.2454.101 Chrome/45.0.2454.101 Safari/537.36 161eeb7b-83b5-11e5-8a6d-000000000000 7.303513539s
[http] 2015/11/05 13:02:36 10.248.149.169 - root [05/Nov/2015:13:02:28 +0100] GET /query?db=incomestats&epoch=ms&q=SELECT+sum%28%22value%22%29+AS+%22value%22+FROM+%22income_initial_price_10m%22+WHERE+time+%3E+1446678000s+and+time+%3C+now%28%29+GROUP+BY+time%281d%29 HTTP/1.1 200 130 http://heimdall:3000/dashboard/db/incomedash Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/45.0.2454.101 Chrome/45.0.2454.101 Safari/537.36 1609a814-83b5-11e5-8a67-000000000000 7.537070515s
[http] 2015/11/05 13:02:36 10.248.149.169 - root [05/Nov/2015:13:02:28 +0100] GET /query?db=incomestats&epoch=ms&q=SELECT+last%28%22initialPrice%22%29+FROM+%22sales%22+WHERE+time+%3E+1446678000s+and+time+%3C+now%28%29 HTTP/1.1 200 110 http://heimdall:3000/dashboard/db/incomedash Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/45.0.2454.101 Chrome/45.0.2454.101 Safari/537.36 160a33e6-83b5-11e5-8a69-000000000000 7.571324992s
[http] 2015/11/05 13:02:36 10.248.149.169 - root [05/Nov/2015:13:02:28 +0100] GET /query?db=incomestats&epoch=ms&q=SELECT+sum%28%22value%22%29+AS+%22Startpris%22+FROM+%22income_initial_price_10m%22+WHERE+time+%3E+1446678000s+and+time+%3C+now%28%29+GROUP+BY+time%281h%29 HTTP/1.1 200 177 http://heimdall:3000/dashboard/db/incomedash Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/45.0.2454.101 Chrome/45.0.2454.101 Safari/537.36 1609a817-83b5-11e5-8a68-000000000000 7.660962265s
[http] 2015/11/05 13:02:36 10.248.149.169 - root [05/Nov/2015:13:02:28 +0100] GET /query?db=incomestats&epoch=ms&q=SELECT+count%28%22initialPrice%22%29+AS+%22initialPrice%22+FROM+%22sales%22+WHERE+time+%3E+1446678000s+and+time+%3C+now%28%29 HTTP/1.1 200 114 http://heimdall:3000/dashboard/db/incomedash Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/45.0.2454.101 Chrome/45.0.2454.101 Safari/537.36 163396d5-83b5-11e5-8a6e-000000000000 7.50127786s
beckettsean commented 8 years ago

@pkkummermo since you are on tsm1 and have ~300MB of data, that translates to roughly 100,000,000 field values, assuming you aren't storing any strings. A SELECT COUNT() over 100 million points is going to take a while.
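The estimate above works out if you assume tsm1 compresses a numeric field value down to roughly 3 bytes on disk; that per-value figure is an assumption for illustration, not something stated in the thread:

```python
# Rough estimate: how many field values fit in ~300 MB of tsm1 data,
# assuming (hypothetically) ~3 bytes per compressed numeric value.
data_size_bytes = 300 * 1024 * 1024   # ~300 MB of on-disk data
bytes_per_value = 3                   # assumed compressed size per field value

estimated_values = data_size_bytes // bytes_per_value
print(f"{estimated_values:,}")  # 104,857,600 -> on the order of 100 million
```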

That said, it seems odd that the CPUs would be pegged as a COUNT is almost exclusively I/O bound. You mentioned "using Grafana to visualize by asking for data within the current day every 10s." I strongly recommend setting up Continuous Queries to downsample the data into a new measurement and then graph that in Grafana. That way you are pulling 100-1000 points from disk for each dashboard query, instead of 100k points.
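A continuous query for that kind of downsampling might look like the following. The database, measurement, and field names are taken from the logs earlier in this thread; the exact aggregation and CQ name are a sketch, not a tested statement:

```sql
-- Sketch: continuously roll raw sales data up into 10-minute sums,
-- so dashboards read the small rollup instead of the raw measurement.
CREATE CONTINUOUS QUERY "income_initial_price_10m_cq" ON "incomestats"
BEGIN
  SELECT sum("initialPrice") AS "value"
  INTO "income_initial_price_10m"
  FROM "sales"
  GROUP BY time(10m)
END
```

Grafana would then query "income_initial_price_10m" rather than "sales", pulling a few hundred pre-aggregated points per dashboard refresh.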

pkkummermo commented 8 years ago

That sounds a bit extreme, as we have v1, v2, v3, v4 per database entry, with perhaps 120k entries total. The SELECT COUNT(*) is also limited by time, which means it shouldn't query all the data. I tried creating both a 10m and a 1h continuous query, but the performance improvement was minuscule.

My best guess is that the indexing of the tag set is my problem, especially if it reindexes on every insert and tag indexes are permutations of the available tags. The InfluxDB docs should have a section describing "best practices" for database design so other people don't run into the same problem as me :)

beckettsean commented 8 years ago

@pkkummermo totally agreed on the better docs for schema and performance. I did miss the time restrictions in your queries (I'm having a hard time reading queries today for some reason).