Closed jwilder closed 6 years ago
Proposal added for TSI (Time-Series Index) file format: https://github.com/influxdata/influxdb/pull/7174
Problem statement/requirements docs: #7151
We are getting hit by this pretty hard and I am wondering if there is any way we can prevent influx from consuming all RAM and then getting killed. Is there some setting we can tweak to help with this? I'd be happy with lower performance if it meant that the service stayed up.
For reference: https://twitter.com/lisiewski/status/793504279063506944
@sorrison 1.1 has a number of memory improvements related to queries, but high memory usage in queries or writes is usually due to schema design issues. Two common problems are querying across too many shards (e.g. shard duration is too low) as well as writing high cardinality tag values and querying too many series at once.
There are a few limits you can enable to prevent high cardinality data from being written or being queried.
In 1.0, there is max-series-per-database, which limits the number of series per database to 1M by default.
[data]
max-series-per-database = 1000000
In 1.1, there is a max-values-per-tag limit that drops values that would cause the cardinality of any one tag to exceed the limit:
[data]
max-values-per-tag = 100000
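As a rough illustration of what these two limits measure, the sketch below counts series and per-tag values for a batch of points. It is not InfluxDB's implementation: the function name `cardinality_report` and the point layout (a measurement name plus a dict of tags) are assumptions made for the example. A series is taken to be a unique combination of measurement and tag set, and tag values are counted per measurement and tag key, which is also an assumption.

```python
from collections import defaultdict

def cardinality_report(points, max_series=1_000_000, max_values_per_tag=100_000):
    """Count unique series and per-tag-key values for (measurement, tags) pairs.

    Hypothetical helper for illustration only; the defaults mirror the
    config values quoted above.
    """
    series = set()
    tag_values = defaultdict(set)
    for measurement, tags in points:
        # A series key is the measurement plus its sorted tag set.
        series.add((measurement, tuple(sorted(tags.items()))))
        for key, value in tags.items():
            tag_values[(measurement, key)].add(value)
    return {
        "series": len(series),
        "series_over_limit": len(series) > max_series,
        "tags_over_limit": sorted(
            k for k, v in tag_values.items() if len(v) > max_values_per_tag
        ),
    }

# Ten points: three distinct hosts, but a unique metric_id on every point.
points = [("cpu", {"host": f"h{i % 3}", "metric_id": str(i)}) for i in range(10)]
report = cardinality_report(points, max_series=5, max_values_per_tag=4)
```

With the deliberately tiny limits used here, the unique `metric_id` tag is what pushes both counts over, which is the pattern the limits are meant to catch.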
For queries, there are a few others:
[coordinator]
max-concurrent-queries = 0 # limits the number of concurrently running queries
query-timeout = "0s" # limits the length of time a query can execute before being killed
log-queries-after = "0s" # logs queries that run longer than the threshold
max-select-point = 0 # kills any queries that would select too many points
max-select-series = 0 # kills queries that would involve selecting from too many series at once
max-select-buckets = 0 # kills queries that would create too many group by buckets
If you are having performance issues, please log a new issue using the instructions for a bug report. In order to help, we need all the information requested in the instructions.
Thanks @jwilder. I am currently developing a driver for Gnocchi (part of OpenStack) https://github.com/openstack/gnocchi and am dealing with a large amount of data. Basically I have lots of metrics going into influx. Originally I put each metric into its own measurement, but I wanted to do 3 levels of downsampling and didn't want to have 3 continuous queries per measurement (we have on the order of 100,000s of metrics). So now they all go into one measurement with a tag for metric id, and I run the continuous queries on that one measurement.
I thought having more tag values would be better than having more continuous queries?
Sorry for putting this all in this bug. Is there a better place to discuss these kinds of things? IRC?
Just installed the 1.1 RC and it is working well so far, although at the moment it takes about a week for it to die and need restarting. (We are running on a host with 24 cores and 96G RAM.)
@sorrison I tried doing something similar earlier this year with influx. In the end I have grouped together related metrics into separate measurements.
I also moved away from continuous queries; instead I build the downsampled data at the same time as I write the raw data, which seems to work really well.
Although I am still looking forward to the tag index being cached to disk as at the moment I am storing the data over three separate influxdb instances.
@ivanscattergood When you said that you build the downsampled data at the same time, do you mean that you execute a query and then save the aggregated results into a different retention policy? If so, how do you schedule that query?
Thanks in advance.
Hi,
I use a java client to collect the data and I aggregate it within that code.
I save one summary of data every minute and then a summary every hour.
Currently this allows me to visualise 7 million unique series from 3 months down to 1 minute.
We are expecting to treble the amount of data we visualise over the next 3 months.
Ivan
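A minimal sketch of the kind of client-side aggregation described above, written in Python rather than Java for brevity. The function name `downsample`, the `(series_key, unix_ts, value)` point layout, and the choice of mean/min/max/count as the summary are all assumptions; the comment does not say which aggregates are stored.

```python
from collections import defaultdict

def downsample(points, window_seconds):
    """Group (series_key, unix_ts, value) points into fixed windows and summarise them."""
    buckets = defaultdict(list)
    for series_key, ts, value in points:
        # Align each timestamp to the start of its window.
        window_start = ts - (ts % window_seconds)
        buckets[(series_key, window_start)].append(value)
    return {
        key: {
            "mean": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
            "count": len(vals),
        }
        for key, vals in buckets.items()
    }

# Raw points held in the client, summarised once per minute and once per hour.
raw = [
    ("dev1.cpu", 0, 1.0),
    ("dev1.cpu", 30, 3.0),
    ("dev1.cpu", 60, 5.0),
    ("dev1.cpu", 3600, 7.0),
]
per_minute = downsample(raw, 60)
per_hour = downsample(raw, 3600)
```

Each summary would then be written to the database alongside the raw data, so no query has to read the raw points back in order to produce the rollups.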
Hi Ivan, thanks for your quick reply. When the java client collects the data, do you execute just one query to retrieve all data, or do you execute multiple queries? I have 200K devices (each with 10 metrics), and data is collected for all devices every 5 minutes. So every 5 minutes I have 200K data points, each with one tag (deviceId) and 10 fields (one for each metric). If I try to compress the data every hour (2.4M data points) using a Continuous Query, either it never returns or it crashes the server. I wonder how you get the data using your java client.
Thanks in advance.
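The hourly volume quoted above can be checked with a few lines of arithmetic. The variable names are illustrative, and the series count assumes a single measurement with deviceId as the only tag, as described in the comment.

```python
devices = 200_000
fields_per_point = 10
interval_minutes = 5

intervals_per_hour = 60 // interval_minutes                 # collections per hour
points_per_hour = devices * intervals_per_hour              # points an hourly rollup must read
field_values_per_hour = points_per_hour * fields_per_point  # individual values in those points
series = devices  # one series per deviceId, assuming a single measurement and no other tags
```

So an hourly continuous query over this schema has to touch every one of the 200K series and aggregate 2.4M points carrying 24M field values.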
Hi,
I cache the data in the java client rather than re-querying the data. I was using an earlier version of Influxdb at the time I made that change (version 0.9) and I did this to work around the DB crashing.
I see, so no queries to retrieve the data. BTW, are you still using InfluxDB?
Thanks.
Yes still using influxdb
This appears to be a problem for things such as Heapster (kubernetes/heapster#605) & Kubernetes (kubernetes/kubernetes#27630) metrics, which appear to use a lot of tags. Based on the pod memory usage pattern for InfluxDB when running in a Kubernetes cluster with Heapster populating data into InfluxDB, memory use appears to grow with cluster activity: more metrics are stored, and ephemeral pods starting and stopping create more tag values, so memory keeps growing until it hits the OOM limit. At this point Kubernetes shows Last State: Terminated, Reason: OOMKilled, and the pod restarts to enjoy its next limited lifespan until the next OOMKilled event.
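The growth pattern described above can be sketched with a small simulation. It assumes each pod contributes one series tagged with its pod name, and that series for stopped pods stay in the index until their data expires; the function `simulate_pod_churn` and its parameters are hypothetical.

```python
from collections import deque

def simulate_pod_churn(live_pods, replaced_per_hour, hours):
    """Return (live pod count, distinct pod tag values ever written)."""
    live = deque(range(live_pods))
    seen = set(live)
    next_id = live_pods
    for _ in range(hours):
        for _ in range(replaced_per_hour):
            live.popleft()        # the oldest pod stops
            live.append(next_id)  # a replacement starts under a fresh pod name
            seen.add(next_id)     # its tag value is new to the index
            next_id += 1
    return len(live), len(seen)

live, indexed = simulate_pod_churn(live_pods=100, replaced_per_hour=20, hours=24)
```

The number of running pods stays flat while the number of indexed tag values keeps climbing, which matches the steadily rising memory usage followed by an OOM kill.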
@trinitronx that's one of the key use cases this is designed to support
Do you know when this will be available in nightly builds?
@ivanscattergood there's been significant work on this so hopefully soon. No set date though.
This feature would really help with handling clickstream data :)
Storage and query level support is available in nightly and will be present for opt-in in 1.3.0. There is additional work required to support SHOW commands for high cardinality data and to integrate some enterprise auth features into TSI.
I'm removing this issue from the 1.3.0 milestone and leaving it open for 1.4 / future work where we will finish up the remaining bits and enable TSI by default.
More information on the current state is available on the blog: https://www.influxdata.com/path-1-billion-time-series-influxdb-high-cardinality-indexing-ready-testing/
TSI shipped in 1.5. It is not currently enabled by default.
Feature Request
The database should be able to support higher levels of cardinality for tags and series. Currently, the full tag set is loaded into an in-memory index for fast query planning. When tags with a large number of values are written, the in-memory index can consume more memory than is available on the host.
Proposal:
The database should not require loading the full tag set into an in-memory index. Higher cardinality series and tags should be able to be stored and queried and not be limited by the amount of RAM on the host.
Current behavior:
Currently, high cardinality data causes the process memory usage to grow quickly, increasing the chances of an OOM. It also slows startup times, as the database needs to scan all the stored data to re-create the in-memory index.
Users also frequently write high cardinality tag data by mistake causing the server to crash. When in this state, removing the problem data is very difficult as well.
Desired behavior:
Storing high-cardinality data should not cause the process to OOM or adversely affect startup times. Query performance should not be adversely affected by higher cardinality data either.
Use case:
It is more natural and convenient to be able to store higher cardinality data at times. For example, some tag data is ephemeral in nature (Docker container IDs), but can contribute to high cardinality data issues over time.
Documentation