Also referencing this Issue: https://github.com/influxdb/influxdb/issues/4240
I also ran into an issue writing into the default store. It appears that sharding only happens when a retention policy with a finite duration is active (the default is infinite). I ended up setting the default to 1 hour and using continuous queries to downsample into 1M groups for what I needed.
@parisholley did you allow telegraf to create the policy on its own, or did you manually adjust the params?
@sebito91 I'm not using telegraf; the default InfluxDB behavior is to create the default retention policy when a new database is made. I ended up disabling that behavior (through the conf) and creating the policy myself. Make sure you run:
show shards
to validate that it is going to expire within the timeline you set, otherwise it will never preallocate or rotate.
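For reference, here is a minimal sketch of that manual setup in InfluxQL; the policy, measurement, and field names are placeholders rather than anything from this cluster, and the 1h/1m durations just mirror the approach described above:

CREATE RETENTION POLICY "raw" ON "telegraf" DURATION 1h REPLICATION 3 DEFAULT
CREATE RETENTION POLICY "downsampled" ON "telegraf" DURATION 90d REPLICATION 3
CREATE CONTINUOUS QUERY "cq_cpu_1m" ON "telegraf" BEGIN SELECT mean(value) INTO "downsampled".cpu_mean_1m FROM cpu GROUP BY time(1m) END

With a finite default policy in place, show shards should start listing shard groups that actually expire instead of one never-expiring shard.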
@parisholley, thanks! We'll let the data run for a while to see if the shards actually get generated:
> show retention policies on telegraf
name duration replicaN default
default 0 3 false
telegraf 24h0m0s 3 true
> show shards
name: telegraf
--------------
id start_time end_time expiry_time owners
1 2015-09-28T00:00:00Z 2015-10-05T00:00:00Z 2015-10-05T00:00:00Z 2,3,1
2 2015-10-01T18:00:00Z 2015-10-01T19:00:00Z 2015-10-02T19:00:00Z 3,1,2
3 2015-10-01T19:00:00Z 2015-10-01T20:00:00Z 2015-10-02T20:00:00Z 1,2,3
name: _internal
---------------
id start_time end_time expiry_time owners
4 2015-10-01T00:00:00Z 2015-10-02T00:00:00Z 2015-10-09T00:00:00Z 2
5 2015-10-01T00:00:00Z 2015-10-02T00:00:00Z 2015-10-09T00:00:00Z 3
6 2015-10-01T00:00:00Z 2015-10-02T00:00:00Z 2015-10-09T00:00:00Z 1
So with @parisholley's guidance, we do see shards getting created on the hour (likely based on our config). That being said, the other 4 questions still remain and throughput issues are still present.
Turning off the WAL is not possible, and we are currently developing a new storage engine (which will be available in an experimental state very soon). Work on the new engine will be our priority.
Replication to other nodes is synchronous, but it does happen in parallel. I'm not quite sure what you're asking here.
I understand the issue you are getting at, and we have discussed it here. Can you open a ticket, as a feature request?
We are very aware that many people, when pushing the system, are experiencing issues with the current WAL. This is why you may be interested in the new engine, again experimental, which will be available soon. In the meantime you may want to switch back to the "b1" engine, as some people report better operation with that build. You need to run a nightly or 0.9.4.2 to have control over the default engine.
https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L41
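As a rough sketch of what that toggle looks like, assuming the engine option that the linked line of config.sample.toml refers to (the key and value names here are my assumption, so check the sample for your exact build):

[data]
  dir = "/data/influxdb/metrics"
  # assumed option: selects the storage engine used for newly created shards,
  # e.g. the older "b1" engine mentioned above
  engine = "b1"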
On Thu, Oct 1, 2015 at 10:38 AM, Sebastian Borza notifications@github.com wrote:
We have a 0.9.4.1 cluster (stable release) of three nodes with a default retention policy as defined by telegraf (v0.1.9, also stable release), and it seems unable to keep up with the data load. In particular, we're seeing thousands of messages with the error below, and no extra shard created:
[tcp] 2015/10/01 12:07:16 process write shard error: write shard 1: engine: write points: write throughput too high. backoff and retry
Leaving the metrics to collect overnight, we returned in the morning to see one single shard at >95GB on each node. We've combed through a lot of the posts, including tweaks like #3885 https://github.com/influxdb/influxdb/issues/3885, to no avail. Even turning off HH and drastically reducing the WAL knobs, we end up seeing a lot of gaps in the data, if it's committed at all.
Reading up on #4086 https://github.com/influxdb/influxdb/issues/4086 and similar threads, I'm curious if we're bumping up against the locking mechanism during WAL flush. It seems that once we trigger a flush, we block all incoming read/write requests and just never recover. Unfortunately our setup represents only a fraction of the nodes we need to collect from, so currently the backend is just unable to keep up.
Key Questions:
- Can we turn off the WAL at all, buffering into memory/cache instead? We have incredibly powerful machines (it's not a hardware problem) that can handle this type of throughput, provided we can get it to the disks...
- Does shard-precreation actually work? It doesn't seem to generate more than 1 shard, ever.
- To improve throughput, does it make sense to tweak the LB to be sticky until a node dies? Will replication occur async to each of the cluster nodes?
- When can we enable shards across more than 3 nodes? This is a massive limitation at this point, and we can only scale up so far.
- If you want to comb logs, do you have a dropbox we can send them to? We're quite eager to get this working properly...
CONFIG:
- ~1200 machines reporting through haproxy (roundrobin), ~60Mbit/sec
- 64bit RHEL 7.1
- default telegraf collectors (v0.1.9 stable): cpu disk io mem net swap system
- collect every 10s
- three backend influxdb nodes (v0.9.4.1 stable, very beefy SSD-backed machines, 256GB RAM each)
# Welcome to the InfluxDB configuration file.

# Once every 24 hours InfluxDB will report anonymous data to m.influxdb.com
# The data includes raft id (random 8 bytes), os, arch, version, and metadata.
# We don't track ip addresses of servers reporting. This is only used
# to track the number of instances running and the versions, which
# is very helpful for us.
# Change this option to true to disable reporting.
reporting-disabled = true

# Controls the parameters for the Raft consensus group that stores metadata
# about the InfluxDB cluster.
[meta]
  dir = "/data/influxdb/meta"
  hostname = "testmetricsl6"
  bind-address = ":8088"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"
  cluster-tracing = true

# Controls where the actual shard data for InfluxDB lives.
[data]
  dir = "/data/influxdb/metrics"
  max-wal-size = 10485760
  wal-flush-interval = "1m"
  wal-partition-flush-delay = "2s"
  wal-dir = "/data/influxdb/wal"
  wal-logging-enabled = true
  wal-ready-series-size = 1024
  wal-compaction-threshold = 0.5
  wal-max-series-size = 1048576
  wal-flush-cold-interval = "5s"
  wal-partition-size-threshold = 10485760

# Controls non-Raft cluster behavior, which generally includes how data is
# shared across shards.
[cluster]
  force-remote-mapping = false
  write-timeout = "2s"
  shard-writer-timeout = "2s"
  shard-mapper-timeout = "2s"

# Controls the enforcement of retention policies for evicting old data.
[retention]
  enabled = true
  check-interval = "10m0s"

[shard-precreation]
  enabled = true
  check-interval = "30s"
  advance-period = "60m0s"

# Controls the availability of the built-in, web-based admin interface.
[admin]
  enabled = true
  bind-address = ":8083"
  https-enabled = false
  https-certificate = "/etc/ssl/influxdb.pem"

# Controls how the HTTP endpoints are configured. These are the primary
# mechanism for getting data into and out of InfluxDB.
[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  log-enabled = true
  write-tracing = false
  pprof-enabled = false
  https-enabled = false
  https-certificate = "/etc/ssl/influxdb.pem"

# Controls one or many listeners for Graphite data.
[[graphite]]
  enabled = false
  bind-address = ":2003"
  database = "graphite"
  protocol = "tcp"
  batch-size = 1000
  batch-timeout = "1s"
  tcp-conn-timeout = "15m0s"
  consistency-level = "one"
  separator = "."

# Controls the listener for collectd data.
[collectd]
  enabled = false
  bind-address = ":25826"
  database = "collectd"
  retention-policy = ""
  batch-size = 5000
  batch-pending = 5
  batch-timeout = "10s"
  typesdb = "/usr/share/collectd/types.db"

# Controls the listener for OpenTSDB data.
[opentsdb]
  enabled = false
  bind-address = ":4242"
  database = "opentsdb"
  retention-policy = ""
  consistency-level = "one"
  tls-enabled = false
  certificate = "/etc/ssl/influxdb.pem"
  batch-size = 1000
  batch-pending = 5
  batch-timeout = "1s"

[monitor]
  store-enabled = false
  store-database = "_internal"
  store-interval = "10s"

# Controls how continuous queries are run within InfluxDB.
[continuous_queries]
  enabled = true
  log-enabled = true
  recompute-previous-n = 2
  recompute-no-older-than = "10m0s"
  compute-runs-per-interval = 10
  compute-no-more-than = "2m0s"

# Controls the hinted handoff feature, which allows nodes to temporarily
# store queued data when one node of a cluster is down for a short period
# of time.
[hinted-handoff]
  enabled = false
  dir = "/data/influxdb/hh"
  max-size = 1073741824
  max-age = "168h0m0s"
  retry-rate-limit = 0
  retry-interval = "1s"
— Reply to this email directly or view it on GitHub https://github.com/influxdb/influxdb/issues/4289.
To rephrase my question about the load balancer, how does a write effectively mark itself as complete if we enable replication across each of the three nodes?
As I understand it today, the write lands on one node via the load balancer, is committed to that node's WAL, and is then replicated to the other two nodes before the client gets an acknowledgement.
I'm curious because I wonder if we end up blocking writes to the WAL file until that whole sequence is done, for each and every connection that's made? Or does the receiving node acknowledge immediately and replicate to the other nodes asynchronously in the background?
If it's the former, then does it make sense to update the load-balancer algorithm to something other than roundrobin (i.e. hash the destination and persist the mapping until a node dies)? If it's the latter, does it make sense to just have all writes go to one node and let raft handle the replication (i.e. no need for a load balancer at all)?
It's the former, almost. The node receiving the data (the co-ordinator) writes to all nodes in parallel (one of which may be the node itself) and waits for all nodes to respond (or timeout). The co-ordinator doesn't do anything special like write to its WAL first. Each node gets the data and writes to its store.
If you want to spread load across the cluster, and your cluster is fully replicated, then pick a node at random for each write. Replication is handled for you whether you write to the same node forever, or jump around.
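To make that write path concrete, here is a rough sketch of the pattern described above. This is not InfluxDB's actual code; the node names, the sample point, and the 2s timeout (mirroring the write-timeout in the config) are just illustrative:

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// writeToNode stands in for the per-node shard write plus the network hop.
func writeToNode(node, point string) error {
	time.Sleep(10 * time.Millisecond) // pretend I/O
	fmt.Printf("%s stored %q\n", node, point)
	return nil
}

// coordinateWrite fans the point out to every owner in parallel and only
// returns once every node has responded, or the timeout fires.
func coordinateWrite(nodes []string, point string, timeout time.Duration) error {
	errs := make(chan error, len(nodes))
	var wg sync.WaitGroup
	for _, n := range nodes {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			errs <- writeToNode(n, point)
		}(n)
	}

	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()

	select {
	case <-done:
	case <-time.After(timeout):
		return errors.New("timed out waiting for replicas")
	}

	close(errs)
	for err := range errs {
		if err != nil {
			return err // any replica failing fails the write
		}
	}
	return nil
}

func main() {
	nodes := []string{"node1", "node2", "node3"}
	if err := coordinateWrite(nodes, "cpu,host=web01 value=0.64", 2*time.Second); err != nil {
		fmt.Println("write failed:", err)
	}
}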