influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

[0.9.4.1] Shards not getting created, throwing write failed errors (throughput too high) #4289

Closed. sebito91 closed this issue 9 years ago

sebito91 commented 9 years ago

We have a 0.9.4.1 cluster (stable release) of three nodes with a default retention policy as defined by telegraf (v0.1.9, also stable release), and it seems unable to keep up with the data load. In particular, we're seeing thousands of messages like the error below, and no extra shard created:

[tcp] 2015/10/01 12:07:16 process write shard error: write shard 1: engine: write points: write throughput too high. backoff and retry

Leaving the metrics to collect overnight, we returned in the morning to see a single shard at >95GB on each node. We've combed through a lot of the posts, including tweaks like #3885, to no avail. Even turning off HH and drastically reducing the WAL knobs still leaves a lot of gaps in the data, if it's committed at all.

Reading up on #4086 and similar threads, I'm curious if we're bumping up against the locking mechanism during WAL flush. It seems that once we trigger a flush, we block all incoming read/write requests and just never recover. Unfortunately, our setup represents only a fraction of the nodes we need to collect from, so the backend is currently just unable to keep up.

Key Questions:

  1. Can we turn off the WAL entirely and buffer into memory/cache instead? We have incredibly powerful machines (it's not a hardware problem) that can handle this type of throughput, provided we can get it to the disks...
  2. @parisholley, thanks! Does shard-precreation actually work? It doesn't seem to generate more than one shard, ever.
  3. To improve throughput, does it make sense to tweak the LB to be sticky until a node dies? Will replication occur asynchronously to each of the cluster nodes?
  4. When can we enable shards across more than 3 nodes? This is a massive limitation at this point, and we can only scale up so far.
  5. If you want to comb logs, do you have a drop box we can send them to? We're quite eager to get this working properly...

CONFIG:

  • ~1200 machines reporting through haproxy (roundrobin), ~60Mbit/sec
  • 64bit RHEL 7.1
  • default telegraf collectors (v0.1.9 stable): cpu disk io mem net swap system
  • collect every 10s
  • three backend influxdb nodes (v0.9.4.1 stable, very beefy SSD-backed machines, 256GB RAM each)

### Welcome to the InfluxDB configuration file.

# Once every 24 hours InfluxDB will report anonymous data to m.influxdb.com
# The data includes raft id (random 8 bytes), os, arch, version, and metadata.
# We don't track ip addresses of servers reporting. This is only used
# to track the number of instances running and the versions, which
# is very helpful for us.
# Change this option to true to disable reporting.
reporting-disabled = true

###
### [meta]
###
### Controls the parameters for the Raft consensus group that stores metadata
### about the InfluxDB cluster.
###

[meta]
  dir = "/data/influxdb/meta"
  hostname = "testmetricsl6"
  bind-address = ":8088"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"
  cluster-tracing = true

###
### [data]
###
### Controls where the actual shard data for InfluxDB lives.
###

[data]
  dir = "/data/influxdb/metrics"
  max-wal-size = 10485760
  wal-flush-interval = "1m"
  wal-partition-flush-delay = "2s"
  wal-dir = "/data/influxdb/wal"
  wal-logging-enabled = true
  wal-ready-series-size = 1024 
  wal-compaction-threshold = 0.5
  wal-max-series-size = 1048576
  wal-flush-cold-interval = "5s"
  wal-partition-size-threshold = 10485760 

###
### [cluster]
###
### Controls non-Raft cluster behavior, which generally includes how data is
### shared across shards.
###

[cluster]
  force-remote-mapping = false 
  write-timeout = "2s"
  shard-writer-timeout = "2s"
  shard-mapper-timeout = "2s"

###
### [retention]
###
### Controls the enforcement of retention policies for evicting old data.
###

[retention]
  enabled = true
  check-interval = "10m0s"

[shard-precreation]
  enabled = true
  check-interval = "30s"
  advance-period = "60m0s"

###
### [admin]
###
### Controls the availability of the built-in, web-based admin interface.
###

[admin]
  enabled = true
  bind-address = ":8083"
  https-enabled = false
  https-certificate = "/etc/ssl/influxdb.pem"

###
### [http]
###
### Controls how the HTTP endpoints are configured. These are the primary
### mechanism for getting data into and out of InfluxDB.
###

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  log-enabled = true
  write-tracing = false
  pprof-enabled = false
  https-enabled = false
  https-certificate = "/etc/ssl/influxdb.pem"

###
### [[graphite]]
###
### Controls one or many listeners for Graphite data.
###

[[graphite]]
  enabled = false
  bind-address = ":2003"
  database = "graphite"
  protocol = "tcp"
  batch-size = 1000
  batch-timeout = "1s"
  tcp-conn-timeout = "15m0s"
  consistency-level = "one"
  separator = "."

###
### [collectd]
###
### Controls the listener for collectd data.
###

[collectd]
  enabled = false
  bind-address = ":25826"
  database = "collectd"
  retention-policy = ""
  batch-size = 5000
  batch-pending = 5
  batch-timeout = "10s"
  typesdb = "/usr/share/collectd/types.db"

###
### [opentsdb]
###
### Controls the listener for OpenTSDB data.
###

[opentsdb]
  enabled = false
  bind-address = ":4242"
  database = "opentsdb"
  retention-policy = ""
  consistency-level = "one"
  tls-enabled = false
  certificate = "/etc/ssl/influxdb.pem"
  batch-size = 1000
  batch-pending = 5
  batch-timeout = "1s"

###
### [monitor]
###

[monitor]
  store-enabled = false 
  store-database = "_internal"
  store-interval = "10s"

###
### [continuous_queries]
###
### Controls how continuous queries are run within InfluxDB.
###

[continuous_queries]
  enabled = true
  log-enabled = true
  recompute-previous-n = 2
  recompute-no-older-than = "10m0s"
  compute-runs-per-interval = 10
  compute-no-more-than = "2m0s"

###
### [hinted-handoff]
###
### Controls the hinted handoff feature, which allows nodes to temporarily
### store queued data when one node of a cluster is down for a short period
### of time.
###

[hinted-handoff]
  enabled = false 
  dir = "/data/influxdb/hh"
  max-size = 1073741824
  max-age = "168h0m0s"
  retry-rate-limit = 0
  retry-interval = "1s"

sebito91 commented 9 years ago

Also referencing this Issue: https://github.com/influxdb/influxdb/issues/4240

parisholley commented 9 years ago

I also ran into an issue writing into the default store. It appears that sharding only kicks in when a finite retention policy is active (the default is infinite). I ended up setting the default to 1 hour and using continuous queries to downsample into 1M groups for what I needed.
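
For anyone trying the same approach, here is a rough InfluxQL sketch of what that could look like (the policy and measurement names are made up for illustration, and "1M" is read here as 1-minute GROUP BY intervals):

-- keep raw points for only 1 hour (the new default policy)
CREATE RETENTION POLICY "raw" ON telegraf DURATION 1h REPLICATION 3 DEFAULT

-- keep the downsampled data longer
CREATE RETENTION POLICY "longterm" ON telegraf DURATION 90d REPLICATION 3

-- downsample into 1-minute means as the raw data arrives
CREATE CONTINUOUS QUERY "cq_cpu_1m" ON telegraf BEGIN
  SELECT MEAN(value) INTO "telegraf"."longterm"."cpu_mean_1m" FROM "cpu" GROUP BY time(1m)
END

The short default policy bounds the raw data so shards expire and rotate, while the continuous query keeps the 1-minute aggregates in the longer-lived policy.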

sebito91 commented 9 years ago

@parisholley did you allow telegraf to create the policy on its own, or did you manually adjust the params?

parisholley commented 9 years ago

@sebito91 I'm not using telegraf; the default InfluxDB behavior is to create the default retention policy when a new database is made. I ended up disabling that behavior (through the conf) and creating it myself. Make sure you run:

show shards

to validate that it is going to expire within the timeline you set; otherwise it will never preallocate or rotate.
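
In config terms, the behavior parisholley disabled is presumably the retention-autocreate flag that already appears in the [meta] section of the config above; a minimal sketch of the combination he describes (the 24h duration is just an example, matching the policy shown below):

[meta]
  # don't auto-create the infinite-duration "default" policy on new databases
  retention-autocreate = false

then create the policy by hand and check when its shards expire:

> CREATE RETENTION POLICY "telegraf" ON telegraf DURATION 24h REPLICATION 3 DEFAULT
> SHOW SHARDS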

sebito91 commented 9 years ago

@parisholley, thanks! We'll let the data run for a while to see if the shards actually get generated:

> show retention policies on telegraf
name        duration    replicaN    default
default     0       3       false
telegraf    24h0m0s     3       true

> show shards
name: telegraf
--------------
id  start_time      end_time        expiry_time     owners
1   2015-09-28T00:00:00Z    2015-10-05T00:00:00Z    2015-10-05T00:00:00Z    2,3,1
2   2015-10-01T18:00:00Z    2015-10-01T19:00:00Z    2015-10-02T19:00:00Z    3,1,2
3   2015-10-01T19:00:00Z    2015-10-01T20:00:00Z    2015-10-02T20:00:00Z    1,2,3

name: _internal
---------------
id  start_time      end_time        expiry_time     owners
4   2015-10-01T00:00:00Z    2015-10-02T00:00:00Z    2015-10-09T00:00:00Z    2
5   2015-10-01T00:00:00Z    2015-10-02T00:00:00Z    2015-10-09T00:00:00Z    3
6   2015-10-01T00:00:00Z    2015-10-02T00:00:00Z    2015-10-09T00:00:00Z    1

sebito91 commented 9 years ago

So with @parisholley's guidance, we do see shards getting created on the hour (likely based on our config). That being said, the other 4 questions still remain and throughput issues are still present.

otoolep commented 9 years ago

  1. Can we turn off the wal at all, buffering into memory/cache instead? We have incredibly powerful machines (it's not a hardware problem) that can handle this type of throughput provided we can get it at the disks...

This is not possible, and we are currently developing a new storage engine (which will be available in an experimental state very soon). Work on the new engine will be our priority.

  2. Does shard-precreation actually work? Doesn't seem to generate more than 1 shard, ever?
  3. To improve throughput does it make sense to tweak the LB to be sticky until a node dies? Will replication occur async to each of the cluster nodes?

Replication to other nodes is synchronous, but it does happen in parallel. I'm not quite sure what you're asking here.

  4. When can we enable shards across more than 3 nodes, this is a massive limitation at this point and we can only scale up so far?

I understand the issue you are getting at, and we have discussed it here. Can you open a ticket, as a feature request?

  5. If you want to comb logs, do you have drop box we can send them to? We're quite eager to get this working properly...

We are very aware that many people, when pushing the system, are experiencing issues with the current WAL. This is why you may be interested in the new engine, again experimental, which will be available soon. In the meantime you may like to switch back to the "b1" engine, as some people report better operation with it. You need to run a nightly or 0.9.4.2 to have control over the default engine.

https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L41
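
For reference, the engine selection the linked sample config exposes lives under [data]; something along these lines (a sketch only, so check the exact key name and accepted values against the sample config for the build you're actually running):

[data]
  dir = "/data/influxdb/metrics"
  wal-dir = "/data/influxdb/wal"
  # engine used for newly created shards; "b1" is the older engine
  # otoolep refers to, "bz1" the current WAL-based default
  engine = "b1"

As far as I understand, existing shards keep whatever engine they were created with, so the setting only takes effect for shards created after the change.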

sebito91 commented 9 years ago

To rephrase my question about the load balancer, how does a write effectively mark itself as complete if we enable replication across each of the three nodes?

As I understand it today:

  1. Open a socket for write on Node A
  2. Write metrics into wal on Node A
  3. Replicate metrics to each of Node B and Node C (maintaining initial connection)
  4. Once acks are received for the Node B + Node C copies, return ack to client
  5. Close connection

I'm curious because I wonder if we end up blocking writes to the WAL file until that sequence is done, for each and every connection that's made. Or does it follow these lines:

  1. Open a socket for write on Node A
  2. Write metrics into wal on Node A
  3. Close socket back to client on Node A
  4. Replicate metrics across to Node B + Node C

If it's the former, does it make sense to update the load balancer algorithm to something other than roundrobin (i.e. hash the destination and persist the mapping until a node dies)? If it's the latter, does it make sense to just have all writes go to one node and let raft handle the replication (i.e. no need for a load balancer at all)?

otoolep commented 9 years ago

It's the former, almost. The node receiving the data (the co-ordinator) writes to all nodes in parallel (one of which may be the node itself) and waits for all nodes to respond (or timeout). The co-ordinator doesn't do anything special like write to its WAL first. Each node gets the data and writes to its store.

If you want to spread load across the cluster, and your cluster is fully replicated, then pick a node at random for each write. Replication is handled for you whether you write to the same node forever, or jump around.
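
Tying that back to the config posted above: the "or timeout" is presumably governed by the [cluster] timeouts (write-timeout = "2s" in sebito91's config), so a heavily loaded cluster may want those loosened; the values below are illustrative, not a recommendation from this thread:

[cluster]
  # how long the coordinating node waits for every shard owner to
  # acknowledge a write before the request is failed
  write-timeout = "10s"
  # timeout for forwarding the write to a remote shard owner
  shard-writer-timeout = "10s"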