Oliver's gist contains:
@sysadminsemantics could you also provide (ideally on a node that's in the middle of causing trouble):
curl -o heap.txt "http://localhost:8086/debug/pprof/heap?debug=1"
influx -execute "show shards" > shards.txt
influx -execute "show stats" > stats.txt
influx -execute "show diagnostics" > diagnostics.txt
You can then drag and drop all the files as an issue comment. If you're concerned about exposing stats, shards or diagnostics then you can email those to my first name as it appears on my profile page @influxdb.com.
@sysadminsemantics I just realised that the profiles in the gist are from when the system is performing well? In that case we could also do with:
curl -o block.txt "http://localhost:8086/debug/pprof/block?debug=1"
curl -o goroutine.txt "http://localhost:8086/debug/pprof/goroutine?debug=1"
curl -o vars.txt "http://localhost:8086/debug/vars"
iostat -xd 1 30 > iostat.txt
when it's performing badly, too 😄
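For convenience, everything above can be grabbed in one go with a small script along these lines (a sketch; it only uses the endpoints and commands already listed, with timestamped output names so good-state and bad-state captures don't overwrite each other):

# capture profiles and diagnostics while the node is misbehaving
TS=$(date +%Y%m%d-%H%M%S)
curl -s -o "heap_${TS}.txt"      "http://localhost:8086/debug/pprof/heap?debug=1"
curl -s -o "block_${TS}.txt"     "http://localhost:8086/debug/pprof/block?debug=1"
curl -s -o "goroutine_${TS}.txt" "http://localhost:8086/debug/pprof/goroutine?debug=1"
curl -s -o "vars_${TS}.txt"      "http://localhost:8086/debug/vars"
influx -execute "show shards"      > "shards_${TS}.txt"
influx -execute "show stats"       > "stats_${TS}.txt"
influx -execute "show diagnostics" > "diagnostics_${TS}.txt"
iostat -xd 1 30 > "iostat_${TS}.txt"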
All nodes just started having issues out of the blue; no major event that triggered it.
block_sn1558_1.0.0-beta3.txt
diagnostics_sn1558_1.0.0-beta3.txt
goroutine_sn1558_1.0.0-beta3.txt
heap_sn1558_1.0.0-beta3.txt
iostat_sn1558_1.0.0-beta3.txt
shards_sn1558_1.0.0-beta3.txt
stats_sn1558_1.0.0-beta3.txt
vars_sn1558_1.0.0-beta3.txt
See graphs as well:
@sysadminsemantics Something looks off with your disks. You mentioned these disks were SSDs, but the iostat data looks like a spinning disk profile. Each disk is sustaining about 110 IOPS and is fully utilized. There are also many goroutines blocked on file IO related tasks.
Are you able to test without the RAID setup?
@jwilder thanks for the feedback.
I was able to test the SSDs in several scenarios, and I think the conclusion and moral of the story is something as stupidly simple as I/O queueing. We had completely overlooked that Jessie by default enables the CFQ scheduler for every single device; not ideal for IO-intensive scenarios.
Newer fio tests reveal:
40K IOPS per disk without even hitting 100% disk util, and this is us just testing random writes.
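For reference, the fio runs were along these lines (a sketch; the job parameters here are illustrative rather than the exact ones used, and /dev/sdX is a placeholder for one of the SSDs — this writes to the raw device, so only run it against a disk with no data on it):

# random 4k writes, direct IO, queue depth 32, 4 jobs, 60 seconds
fio --name=randwrite --filename=/dev/sdX --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting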
Under normal circumstances, the InfluxDB nodes seem to be doing around this:
That's only about 100 IOPS, but also only 13% io util.
Furthermore, it would seem that after changing the scheduler from CFQ to noop, the compaction lockup we experienced over the previous days no longer occurs:
I'll still be monitoring the situation closely, but I'm hoping that this simple scheduler change might have solved it.
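For anyone hitting the same thing, the scheduler change itself is just a sysfs write per device (a sketch; sdb is a placeholder device name, and you'll want a udev rule or the elevator=noop kernel parameter to make it persistent across reboots):

# the active scheduler is shown in brackets
cat /sys/block/sdb/queue/scheduler
# switch to noop at runtime (not persistent)
echo noop | sudo tee /sys/block/sdb/queue/scheduler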
@sysadminsemantics Interesting discovery. I'm going to close this out as it appears to be a system configuration issue. If you run into other issues, please log a new issue.
Thanks for all the detailed data as well. It really helps to understand what is going on.
I'm afraid it happened again over the weekend, so we cheered slightly too early. The strange thing is: when I run iostat while InfluxDB is in the "bad" state, the IOPS it reports are indeed very low (around the 100 mark, with 60 MB/sec throughput per drive):
If, in the meantime, while InfluxDB is still in this bad state, I run an fio test of my own and then run iostat again, the throughput increases.
Then, after an InfluxDB restart, the writes that InfluxDB does perform are faster than they were before; it seems like it is barely doing anything at all:
When the nodes are in this bad state, the full compaction time automatically climbs to crazy numbers, probably because it cannot complete?
I'm really interested in the full compaction topic here. Is there a way to run it manually, to see if we can reproduce the error? If not, what details can you provide on the thresholds and/or schedules that determine when it runs?
I'm going to reopen this issue since previous data may still be useful in diagnosing this. /cc @jwilder
Just an update: before going home last night, I removed both mirror drives from the RAID1+0 array on one node, to do some performance testing on that isolated InfluxDB node the next morning (mainly looking into chunk size / stripe alignment).
This morning when I came in, all nodes, with the exception of the node where I took out the mirrors, had the same issue as reported before, so software RAID is becoming more suspect.
At this point we're thinking of sharding the cluster and redoing the whole setup altogether (maybe with ZFS), and maybe also with influxdb_relay on them; or, on the other hand, actually giving the commercial clustered version a spin.
In the meantime, are there any other people here who have handled production InfluxDB loads with software RAID and/or ZFS?
More updates. This seems to happen systematically. I've started paying attention to what is actually happening in the logs as well, and I'm noticing that there isn't a file size difference after it is done compacting:
Case in point:
From the influxDB logs:
[tsm1] 2016/08/03 07:58:49 beginning full compaction of group 0, 2 TSM files
[tsm1] 2016/08/03 07:58:49 compacting full group (0) /mnt/influxdb/data/influxdb/default/73/000000478-000035229.tsm (#0)
[tsm1] 2016/08/03 07:58:49 compacting full group (0) /mnt/influxdb/data/influxdb/default/73/000000478-000035230.tsm (#1)
[tsm1] 2016/08/03 07:59:10 compacted level 1 group (0) into /mnt/influxdb/data/influxdb/default/167/000000494-000000002.tsm.tmp (#0)
[tsm1] 2016/08/03 07:59:10 compacted level 1 group 0 of 2 files into 1 files in 51.076741817s
looking at the /mnt/influxdb/data/influxdb/default/73/000000478-000035229.tsm file at the time of compaction, it is the following size:
-rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 07:55 000000478-000035229.tsm
the logfiles appear to mention it writing it into a "tmp" file, which is the exact same size:
-rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 07:59 000000478-000035231.tsm.tmp
and when it has finally finished compacting, and I'm assuming it overwrites the original TSM file with the tmp one, it is still the same size:
-rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 08:28 000000478-000035245.tsm
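(One easy way to keep an eye on this is to simply re-list the shard directory during a compaction cycle; the path below is taken from the log lines above:)

# refresh the listing every 10 seconds while a compaction is running
watch -n 10 'ls -l /mnt/influxdb/data/influxdb/default/73/'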
I can virtually reproduce this for any TSM file on the filesystem. Which makes me question:
what is compaction intended to do?
will influxDB always attempt to compact a file, even though it cannot be compacted any further?
what is the mechanism for compaction?
is there a way to schedule compaction?
So, to rule out the very poor filesystem performance on my nodes, I changed the filesystem on all of them to ZFS. I've created a mirrored pool, so effectively I am running the same thing as RAID1+0.
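For reference, the pool layout is along these lines (a sketch; the pool name and device names are placeholders, not the actual ones in use):

# two mirror vdevs striped together, i.e. the ZFS equivalent of RAID1+0
zpool create influxpool mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde
zfs set mountpoint=/mnt/influxdb influxpool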
So the IOPS are loads better than what they were before:
Please note that this IO spike, however, still only occurs when the node is doing a full compaction. Which brings me to my next point: once a node is in full compaction mode, it never seems to finish.
While there are nodes doing normal compactions at regular intervals, once a full compaction kicks in, the node keeps a shitload of compaction tmp files in the data directories and does not get rid of them unless I restart InfluxDB.
This has become less of an issue now: since the disks are no longer at 100% IO util during this spike, I actually have headroom for incoming metrics and don't see the "spikey" behavior I get when the disks are 100% utilised.
All of this, however, can hardly be deemed default behavior. I'm almost at the point of writing an event handler in my monitoring system, checking system / CPU / disk load and the number of compaction tmp files, to initiate a restart of InfluxDB.
In the meantime, what information can you provide about how compaction works? Is there any other, less hacky way of ensuring that this doesn't happen?
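For what it's worth, the tmp-file check itself is trivial (a sketch; the data path matches the one in the logs above and the threshold is arbitrary):

# count leftover compaction temp files under the data directory
TMP_COUNT=$(find /mnt/influxdb/data -name '*.tsm.tmp' | wc -l)
echo "compaction tmp files: ${TMP_COUNT}"
# a monitoring check could alert (or trigger a restart) above some threshold
if [ "${TMP_COUNT}" -gt 50 ]; then
    echo "too many compaction tmp files, consider restarting influxdb"
fi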
@sysadminsemantics The tmp files are created by the compaction process and installed when the compaction completes. Once a TSM file is written, it is never modified. The tmp files are newly compacted TSM files that are not ready to be used just yet. We have various compaction stages that move data from a write-optimized format into a read-optimized format with higher levels of compression. A full compaction produces an optimal format for querying.
If you are restarting the server in the middle of a compaction, the compactions will re-run the next time, but they will need to start over. A full compaction runs after a shard has been cold (i.e. no writes) for 24hrs by default. These don't run frequently because they can be very CPU intensive, and full compactions can take a long time due to a variety of factors. Other types of compactions run while the shard is accumulating new data.
In the issue description, you were expecting that there would be no IO or load spikes after writes have stopped. This is not the way the system works, since shards are compacted and then fully compacted after 24hrs. You can try lowering your [data].compact-full-write-cold-duration to 1h to kick the process off more aggressively. If a write comes in to a shard that is fully compacted, it will be re-compacted after the cold duration threshold is exceeded again. This can happen if you are writing data into the past or overwriting existing values. Doing either of those things makes compactions much more CPU intensive and slower.
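For example (a sketch; the config path and service name are assumptions based on a standard package install):

# the setting lives in the [data] section of /etc/influxdb/influxdb.conf:
#
#   [data]
#     compact-full-write-cold-duration = "1h"
#
# verify it and restart the service for it to take effect
grep -n 'compact-full-write-cold-duration' /etc/influxdb/influxdb.conf
sudo systemctl restart influxdb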
I'll try to answer some of your earlier questions:
looking at the /mnt/influxdb/data/influxdb/default/73/000000478-000035229.tsm file at the time of compaction, it is the following size: -rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 07:55 000000478-000035229.tsm
the logfiles appear to mention it writing it into a "tmp" file, which is the exact same size: -rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 07:59 000000478-000035231.tsm.tmp
and when it has finally done compacting, and i'm assuming it overwrites the original tsm1-file with the tmp one, it is still the same size: -rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 08:28 000000478-000035245.tsm
This is a different file than the one that was being compacted. 000000478-000035231.tsm.tmp would have produced 000000478-000035231.tsm when completed. The sizes are all 2GB because that is the maximum size of a TSM file before we split it into a second file. The contents of each of these tmp files would be the combination of the files used for the compaction. Blocks within the file would be more dense, series data would be sorted and have better locality, the index would be better organized, and deleted data would be removed.
I can virtually reproduce this for any tsm-file on the filesystem. Which makes me question:
what is compaction intended to do?
Compactions convert series data from a write-optimized, less compact format to a read-optimized, compressed format. They also expunge deleted data and help keep larger shards quick to load.
will influxDB always attempt to compact a file, even though it cannot be compacted any further?
No. Once a shard is fully compacted and no new writes to the shard occur, it will not be re-compacted.
what is the mechanism for compaction?
I don't understand the question.
is there a way to schedule compaction?
No, they happen automatically to keep the shard both queryable and writeable. You can lower some of the thresholds used to trigger snapshot and full compactions, but I would not recommend changing the defaults without a good reason.
Should be fixed via #7258
Bug report
System info:
4 of these (non-clustered, just mirroring all metrics data x4 via telegraf; might shard or truly cluster these at some point):
ProLiant XL170r Gen9
Intel(R) Xeon(R) CPU E5-2660 (40 cores hyperthreaded)
256GB of RAM
5 x 960GB SSDs (HP MK0960GECQK, apparently these are refurbished Samsungs), out of which:
Steps to reproduce:
Expected behavior:
Actual behavior:
I've attached a few screenshots of my InfluxDB instance metrics, as well as the reports per node (both during and after the occurrence). Also, separate bug: did we remove UDP metrics from _internal?
The gist for the bugreport is here: https://gist.github.com/sysadminsemantics/18a9692acfeb417deba5b2ab7f67e825
You'll probably need more info, but if you have any pointers I could work with in the meantime, that would be greatly appreciated. I forgot to mention (though it should be apparent in the graphs) that the bad state occurs around 12:30.