Oliver's gist contains:
@sysadminsemantics could you also provide (ideally on a node that's in the middle of causing trouble):
curl -o heap.txt "http://localhost:8086/debug/pprof/heap?debug=1"
influx -execute "show shards" > shards.txt
influx -execute "show stats" > stats.txt
influx -execute "show diagnostics" > diagnostics.txt
You can then drag and drop all the files as an issue comment. If you're concerned about exposing stats, shards or diagnostics then you can email those to my first name as it appears on my profile page @influxdb.com.
@sysadminsemantics I just realised that the profiles in the gist are from when the system is performing well? In that case we could also do with:
curl -o block.txt "http://localhost:8086/debug/pprof/block?debug=1"
curl -o goroutine.txt "http://localhost:8086/debug/pprof/goroutine?debug=1"
curl -o vars.txt "http://localhost:8086/debug/vars"
iostat -xd 1 30 > iostat.txt
when it's performing badly, too 😄
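For convenience, everything above can be grabbed in one go with a small script along these lines (a sketch; it only uses the endpoints and commands already listed, with timestamped output names so good-state and bad-state captures don't overwrite each other):

# capture profiles and diagnostics while the node is misbehaving
TS=$(date +%Y%m%d-%H%M%S)
curl -s -o "heap_${TS}.txt"      "http://localhost:8086/debug/pprof/heap?debug=1"
curl -s -o "block_${TS}.txt"     "http://localhost:8086/debug/pprof/block?debug=1"
curl -s -o "goroutine_${TS}.txt" "http://localhost:8086/debug/pprof/goroutine?debug=1"
curl -s -o "vars_${TS}.txt"      "http://localhost:8086/debug/vars"
influx -execute "show shards"      > "shards_${TS}.txt"
influx -execute "show stats"       > "stats_${TS}.txt"
influx -execute "show diagnostics" > "diagnostics_${TS}.txt"
iostat -xd 1 30 > "iostat_${TS}.txt"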
All nodes just started having issues out of the blue; no major event that triggered it.
block_sn1558_1.0.0-beta3.txt
diagnostics_sn1558_1.0.0-beta3.txt
goroutine_sn1558_1.0.0-beta3.txt
heap_sn1558_1.0.0-beta3.txt
iostat_sn1558_1.0.0-beta3.txt
shards_sn1558_1.0.0-beta3.txt
stats_sn1558_1.0.0-beta3.txt
vars_sn1558_1.0.0-beta3.txt
See graphs as well:
@sysadminsemantics Something looks off with your disks. You mentioned these disks were SSDs, but the iostat data looks like a spinning disk profile. Each disk is sustaining about 110 IOPS and is fully utilized. There are also many goroutines blocked on file IO related tasks.
Are you able to test without the RAID setup?
@jwilder thanks for the feedback.
I was able to test the SSDs in several scenarios, and I think the conclusion and moral of the story is something as stupidly simple as I/O queueing. We had completely overlooked that Jessie by default enables the CFQ scheduler for every single device; not ideal for IO-intensive scenarios.
Newer fio tests reveal:
40K IOPS per disk without even hitting 100% disk util, and this is us just testing random writes.
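For reference, the fio runs were along these lines (a sketch; the job parameters here are illustrative rather than the exact ones used, and /dev/sdX is a placeholder for one of the SSDs — this writes to the raw device, so only run it against a disk with no data on it):

# random 4k writes, direct IO, queue depth 32, 4 jobs, 60 seconds
fio --name=randwrite --filename=/dev/sdX --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting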
Under normal circumstances, the InfluxDB nodes seem to be doing around this:
That's only about 100 IOPS, but also only 13% io util.
Furthermore, it would seem that after changing the scheduler from CFQ to noop, the compaction lockup we experienced over the previous days no longer occurs:
I'll still be monitoring the situation closely, but I'm hoping that this simple scheduler change might have solved it.
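For anyone hitting the same thing, the scheduler change itself is just a sysfs write per device (a sketch; sdb is a placeholder device name, and you'll want a udev rule or the elevator=noop kernel parameter to make it persistent across reboots):

# the active scheduler is shown in brackets
cat /sys/block/sdb/queue/scheduler
# switch to noop at runtime (not persistent)
echo noop | sudo tee /sys/block/sdb/queue/scheduler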
@sysadminsemantics Interesting discovery. I'm going to close this out as it appears to be a system configuration issue. If you run into other issues, please log a new issue.
Thanks for all the detailed data as well. It really helps to understand what is going on.
I'm afraid it happened again over the weekend, so we cheered slightly too early. The strange thing is: when I run iostat while InfluxDB is in the "bad" state, the IOPS it reports are indeed very low (around the 100 mark, with 60 MB/sec throughput per drive):
If, in the meantime, while InfluxDB is still in this bad state, I run an fio test of my own and then run iostat again, the throughput increases.
Then, after an InfluxDB restart, the writes that InfluxDB does perform are faster than they were before; it seems like it is barely doing anything at all:
When the nodes are in this bad state, the full compaction time automatically climbs to crazy numbers, probably because it cannot complete?
I'm really interested in the full compaction topic here. Is there a way to run it manually, to see if we can reproduce the error? If not, what details can you provide on the thresholds and/or schedules that determine when it runs?
I'm going to reopen this issue since previous data may still be useful in diagnosing this. /cc @jwilder
Just an update: before going home last night, I removed both mirror drives from the RAID1+0 array on one node, to do some performance testing on that isolated InfluxDB node the next morning (mainly looking into chunk size / stripe alignment).
This morning when I came in, all nodes, with the exception of the node where I took out the mirrors, had the same issue as reported before, so software RAID is becoming more suspect.
At this point we're thinking of sharding the cluster and redoing the whole setup altogether (maybe with ZFS), and maybe also with influxdb_relay on them; or, on the other hand, actually giving the commercial clustered version a spin.
In the meantime, are there any other people here who have handled production InfluxDB loads with software RAID and/or ZFS?
More updates. This seems to happen systematically. I've started paying attention to what is actually happening in the logs as well, and I'm noticing that there isn't a file size difference after it is done compacting:
Case in point:
From the influxDB logs:
[tsm1] 2016/08/03 07:58:49 beginning full compaction of group 0, 2 TSM files
[tsm1] 2016/08/03 07:58:49 compacting full group (0) /mnt/influxdb/data/influxdb/default/73/000000478-000035229.tsm (#0)
[tsm1] 2016/08/03 07:58:49 compacting full group (0) /mnt/influxdb/data/influxdb/default/73/000000478-000035230.tsm (#1)
[tsm1] 2016/08/03 07:59:10 compacted level 1 group (0) into /mnt/influxdb/data/influxdb/default/167/000000494-000000002.tsm.tmp (#0)
[tsm1] 2016/08/03 07:59:10 compacted level 1 group 0 of 2 files into 1 files in 51.076741817s
looking at the /mnt/influxdb/data/influxdb/default/73/000000478-000035229.tsm file at the time of compaction, it is the following size:
-rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 07:55 000000478-000035229.tsm
the logfiles appear to mention it writing it into a "tmp" file, which is the exact same size:
-rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 07:59 000000478-000035231.tsm.tmp
and when it has finally finished compacting, and I'm assuming it overwrites the original TSM file with the tmp one, it is still the same size:
-rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 08:28 000000478-000035245.tsm
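(One easy way to keep an eye on this is to simply re-list the shard directory during a compaction cycle; the path below is taken from the log lines above:)

# refresh the listing every 10 seconds while a compaction is running
watch -n 10 'ls -l /mnt/influxdb/data/influxdb/default/73/'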
I can virtually reproduce this for any TSM file on the filesystem. Which makes me question:
what is compaction intended to do?
will influxDB always attempt to compact a file, even though it cannot be compacted any further?
what is the mechanism for compaction?
is there a way to schedule compaction?
So, to rule out the very poor filesystem performance on my nodes, I changed the filesystem on all of them to ZFS. I've created a mirrored pool, so effectively I am running the same thing as RAID1+0.
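For reference, the pool layout is along these lines (a sketch; the pool name and device names are placeholders, not the actual ones in use):

# two mirror vdevs striped together, i.e. the ZFS equivalent of RAID1+0
zpool create influxpool mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde
zfs set mountpoint=/mnt/influxdb influxpool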
So the IOPS are loads better than what they were before:
Please note that this IO spike, however, still only occurs when the node is doing a full compaction. Which brings me to my next point: once a node is in full compaction mode, it never seems to finish.
While there are nodes doing normal compactions at regular intervals, once a full compaction kicks in, the node keeps a shitload of compaction tmp files in the data directories and does not get rid of them unless I restart InfluxDB.
This has become less of an issue now: since the disks are no longer at 100% IO util during this spike, I actually have headroom for incoming metrics and don't see the "spikey" behavior I get when the disks are 100% utilised.
All of this, however, can hardly be deemed default behavior. I'm almost at the point of writing an event handler in my monitoring system, checking system / CPU / disk load and the number of compaction tmp files, to initiate a restart of InfluxDB.
In the meantime, what information can you provide about how compaction works? Is there any other, less hacky way of ensuring that this doesn't happen?
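For what it's worth, the tmp-file check itself is trivial (a sketch; the data path matches the one in the logs above and the threshold is arbitrary):

# count leftover compaction temp files under the data directory
TMP_COUNT=$(find /mnt/influxdb/data -name '*.tsm.tmp' | wc -l)
echo "compaction tmp files: ${TMP_COUNT}"
# a monitoring check could alert (or trigger a restart) above some threshold
if [ "${TMP_COUNT}" -gt 50 ]; then
    echo "too many compaction tmp files, consider restarting influxdb"
fi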
@sysadminsemantics The tmp files are created by the compaction process and installed when the compaction completes. Once a TSM file is written, it is never modified. The tmp files are newly compacted TSM files that are not ready to be used just yet. We have various compaction stages that move data from a write-optimized format into a read-optimized format with higher levels of compression. A full compaction produces an optimal format for querying.
If you are restarting the server in the middle of a compaction, the compactions will re-run the next time, but they will need to start over. A full compaction runs after a shard has been cold (i.e. no writes) for 24hrs by default. These don't run frequently because they can be very CPU intensive, and full compactions can take a long time due to a variety of factors. Other types of compactions run while the shard is accumulating new data.
In the issue description, you were expecting that there would be no IO or load spikes after writes have stopped. This is not the way the system works, since shards are compacted and then fully compacted after 24hrs. You can try lowering your [data].compact-full-write-cold-duration to 1h to kick the process off more aggressively. If a write comes in to a shard that is fully compacted, it will be re-compacted after the cold duration threshold is exceeded again. This can happen if you are writing data into the past or overwriting existing values. Doing either of those things makes compactions much more CPU intensive and slower.
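For example (a sketch; the config path and service name are assumptions based on a standard package install):

# the setting lives in the [data] section of /etc/influxdb/influxdb.conf:
#
#   [data]
#     compact-full-write-cold-duration = "1h"
#
# verify it and restart the service for it to take effect
grep -n 'compact-full-write-cold-duration' /etc/influxdb/influxdb.conf
sudo systemctl restart influxdb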
I'll try to answer some of your earlier questions:
looking at the /mnt/influxdb/data/influxdb/default/73/000000478-000035229.tsm file at the time of compaction, it is the following size: -rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 07:55 000000478-000035229.tsm
the logfiles appear to mention it writing it into a "tmp" file, which is the exact same size: -rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 07:59 000000478-000035231.tsm.tmp
and when it has finally done compacting, and i'm assuming it overwrites the original tsm1-file with the tmp one, it is still the same size: -rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 08:28 000000478-000035245.tsm
This is a different file than the one that was being compacted. 000000478-000035231.tsm.tmp would have produced 000000478-000035231.tsm when completed. The sizes are all 2GB because that is the maximum size of a TSM file before we split it into a second file. The contents of each of these tmp files would be the combination of the files used for the compaction. Blocks within the file would be more dense, series data would be sorted and have better locality, the index would be better organized, and deleted data would be removed.
I can virtually reproduce this for any tsm-file on the filesystem. Which makes me question:
what is compaction intended to do?
Compactions convert series data from a write-optimized, less compact format to a read-optimized, compressed format. They also expunge deleted data and help keep larger shards quick to load.
will influxDB always attempt to compact a file, even though it cannot be compacted any further?
No. Once a shard is fully compacted and no new writes to the shard occur, it will not be re-compacted.
what is the mechanism for compaction?
I don't understand the question.
is there a way to schedule compaction?
No, they happen automatically to keep the shard both queryable and writeable. You can lower some of the thresholds used to trigger snapshot and full compactions, but I would not recommend changing the defaults without a good reason.
Should be fixed via #7258
Bug report
System info:
4 of these (non-clustered, just mirroring all metrics data x4 via telegraf; might shard or truly cluster these at some point):
ProLiant XL170r Gen9
Intel(R) Xeon(R) CPU E5-2660 (40 cores hyperthreaded)
256GB of RAM
5 x 960GB SSDs (HP MK0960GECQK, apparently these are refurbished Samsungs), out of which:
Steps to reproduce:
Expected behavior:
Actual behavior:
I've attached a few screenshots of my InfluxDB instance metrics, as well as the reports per node (both during and after the occurrence). Also, separate bug: did we remove UDP metrics from _internal?
The gist for the bugreport is here: https://gist.github.com/sysadminsemantics/18a9692acfeb417deba5b2ab7f67e825
You'll probably need more info, but if you have any pointers I could work with in the meantime, that would be greatly appreciated. I forgot to mention (though it should be apparent in the graphs) that the bad state occurs around 12:30.