graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0

Performance Issue: Which is the best carbon/whisper/system config to reduce disk write IOPS? #553

Closed: toni-moreno closed this issue 8 years ago

toni-moreno commented 8 years ago

Hi, we have a Graphite/Carbon/Whisper box (a VM on top of VMware ESX, 8 cores, 16 GB RAM, with the ESX host attached to a Hitachi disk array). We are currently receiving about 250K metrics/minute.

We have no option to get separate physical machines or SSD disks, and the storage administrators have noticed issues on other servers on the same Hitachi disk array because of the large number of write IOPS we are sending to it.

As you can see in the next picture, we are sending 9K IOPS.

[screenshot: write IOPS sent to the disk array]

We also have aggregation on the whisper files with this default resolution/retention:

[default]
pattern = .*
retentions = 60s:15d,5m:90d,1h:410d

In the past we reduced read IOPS by caching data in memory, as described in https://github.com/graphite-project/carbon/issues/497, and we have 70% of memory holding cached data, as you can see in the next picture.

[screenshot: memory usage showing cached data]

Now we need to enable some Carbon/Whisper/system configuration to also cache write data and reduce the number of write IOPS on the disk array.

Can anybody help us, please? Any ideas?

genisd commented 8 years ago

AFAIK there is no way to reduce write IOPS, aside from sending fewer metrics or keeping a lower granularity.

toni-moreno commented 8 years ago

Hi @bmhatfield @mleinart any suggestion on this issue?

bmhatfield commented 8 years ago

In your specific configuration, you can reduce write IOPS, but it requires a bit of a trade-off. Each Whisper retention requires a fair bit of IO, which you can cut down by simplifying your policy to 2 or even 1 retention, instead of 3. I forget who said it (perhaps @mleinart), but the advice I have heard is to aim for an upper bound of 2 retention rates.

Note that you will have to resize your existing metrics in addition to changing the config to realize this value.

An example reduction in your configuration might be to change

60s:15d,5m:90d,1h:410d

to

60s:60d,1h:2y
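
For illustration, a rough sketch of what that change could look like, assuming the default /opt/graphite layout and the whisper-resize.py utility that ships with whisper (the retentions and paths are examples, not a recommendation):

[default]
pattern = .*
retentions = 60s:60d,1h:2y

New metrics pick up the schema automatically; existing .wsp files keep their old layout until resized, e.g.:

# rewrites each file in place; try it on a copy of a few files first
find /opt/graphite/storage/whisper -name '*.wsp' \
    -exec whisper-resize.py {} 60s:60d 1h:2y \;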

toni-moreno commented 8 years ago

Thank you for the fast answer, @bmhatfield.

Isn't there any way to force the system to do IO with bigger chunks of data on each IO operation?

Really nothing?

bmhatfield commented 8 years ago

I don't have anything I can offer you off the top of my head. If you decide to experiment and discover something meaningful, we'd LOVE to hear about it and get it merged in.

mleinart commented 8 years ago

The main configuration knob you can fiddle with is MAX_UPDATES_PER_SECOND (SAN thrashing is actually what this setting and carbon-cache were originally written for). Reducing this will throttle writes to disk and force more points per metric to be cached. If the Hitachi isn't liking tons of small random writes, it may behave better when writes are allowed to batch up more. Keep an eye on pointsPerUpdate as well as cache.size as you reduce it. The big downside with doing this is that more data is cached in carbon and subject to be lost if you lose the VM or something.
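
For reference, a minimal sketch of where this knob lives - the [cache] section of carbon.conf - with a value that is only a starting point for experimentation, not a recommendation:

[cache]
# Throttle whisper updates so each write batches more datapoints.
# Watch carbon's pointsPerUpdate and cache.size metrics while tuning this down.
MAX_UPDATES_PER_SECOND = 1000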

You might also look into the vm.dirty_ratio and vm.dirty_background_ratio kernel settings to tune the page cache. My instinct would be to reduce the kernel's caching and increase carbon's (as done above), though I don't have any direct experience doing that.
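
If you experiment with that, these are the standard sysctl knobs; a sketch with purely illustrative values (the right numbers depend on RAM and workload):

# /etc/sysctl.conf -- example values only
# Lower ratios make the kernel start writeback sooner and hold fewer dirty pages.
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

# apply without a reboot
sysctl -p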

I don't think I'd recommend it in this scenario, but if you wanted to bypass the kernel cache entirely (as in the #535 pull you cite) you could set WHISPER_AUTOFLUSH to true. This causes a flush() to be called after every write. If you do choose to try this, I'd only do it after getting your pointsPerUpdate nice and high, as you could thrash the storage array even more without that.
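
That flag also lives in carbon.conf; a one-line sketch, only worth trying once pointsPerUpdate is already high:

[cache]
# Call flush() after every whisper write instead of letting data sit in the kernel cache.
WHISPER_AUTOFLUSH = True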

hope this helps!

deniszh commented 8 years ago

Hi @toni-moreno, We're still using a SAN on one of our clusters, and I can just confirm what @mleinart said - tune your MAX_UPDATES_PER_SECOND, and you can also try setting WHISPER_AUTOFLUSH to true.

toni-moreno commented 8 years ago

Hi @deniszh @mleinart, I've been doing a stress test with this tool (https://github.com/feangulo/graphite-stresser) on a testing server with the exact same infrastructure as the production servers.

We have done two tests with 86K metrics.

The first, with MAX_UPDATES_PER_SECOND=50000 (effectively no limit): Graphite behaved fine, sending 3K IOPS to the Hitachi array.

When configured with MAX_UPDATES_PER_SECOND=1000, Graphite's behavior was erratic.

I've tested data availability with this script (https://gist.github.com/toni-moreno/fa174cc1bb38f7178afa09305f3c5397), which makes a lot of HTTP requests for metrics loaded from a metric list file and logs the response time and how many nulls (unavailable data points) appear at the end of each time series (1 data point/minute).
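
The gist itself is not reproduced here, but the kind of check it performs is roughly the following; a hypothetical sketch against the standard /render JSON API (the graphite-web URL, port, and metric-list filename are assumptions):

import json
import time
import urllib2

GRAPHITE = "http://localhost:8080"  # assumed graphite-web address

def trailing_nulls(target, time_range="-12h"):
    # Fetch one series and count the null datapoints at its tail.
    url = "%s/render?target=%s&from=%s&format=json" % (GRAPHITE, target, time_range)
    start = time.time()
    series = json.load(urllib2.urlopen(url))
    elapsed = time.time() - start
    nulls = 0
    if series:
        for value, _timestamp in reversed(series[0]["datapoints"]):
            if value is not None:
                break
            nulls += 1
    return nulls, elapsed

if __name__ == "__main__":
    with open("metrics.list") as f:  # one metric name per line
        for line in f:
            metric = line.strip()
            if metric:
                nulls, elapsed = trailing_nulls(metric)
                print "METRIC: %s nulls: %d elapsed time: %f" % (metric, nulls, elapsed)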

As you can see below, there are a lot of metrics missing data from between 20 and 60 minutes ago. It seems like the merge of data stored on disk with data queued in memory is not working properly.

Could this perhaps be a Graphite bug?

[ Thu Jun  2 19:01:55 2016 ] OK METRIC:  STRESS.host.ip-57.com.graphite.stresser.d.m15_rate 720 nulls: 0 elapsed time:  0.0688331127167
[ Thu Jun  2 19:01:55 2016 ] OK METRIC:  STRESS.host.ip-88.com.graphite.stresser.cadb.m1_rate 720 nulls: 6 elapsed time:  0.0508050918579
[ Thu Jun  2 19:01:55 2016 ] OK METRIC:  STRESS.host.ip-82.com.graphite.stresser.bcda.stddev 720 nulls: 0 elapsed time:  0.0655510425568
[ Thu Jun  2 19:01:56 2016 ] OK METRIC:  STRESS.host.ip-62.com.graphite.stresser.dab.p95 720 nulls: 0 elapsed time:  0.0328180789948
[ Thu Jun  2 19:01:56 2016 ] OK METRIC:  STRESS.host.ip-60.com.graphite.stresser.abcd.min 720 nulls: 49 elapsed time:  0.199838161469
[ Thu Jun  2 19:01:57 2016 ] OK METRIC:  STRESS.host.ip-16.com.graphite.stresser.bc.p99 720 nulls: 0 elapsed time:  0.030189037323
[ Thu Jun  2 19:01:57 2016 ] OK METRIC:  STRESS.host.ip-74.com.graphite.stresser.a.min 720 nulls: 0 elapsed time:  0.0377690792084
[ Thu Jun  2 19:01:57 2016 ] OK METRIC:  STRESS.host.ip-78.com.graphite.stresser.cda.m5_rate 720 nulls: 0 elapsed time:  0.0334861278534
[ Thu Jun  2 19:01:58 2016 ] OK METRIC:  STRESS.host.ip-1.com.graphite.stresser.cad.m1_rate 720 nulls: 32 elapsed time:  0.0600860118866
[ Thu Jun  2 19:01:58 2016 ] OK METRIC:  STRESS.host.ip-53.com.graphite.stresser.ca.min 720 nulls: 0 elapsed time:  0.0442810058594
[ Thu Jun  2 19:01:59 2016 ] OK METRIC:  STRESS.host.ip-59.com.graphite.stresser.dba.max 720 nulls: 0 elapsed time:  0.0432779788971
[ Thu Jun  2 19:01:59 2016 ] OK METRIC:  STRESS.host.ip-17.com.graphite.stresser.cdab.p98 720 nulls: 14 elapsed time:  0.110908031464
[ Thu Jun  2 19:01:59 2016 ] OK METRIC:  STRESS.host.ip-28.com.graphite.stresser.dcb.p999 720 nulls: 31 elapsed time:  0.0806908607483
[ Thu Jun  2 19:02:00 2016 ] OK METRIC:  STRESS.host.ip-48.com.graphite.stresser.dacb.m1_rate 720 nulls: 6 elapsed time:  0.217912197113
[ Thu Jun  2 19:02:00 2016 ] OK METRIC:  STRESS.host.ip-49.com.graphite.stresser.abdc.stddev 720 nulls: 1 elapsed time:  0.0250878334045
[ Thu Jun  2 19:02:01 2016 ] OK METRIC:  STRESS.host.ip-87.com.graphite.stresser.cdba.mean_rate 720 nulls: 19 elapsed time:  0.0460441112518
[ Thu Jun  2 19:02:01 2016 ] OK METRIC:  STRESS.host.ip-56.com.graphite.stresser.bacd.min 720 nulls: 15 elapsed time:  0.0483469963074
[ Thu Jun  2 19:02:01 2016 ] OK METRIC:  STRESS.host.ip-28.com.graphite.stresser.dab.stddev 720 nulls: 32 elapsed time:  0.0925581455231
[ Thu Jun  2 19:02:02 2016 ] OK METRIC:  STRESS.host.ip-49.com.graphite.stresser.abc.mean 720 nulls: 1 elapsed time:  0.0342679023743
[ Thu Jun  2 19:02:02 2016 ] OK METRIC:  STRESS.host.ip-50.com.graphite.stresser.bad.min 720 nulls: 19 elapsed time:  0.0319290161133
[ Thu Jun  2 19:02:03 2016 ] OK METRIC:  STRESS.host.ip-4.com.graphite.stresser.ac.p98 720 nulls: 1 elapsed time:  0.0867421627045
[ Thu Jun  2 19:02:03 2016 ] OK METRIC:  STRESS.host.ip-15.com.graphite.stresser.adbc.mean_rate 720 nulls: 47 elapsed time:  0.066370010376
[ Thu Jun  2 19:02:03 2016 ] OK METRIC:  STRESS.host.ip-42.com.graphite.stresser.cb.p50 720 nulls: 1 elapsed time:  0.0299911499023
[ Thu Jun  2 19:02:04 2016 ] OK METRIC:  STRESS.host.ip-79.com.graphite.stresser.bdc.mean 720 nulls: 1 elapsed time:  0.127055883408
[ Thu Jun  2 19:02:04 2016 ] OK METRIC:  STRESS.host.ip-27.com.graphite.stresser.dacb.max 720 nulls: 1 elapsed time:  0.0695948600769
[ Thu Jun  2 19:02:05 2016 ] OK METRIC:  STRESS.host.ip-82.com.graphite.stresser.d.p75 720 nulls: 1 elapsed time:  0.0412969589233
[ Thu Jun  2 19:02:05 2016 ] OK METRIC:  STRESS.host.ip-48.com.graphite.stresser.abdc.p98 720 nulls: 14 elapsed time:  0.0475289821625
[ Thu Jun  2 19:02:05 2016 ] OK METRIC:  STRESS.host.ip-71.com.graphite.stresser.dca.m15_rate 720 nulls: 1 elapsed time:  0.0548729896545
[ Thu Jun  2 19:02:06 2016 ] OK METRIC:  STRESS.host.ip-48.com.graphite.stresser.dbac.mean 720 nulls: 1 elapsed time:  0.0398018360138
[ Thu Jun  2 19:02:06 2016 ] OK METRIC:  STRESS.host.ip-89.com.graphite.stresser.cabd.m15_rate 720 nulls: 1 elapsed time:  0.11070394516
[ Thu Jun  2 19:02:07 2016 ] OK METRIC:  STRESS.host.ip-70.com.graphite.stresser.dabc.count 720 nulls: 1 elapsed time:  0.0314590930939
[ Thu Jun  2 19:02:07 2016 ] OK METRIC:  STRESS.host.ip-56.com.graphite.stresser.adc.stddev 720 nulls: 1 elapsed time:  0.039439201355
[ Thu Jun  2 19:02:07 2016 ] OK METRIC:  STRESS.host.ip-38.com.graphite.stresser.dcab.m5_rate 720 nulls: 23 elapsed time:  0.0327999591827
[ Thu Jun  2 19:02:08 2016 ] OK METRIC:  STRESS.host.ip-77.com.graphite.stresser.abc.count 720 nulls: 29 elapsed time:  0.136918783188
[ Thu Jun  2 19:02:08 2016 ] OK METRIC:  STRESS.host.ip-82.com.graphite.stresser.bacd.stddev 720 nulls: 1 elapsed time:  0.0634460449219
[ Thu Jun  2 19:02:09 2016 ] OK METRIC:  STRESS.host.ip-48.com.graphite.stresser.cbda.min 720 nulls: 1 elapsed time:  0.0412521362305
[ Thu Jun  2 19:02:09 2016 ] OK METRIC:  STRESS.host.ip-16.com.graphite.stresser.c.count 720 nulls: 44 elapsed time:  0.0268151760101
[ Thu Jun  2 19:02:09 2016 ] OK METRIC:  STRESS.host.ip-57.com.graphite.stresser.dbca.p999 720 nulls: 1 elapsed time:  0.0261521339417
[ Thu Jun  2 19:02:10 2016 ] OK METRIC:  STRESS.host.ip-81.com.graphite.stresser.bdca.mean 720 nulls: 1 elapsed time:  0.0479469299316
[ Thu Jun  2 19:02:10 2016 ] OK METRIC:  STRESS.host.ip-52.com.graphite.stresser.bd.max 720 nulls: 35 elapsed time:  0.0628139972687
[ Thu Jun  2 19:02:10 2016 ] OK METRIC:  STRESS.host.ip-84.com.graphite.stresser.dacb.p75 720 nulls: 1 elapsed time:  0.0726640224457
[ Thu Jun  2 19:02:11 2016 ] OK METRIC:  STRESS.host.ip-52.com.graphite.stresser.da.stddev 720 nulls: 16 elapsed time:  0.0722517967224
[ Thu Jun  2 19:02:11 2016 ] OK METRIC:  STRESS.host.ip-71.com.graphite.stresser.ad.m15_rate 720 nulls: 53 elapsed time:  0.0792770385742
[ Thu Jun  2 19:02:12 2016 ] OK METRIC:  STRESS.host.ip-33.com.graphite.stresser.adcb.min 720 nulls: 11 elapsed time:  0.0649588108063
[ Thu Jun  2 19:02:12 2016 ] OK METRIC:  STRESS.host.ip-47.com.graphite.stresser.dac.max 720 nulls: 22 elapsed time:  0.042662858963
[ Thu Jun  2 19:02:12 2016 ] OK METRIC:  STRESS.host.ip-11.com.graphite.stresser.bc.p75 720 nulls: 57 elapsed time:  0.066419839859
[ Thu Jun  2 19:02:13 2016 ] OK METRIC:  STRESS.host.ip-39.com.graphite.stresser.dc.p99 720 nulls: 0 elapsed time:  0.0433480739594
toni-moreno commented 8 years ago

@mleinart, @deniszh, I've been having some issues with these exact versions (installed from the git master branches last June):

GRAPHITE-WEB

commit 67e463e1efa85b8c5cf022f9abffa3d739175d1e
Merge: ec22fe2 34b223a
Author: Jeff Schroeder <jeffschroeder@computer.org>
Date:   Mon Jun 15 17:40:11 2015 -0500

    Merge pull request #1250 from SEJeff/fix-django18

    Make 'pip install -r requirements.txt' work again

CARBON

commit b80ce915a5e420b46e6972512801491e536db1b6
Merge: 94d9f18 1003df1
Author: Jeff Schroeder <jeffschroeder@computer.org>
Date:   Fri Apr 24 15:04:12 2015 -0500

    Merge pull request #409 from mleinart/aggregator_buffer_tests

    New tests for carbon aggregator buffers

WHISPER

commit 1e96c0cd1dc0b361177c585033cfbbb5711a191f
Merge: 75e35fd bbd37c5
Author: Jeff Schroeder <jeffschroeder@computer.org>
Date:   Wed Jun 24 01:36:09 2015 -0400

    Merge pull request #66 from acdha/patch-1

    Simple script to find corrupt Whisper files

I'm thinking of repeating this test after updating to some newer/stable version.

Which version is most suitable for storage arrays and works with Python 2.6.6 (RHEL 6.7)?

toni-moreno commented 8 years ago

Here are some graphs with performance data from the stress test done yesterday (from 17:00 until 20:00).

While the stresser was writing data, we launched the HTTP tester script (from 18:00 to 19:00). In this time we made 8763 requests, and 3478 of them had more than 3 nulls (more than 3 minutes of delay).

[screenshots: performance graphs from the stress test]

The average delay on these "bad" requests is 25 minutes.

deniszh commented 8 years ago

@toni-moreno, Sorry, but I've completely lost your point - what are you trying to do or prove here? Graphite is a complex system, and like any complex system, it can degrade in many strange or bizarre ways. My cluster is working fine, but if I put 10x more load on it, it will die horribly.

toni-moreno commented 8 years ago

Hi @deniszh, and sorry for my poor English.

As I said (https://github.com/graphite-project/carbon/issues/553#issue-152966004), we need to limit write IOPS on the underlying storage.

I will do stress tests with different configurations and Graphite versions to evaluate the best way to decrease IOPS. But I also need all the data to be available online.

We consider a request OK if it contains all data from any point in the past up to at least 3 minutes ago.

With the script (https://gist.github.com/toni-moreno/fa174cc1bb38f7178afa09305f3c5397) we can measure how many minutes of data are missing from each Graphite response and how many requests are not OK.

With MAX_UPDATES_PER_SECOND effectively unlimited (and always the same load), everything works fine.

In the stress test done yesterday I changed the config to MAX_UPDATES_PER_SECOND=1000; with this configuration all the carbon-cache instances appeared to be queuing data points in memory.

[screenshot: carbon-cache queued data points per instance]

But it seems like carbon is only serving data stored on disk to the graphite-web frontend. If I'm not wrong, carbon should merge both (data stored on disk and data queued in memory), shouldn't it?

Looking at the output log in more detail, we can see (https://github.com/graphite-project/carbon/issues/553#issuecomment-223360560) a lot of requests with more than 15 minutes of delay (nulls: 32, nulls: 44, nulls: 35, etc.).

Is this behavior usual, or is it perhaps a bug?

Anyway, this behavior is completely undesirable for us.

We are planning (if needed) a graphite/carbon/whisper update and will repeat these tests.

Which version is most suitable for storage arrays and works with Python 2.6.6 (RHEL 6.7)?

Thanks a lot for your help

deniszh commented 8 years ago

Hi @toni-moreno, Nah, I didn't mean your English, mine is also terrible, even with Grammarly. :) I was saying that the target of your test was not clear, at least to me - and now it's clear, thanks for the explanation.

But it seems like carbon is only serving data stored on disk to the graphite-web frontend. If I'm not wrong, carbon should merge both (data stored on disk and data queued in memory), shouldn't it?

Exactly, it should. That's why carbonlink protocol exists. It's possible to serve metrics from disk, but only to some extent and on SSDs, of course. So, it looks like there's some problem with your setup - because the cache is not working, and you're able to serve metrics from disk only.
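
For reference, that cached-data path is wired up via carbonlink in graphite-web's local_settings.py and carbon's carbon.conf; a minimal single-cache sketch using the stock default port, shown only to illustrate which settings have to agree:

# graphite-web: local_settings.py
CARBONLINK_HOSTS = ["127.0.0.1:7002"]

# carbon: carbon.conf, [cache] section
CACHE_QUERY_PORT = 7002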

Which version is most suitable for storage arrays and works with Python 2.6.6 (RHEL 6.7)?

Graphite has no special versions for SAN disks. The master branch switched to Python 2.7, but 0.9.x should work on 2.6. The latest release is 0.9.15. You can also use the 0.9.x branch from GitHub.

deniszh commented 8 years ago

@genisd

I believe that they're two different processes, which have nothing in common. No shared memory buffer or awareness. Some processes write and some read. As far as reading is concerned I think the underlying filesystem cache is the only real caching for reading data. I don't think this has been changed in recent versions, but I could be wrong.

Sorry, but that's not how Graphite works. Of course graphite-web and carbon are different processes and don't use shared memory. That's why carbon not only writes metrics to disk but also stores them in its cache, and returns a seamlessly merged result to graphite-web.

toni-moreno commented 8 years ago

@deniszh: Hi, I will try to "downgrade" to the last 0.9.x git commit, and I will repeat the test.

About the downgrade: when I did the installation (one year ago) I did it with pip/setup.py:

pip install -r requirements.txt
python ./setup.py install

Should "python ./setup.py uninstall" be enough to clean current version for graphite-web/carbon/whisper? . Any suggestion about the best way to clean old version/downgrade to 0.9.15 ?

piotr1212 commented 8 years ago

We've got VMs which process over 1,500,000 metrics (one-minute interval) on an HDS SAN doing about 3K write IOPS. These systems have MAX_UPDATES_PER_SECOND=100 and are running with 6 caches. The downside is that the cache size gets huge (about 30 minutes of data; see the pointsPerUpdate metric).

Cached data should be visible, so I expect some issue with your carbonlink hosts in local_settings.py, the cache query port, or your relay config (if you are using one).
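
When running multiple carbon-cache instances (such as the six caches mentioned above), each instance needs its own cache query port, and the relay, carbon, and graphite-web configs all have to agree on the instance names; a sketch for two instances with illustrative ports only:

# carbon.conf
[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7002

[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7102

# carbon.conf, [relay] section (if a relay fans out to the caches)
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b

# graphite-web: local_settings.py -- must list every cache instance
CARBONLINK_HOSTS = ["127.0.0.1:7002:a", "127.0.0.1:7102:b"]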

After you have tuned your MAX_UPDATES_PER_SECOND I expect you will see a lot of read IOPS, and those are slow on the SAN (this might depend on the disks used and the size of the array). The reads are needed for Graphite to do the aggregations to lower precision. You can add more RAM to the VM to avoid those reads: with plenty of RAM the reads will come from the Linux fs cache instead of going to the SAN. The VMs mentioned above have 128 GB.

obfuscurity commented 8 years ago

I've elaborated on batch writes here. It should address any questions you may have around increasing batch writes to decrease write operations.

luckywarrior commented 7 years ago

Try another project, douban/kenshin, which solves the IOPS problem. Thanks!

deniszh commented 7 years ago

@luckywarrior - oh, thanks for the info! Does it work well? Could you please share some numbers (read load, write load, number of metrics, size of files on disk)?

luckywarrior commented 7 years ago

@deniszh I tested with 6 instances per server (4-core CPU and 4 GB of memory). When the load came up to almost 100%, I got the following results:

150k metrics received per 10 seconds via carbon-c-relay, 0 metrics dropped by the relay, ~250 IOPS