{non issue} scaling out for M+ metrics per min

gmlwall commented 9 years ago

Hey, sorry to raise an issue, i didnt know of any other way to get in contact though. As mentioned in my previous issue would love to discuss some architecture with you around scaling / maintaining front end data query integrity if you have some time :)

Im in GMT so hopefully if your not to busy we can talk, not sure what your preferred method is, but if you want to, please contact me on skype, send me a message if you have another preferred medium. my contact is

gmlwall

please feel free to close this non issue out. Thanks again

grobian commented 9 years ago

I don't use skype, and I have an irregular life. Perhaps you could send me your questions/concerns/idea via email?

rtoma commented 9 years ago

Great and interesting topics. Relevant to me, because I am in the progress of migrating a 40-45k datapoints/sec setup to 2 new servers using carbon-c-relay and fusionio.

I am willing to share my observations, graphs and configs. Maybe combine our observations, tips & tricks?

deniszh commented 9 years ago

IMO this "issue" is a good place to share scale experience, isn't it?

rtoma commented 9 years ago

+1

gmlwall commented 9 years ago

Ive gradually been scaling our graphite system as the data going into it has increased rapidly. What started as a single system has been expanded into 4 high end servers. This doesn't give highly available data though, and just about manages to keep up with writes (though disk I/O) is fully maxed. Servers have 24 cores / 200GB Ram / 6 disks in RAID5 for metrics / 2 for OS. So 7 more servers are ready to make the cluster HA and ive been trying different setups. The current setup looks like this. (setup has changed a few times as when i first set it up i didnt realise graphite returned the first dataset it finds, and doesnt concatenate them together, therefore you have to ensure all data for one set is only in 1 place, or if its in multiple places, has to be complete )

HAProxy TCP Load Balancer - {LB between 4 carbon-c-relays}

4 carbon-c-relays - { regex forces metrics to 1 of 4 servers / LB between 9 carbon caches per server }

9 carbon caches x 4 { writing locally to whisper files }

This gets queried by

one "frontend" server 1x Grafana 1x Graphite ( has no data locally, has the 4 backend servers as cluster servers )

4 "backend" servers Carbon-C-Relay Graphite (each on backend box serving local metrics via API) 9x Carbon Caches

So all servers are

1x Frontend Box { HAProxy | Grafana | Graphite } 4x Backend Box { Carbon-C-Relay | Graphite | Carbon-Caches }

Pros: It works :) - for now

Cons: As i have to force data sets to individual servers its quite alot to manage / hot spots can turn up easily

Ideas to scale this to HA:

1) Add 4 more backend servers; use carbon-c-relay to duplicate the traffic and have a mirror of the data.

a) Add the 4 servers as clustered servers, as graphite only returns the first data set found it would matter as they are identical. In case of failure only reads dont get served from the down server and only its mirror. b) Add the 4 servers as a 2nd datasource in Grafana, if there is a failure on one of the nodes, set the 2nd data source as default

2) Get rid of the backend as it is now, and use something like cyanite + cassandra. This seems ideal to me, as there would be no manual regex / moving data around etc. As cassandra would deal with all of that. Unfortunately I've spent 2 days trying to get the master to work and, at this time of writing doesnt handle the I/O i need to send through it.

At the moment option 1a is the most likely candidate. Im in the process of asking for dev time at my company to have a full team take cyanite and turn it into a production ready service, as i can see great potential in it (it deals with hashing, data replication, and could scale horizontally, without the manual effort of moving datasets, writing regex, adding front end API querys)

rtoma commented 9 years ago

If I/O is your bottleneck, have you considered SSD or even Fusion I/O?

Old hardware:

2 separate cluster
each cluster has 2 nodes
each node has 12cores, 400GB storage (4x200GB solidstate in RAID1)
each node has 2 relays + 2 caches, using RR dns to 'balance' of the relays
data is sharded (node failure != data loss)

New hardware:

2 nodes
each node has 16cores, 3.2TB Fusion I/O
each node has 2 carbon-c-relays and 4 caches
each node has haproxy to balance over its local c-relays, unsure how to balance between nodes atm
data is replicated (node failure != data loss)

The SDD's are at 20k writes/sec and 70% utilization and something slow to respond. We are eating up SDD's within 2 years, because of the huge volume of writes.

I am expecting a lot from the Fusion I/O, but its too early to compare.

deniszh commented 9 years ago

Yep, well known thing - http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-5 - use SSDs. FusionIO is looks like total overkill for for me, try SSD first. But as @rtoma says - SSD lifetime is not so great with whisper workload. And, BTW, regarding "graphite returned the first dataset it finds, and doesnt concatenate them together" - this patch fixing that and really helps during cluster rebalance. Check Jack's postings - http://linuxczar.net/blog/2015/08/12/scaling-graphite/ and http://linuxczar.net/blog/2015/08/13/scaling-graphite-2/

deniszh commented 9 years ago

Ah, and about cyainite - looks promising IMO but now it's not even compete by performance even for moderately loaded Graphite cluster.

grobian commented 9 years ago

deniszh, you wrote a blog about our architecture right? Do you have the url handy? I'd have to dig it up, but it was pretty useful!

deniszh commented 9 years ago

Here you are - http://www.iamsysadmin.ninja/2014/12/graphite-scaling-and-my-evaluation-of.html

gmlwall commented 9 years ago

so looking at that blog, -please correct me if im wrong-

zipper stack sits between graphite-web and carbon relays, and does the job of taking the web querys and getting the data from your backends? The trade off being no cache query (disk reads only)

a few questions around that, 1) does the zipper stack aggregate results? so i could relay metrics round robin style instead of forcing via regex? 2) what kind of hardware / number of servers do you run to get disk flushes sub 1min?

grobian commented 9 years ago

zipper sits between graphite-web and carbonserver, the relays are not in the picture

zipper allows us to hide frequent failures we have with stores (raid controllers don't like the io writes we do) and temporary network problems by merging results from all stores and presenting the most complete view to the consumer

we don't use haproxy, instead all nodes run a local carbon-c-relay with any_of target to a set of bigger carbon-c-relays which write to the stores

we have around 2x150 stores, all our writes are duplicated, once in the same location per cluster (replication=2) and the same for another location these stores are backed by SSDs and use whisper storage

to isolate bad effects and users, we constructed currently around 10 different clusters, most metrics go just to one, but some go to more than just one cluster

on the stores, we handle more than 2M/s metrics in total (for all stores)

gmlwall commented 9 years ago

thats alot of stores! :) And alot of metrics!

Right that sounds interesting, ill try out a proof of concept on a smaller scale of the set up you have.

So just to clarify, i could send my metrics to any store, as zipper will do the aggregation for me? Essentially i wouldnt have to spend time re-routing and merging whisper files to different stores if one set became particularly high I/O?

szibis commented 9 years ago

You can use go-carbon as replace for multiple python carbon-cache's. One multi CPU daemon will simplify this setup and will release some resources.

grobian commented 9 years ago

Got a link for go-carbon? my carbonserver approach was too naive so we had to stick with carbon-cache. Obviously, on the stores we run carbon-c-relay to any_of between the locally running carbon-cache.pys.

@gmlwall I don't understand exactly what you mean by "send to any store". If you got a cluster, let's say 5 machines, you can distribute the metrics over those 5 by using a fnv1a_ch cluster in the c-relay config. This will make sure the same metrics always go to the same machine(s). That's the best for disk space usage. If you randomly send metrics to any of the 5 machines (like haproxy rr would do) you waste enormous amounts of space, and would need zipper to be able to see something useful.

grobian commented 9 years ago

regarding your original stup, as I understand it, you got 4 used servers and 7 spare.

What I would do, is setup the 7 machines, to have a RAID10, this because in this case it should get better IO performance than RAID5 (favouring performance over diskspace). Configure them to have a number of carbon-cache processes, we use 14 on SSDs, so the 9 you have sounds like a good number. Configure carbon-c-relay on the stores to distribute incoming metrics with an any_of cluster over the carbon-caches. This is better for performance, because the relay will ensure the same metrics end up at the same caches, which means the caches don't fight over the same metrics.

Configure the 4 carbon-c-relay servers to define a fnv1a_ch cluster with replication=2 over the 7 stores. Just route everything over that cluster, the hashing algorithm does the rest. Perferably use carbon-c-relays at the sender sides, but if you can't use anycast or haproxy on some pacemaker setup. With 7 stores on disks, I doubt whether you need more than 1 relay, so you could pacemakerise the relays too (having just 2).

Configure carbonzipper to run at the frontend machine(s). You'll need to point it at all the caches, or run carbonserver at the stores (which acts like a single cache). Point graphite-web at the locally running carbonzipper (as if it were a store). Now any failure of a (single) store won't affect you. The job you have left is synchronising the stores when they were down. For this tools like bucky, carbonate, etc. can be used. We use some custom brewed script that rsyncs in all metrics and whisper-fills them, which is in dire need of replacement, but gets the job done.

szibis commented 9 years ago

@grobian https://github.com/lomik/go-carbon

I use it from a week on production and looks better then carbon-cache on pypy. Gives lower CPU/Load and better I/O over fast SSD drives. After one day of testing i was able to get 250k+ updates per second (180k-200k IOPS) on five production i2.4xlarge AWS instances (replication=2). With top carbon-c-relays in this setup (balancing, and rewrites) we get fast and stable write metrics path. I think this setup can handle at least 2x more traffic on this AWS setup, which i need for 1second metrics on some production metrics collectors.

deniszh commented 9 years ago

go-carbon is awesome! Just want to spend couple of weekends to port 'carbon query bulk' and blacklist functionality there... Or I can turn off 'query-bulk' and use carbon-c-relay as blacklist filter though...

szibis commented 9 years ago

With go-carbon using graphite-web as API on carbon stores is simple and fast enough - only one carbonlink endpoint. Average response in about 50ms from local whisper, maks requests up-to 200ms. On fronts in my case same graphite-web merging requests from 5 instances and it's still fast enough. Average response depends on functions you choose on metrics and merging requests from 5 instances is not a problem with such traffic and response times i have. Maybe in future carbonzipper with carbonserver will do the job better. But i have graphite-web cache in couchbase (memcache) and nginx if needed.

gmlwall commented 9 years ago

Lots of really great ideas here, many thanks. @grobian im going to set up that stack as you suggested and leave it running over the weekend to compare I/O stats to the old cluster. go-carbon looks very interesting as well. Another branch of our company has been using OpenTSDB as thier backend storage. I have yet to do any investigation into its viability. Has anyone had any experience with it here, with any pros / cons as to why they didnt/did use it?

dgryski commented 9 years ago

@szibis on our hardware, carbonserver answers in ~0ms if it doesn't have a metric, and ~6-12ms if it does, and it handles globbing / wildcards.

grobian / carbon-c-relay

{non issue} scaling out for M+ metrics per min #95