rollups. - Githubissues

Dieterbe commented 8 years ago

the time has come.

https://github.com/raintank/ops/issues/112 has some details
i know we wrote down some thoughts etc at the summit, do we still have those notes? or perhaps not that important
implementation will probably be an nsq consumer that generates all lower-res streams and stores them (including for current data), as opposed to a design where lower-res only starts where higher-res ends.
we can generate spread data like librato/omniti/hostedgraphite, or store individual min/max/avg/.. series, or use an algo like LTTB (see https://github.com/sveinn-steinarsson/flot-downsample/)

Dieterbe commented 8 years ago

the way i see it, before coming up with proper rollup settings, it's a good practice to first list the desires and be aware of each of them, because this is where we may have different opinions: (and i think maybe i forgot a few considerations that @torkelo or @woodsaj may think of)

ideally, long archives at high quality so we have ample quality even for short timeframes for long ago (but goes against backend resources, performance, data transfer, and client side computation)
data as lightweight as possible to still provide high resolution for commonly requested ranges
round up timeframes, like aim for periods such as 8 days, 13 months, etc so you can compare current day against the day a week ago, etc.
not too much redundant data (multiple "resolution bands" shouldn't have too much overlap)
allow multiple input resolutions, good fit of rollups for each. (i think high resolution should also stay fairly high resolution throughout different bands, whereas if data comes in at low resolution, we can also be more aggressive in our rollups)

woodsaj commented 8 years ago

The two key drivers for rollups are:

improve loading times of graphs that span large timeframes.
reduce storage requirement

right now, item 1 is of higher priority then item2.

In addition to the requirements listed above, i would add

support variable input resolutions. Meaning users can start sending a metric every 10seconds and change it to every 60seconds and vice versa

I am not sure if an additional NSQ consumer is the right approach.

In order to perform rollups, we are going to need to buffer metrics in memory. Querying this inMemory data is going to be faster then querying the TSDB, and so we should have graphite leverage it. In which case we can delay when we write to C*

So, my view is that we read from NSQ, then just write to the inMemory store. We then have another process or thread that uses this inMemory store to:

periodically flush the inMemory data to C* as raw data. This has the advantage of improving write performance on cassandra. (this is what the signalFx article refers to as stage 1 - vertical writes)
periodically write aggregated data to C*

Dieterbe commented 8 years ago

Querying this inMemory data is going to be faster then querying the TSDB, and so we should have graphite leverage it

how sure are we that there is a significant difference in response time for serving read requests for recent data? or maybe this question is not that relevant because if we're going to batch up writes (and it looks like we should) then we need this in-memory component to satisfy hot data anyway.

i agree with your appoach, but the way i envision this, the nsq reading, inMemory store, and C* flushing can all happen in the same process, until we reach scalability limits.

Dieterbe commented 8 years ago

currently prototyping this

Dieterbe commented 8 years ago

progress is at https://github.com/raintank/raintank-metric/tree/tank https://github.com/raintank/raintank-docker/tree/tank

Dieterbe commented 8 years ago

basic skeleton is done
basic in-mem chunkserver prototype works. it uses a circular buffer of go-tsz compressed chunks, for each metric key.
simple http interface for serving up data works
wrote unit tests and have been manually testing in devstack, single endpoint, 5 chunks of 120s each seems to work fine

(https://github.com/raintank/raintank-metric/tree/tank/nsq_metrics_tank has some details)

next up:

a modified graphite-kairosdb to query nsq_metric_tank for data no less then a configurable amount of seconds old (chunkSpan*(numChunks-1)) would be nice. @woodsaj can you give this a shot?
send many more metrics, validate http output looks good with larger chunk sizes. I think having (1) in dev stack would help here. check cpu/ram usage, response times.
implement rollups
implement saving to cassandra. i know very little about cassandra and i wonder if @woodsaj is also interested in giving this a shot. https://github.com/gocql/gocql looks like the best go library for cassandra but i read somewhere that CQL is not that mature and that "old style" is often still better (though not sure what that means). anyway just some code that connects to C* and can save new chunks to specific per-metric rows or something would be nice, i can then glue it into the daemon.

torkelo commented 8 years ago

nice progress! is nsq_metrics_tank a separate process or is part of raintank-metric process that feeds on NSQ.

Just curios of the topology / processes involved the the metric stack. has NSQ completely replaced RabbitMq? What process receives metrics from collectors and batches them onto NSQ, and then there is another process that receives them and saves meta data to elastic and metrics to cassandra?

Dieterbe commented 8 years ago

every nsq_* app in this repo runs as a service. there is no more raintank-metric process. nothing in here feeds to nsq, they all consume from nsq. there is one for maintaining the metric definitions in ES, one for storing probe events in ES, one for saving metrics to kairos. and now this new one, which should eventually also replace the latter.

we still use rabbitmq for the grafana app bus and perhaps a few other things, not sure. but nsq is used for high-throughput items (metrics and probe-events). it's kind of annoying that we have 2 messaging systems, but they have different characteristics and we exploit that. using rabbitmq for everything would be far from ideal, and ditto for NSQ probably ( @woodsaj is more familiar with the rabbitmq specifics)

grafana receives the data from the collectors (they run in collector-controller mode only) and sends it to NSQ. the collector-controllers also doesn't batch anymore, it exploits the fact that incoming data is naturally batched. in fact, batches are split up to make sure the messages stay under 10MB each.

PS: i will update the readme based on this

torkelo commented 8 years ago

@Dieterbe thanks for clarifying, make sense. Would be interesting to know if rabbitmq is required or we could use NSQ there as well.

Been thinking about Grafana and distributed setups (sharing cache state. for example for the alerting def index). What would be the best way for grafana nodes to talk to each other? NSQ, nanomsg. rabbit, raft..

Dieterbe commented 8 years ago

that sounds like something we should have a deeper conversation about, to learn context, requirements etc. happy to hangout about that some time. is there a ticket where we can resume this convo?

nopzor1200 commented 8 years ago

Just wanted to add in my 2c here, around a partcular angle (disclaimer: I don't know enough about the tradeoffs for nsq vs rabbit for various use cases)...

over the long term, it is definitely a downside from a distribution/packaging/supportability standpoint if our stack will require nsq (or whatever name we end up giving that component of the stack) and rabbit.

maybe not that big a deal for the saas offering, but if nsq could be a tight no-dependency part of our stack, it seems ideal to evaluate using it for * esp for the on-prem / downloadable use case.

Dieterbe commented 8 years ago

let's please have that conversation elsewhere (maybe in strategy repo or something)

woodsaj commented 8 years ago

@Dieterbe a few issues.

1) you are using metric.Name as the key, but this is not a globally unique identifier for a metric series as it does not contain any data about who owns the metric (Org). This can be easily addressed by changing https://github.com/raintank/raintank-metric/blob/tank/nsq_metrics_tank/handler.go#L36 to m := h.metrics.Get(metric.Id())

2) As with the above, the HTTP interface needs to either a) accept the metric ID as the query term, or b) keep a local index of metric.Id's and metric.name + org_id. Options A is obviously simpler and also preferable as Graphite-api gets the metric.Ids from Elastic anyway.

other then that, this is looking awesome.

Dieterbe commented 8 years ago

@woodsaj ok will look into that. do you have anything to say to what i asked about populating cassandra? (last point of https://github.com/raintank/raintank-metric/issues/21#issuecomment-142921544) thanks.

woodsaj commented 8 years ago

yes, happy to help out, and have already researched sufficiently to get started.

woodsaj commented 8 years ago

Basic cassandra support in https://github.com/raintank/raintank-metric/tree/tank_to_cassandra

The methods are there for sending data, but not sure where the data should be written from. What mechanism will flush out the aggregated metrics?

Dieterbe commented 8 years ago

now we should have flushing of the chunks to cassandra (it says it saves fine) + loading of the chunks from cassandra in the http interface to satisfy timestamps that fall outside of the in-memory range. for some reason it doesn't actually work yet but i think we're close :-p

btw, would be nice if we got the http json output to be the same format as graphite ("array" style [{"target": "error", "datapoints": [[null, 1443070801]], ...}]) but i haven't figured out yet how to do that

Dieterbe commented 8 years ago

summary

loading and saving seems to work, verified via json api in dev stack and working with a single endpoint

todo

high load testing: high write-load + high read-load. compare performance to our current stack. in terms of request handling capability, cpu and mem usage. also the effect on cassandra mem, cpu and disk. env-load and vegeta will make this fairly easy. we must also verify some random series and check that their results look good, during these benchmarks. test different kinds of read queries (long, short, old data, recent, etc)
verify rollup outcomes
create a graphite plugin that queries NMT (for now it's the gateway for both hot and cold data through its json api)
think about what happens in case of crashes or if we need to restart server. i.e. partial chunks during startup, aggregations, do we use a WAL?

Dieterbe commented 8 years ago

we can already easily spin up raintank-docker as current stack or tank based stack (remember to rebuild docker images!). but now i've been working on a script to collect perf metrics, do benchmarks with vegeta, verify correctness of data returned by graphite, as well as documenting the procedure so that the process is smooth and minimizes room for errors https://github.com/raintank/raintank-docker/wiki/performance-testing-a-timeseries-backend once some more things are fixed/implemented, it should be trivial to benchmark both stacks in a structured manner and compare them.

Dieterbe commented 8 years ago

to figure out before starting benches:

NMT requests/s doesn't seem accurate. should be around 120/s after env-load see tail -f /var/log/raintank/grafana-dev.log | grep 'job results' | linecounter -freq=10000 select * from stats.timers.nsq_metrics_tank.nmt.requests_span.mem.count_ps;
alert outcomes shows a lot of empty response (though sudo ngrep -d any -W byline host 172.17.0.45 and port 6063 looks pretty good)

Dieterbe commented 8 years ago

summary

https://github.com/raintank/raintank-tsdb-benchmark lets us do benchmark runs. it replaces the wikipage mentioned above.
disabled aggregations for now. they made things more complicated to verify and i want to focus on verifying old vs new on equal terms.
i also disabled alerting (well, the tsdb-benchmark readme says to disable it in the grafana config, but it's not committed into the repo) cause there's some known issues i don't want to depend on, and it adds noise to the benchmarks.
current results out of env-load are a bit unrealistic: we monitor localhost which always returns 0ms. i want to test NMT/go-tsz in a realistic setting with fluctuating response times, but i also don't want to do a lot of io between devstack and some remote node on the internet, ideally all traffic should be contained within the machine where we run all tests. maybe something like https://github.com/raintank/bad-http-server which returns responses through a mimicked delayed network connection? @woodsaj thoughts? want to give this a shot?
i added a bunch of instrumentation (measure.sh for system stats cpu/mem, and graphite-watcher that checks if data returned by graphite looks superficially good (no nulls when there shouldn't be, data ordering, step, etc), these automatically start up in devstack when it launches and they come with dashboards preloaded in grafana. (tank branch only, has to be ported to master as well). i didn't tackle the subject of verifying data correctness yet. the ideal approach here I think would be sniffing points sent over nsq for a random subset of series, querying graphite, and verifying that it retuns the proper data. alternatively we could also just assume that unit tests give us enough confidence, or somehow feed in artificial data that is easy to visually verify by logging in and looking at graphs. but given the previous point, it seems better to use realistic data. @woodsaj thoughts?
https://github.com/raintank/raintank-metric/issues/37 needs to be fixed cause it affects the lag measurements.
other tickets that need fixing: https://github.com/raintank/raintank-docker/issues/42, https://github.com/raintank/graphite-raintank/issues/2, https://github.com/raintank/raintank-metric/issues/38

woodsaj commented 8 years ago

i want to test NMT/go-tsz in a realistic setting with fluctuating response times

You can try using kernels built in traffic control (tc) mechanism. I dont think you will be able to apply changes directly to individual containers. But it would work nicely inside a full VM, using virtualbox, kvm, or vmware.

http://www.linuxfoundation.org/collaborate/workgroups/networking/netem

woodsaj commented 8 years ago

I was able to get tc working with the docker containers. I added a commit to raintank-docker, so that network emulation rules are applied to the interfaces on all of the collector containers to increase their latency. https://github.com/raintank/raintank-docker/commit/145736ae75998db86e1a8343572c504b3d25d35b

The increased latency wont be applied if the endpoint is localhost. But will work for other addresses. So i would recommend that you change env-load to use the address of the docker0 interface instead of localhost.

Dieterbe commented 8 years ago

good stuff AJ. i just did https://github.com/raintank/raintank-docker/commit/1b6462560a9f6321b7ccfd6a1fc87f9b3780b228 but otherwise works great.

Dieterbe commented 8 years ago

the server now should have functioning rollups, i just have a hard time visually verifying because the metrics don't show up in the graphite-api output so i can't draw them in grafana.

Dieterbe commented 8 years ago

been making progress on serving consolidated responses in the http handler https://github.com/raintank/raintank-metric/pull/63 if anyone is curious to see the work in progress.

Dieterbe commented 8 years ago

with #63 gearing up it's time to start finding a great set of rollup intervals/settings. i know @woodsaj at one point had a google doc with a table of suggested rollup settings. can you share ( you shared "by the numbers v2" but that looks like smth else)

some thoughts:

I found that on the fly aggregation in Go is very fast. aggregating 1k values takes <1µs, so doing it 1k times, eg for each point on a graph still means no more than 1ms. this was on my laptop. presumably in graphite-web it's slow because it's python ? so the main value of computing offline aggregations will be for storage space and potentially faster network tx between c* and our graphite/tank stack. not so much faster graph rendering.
rollup data gives more bang for buck the higher the point reduction between the various stages (raw vs aggregation resolution 1 vs resolution 2 etc). <=10x difference in points/size does not seem worth it.
other than what we drive through litmus (and litmus should only be a fraction of the total viz being done on our tsdb), we don't really know what people will be looking at. grafana's timepicker gives a plethora of options and it seems we'll see a fairly random distribution of timeframes of several days, several weeks, months but >= 1 year seems less likely.
in litmus, people can configure an interval of 10/30/60/120s for raw data. for non-litmus data i expect 1s will become common too and litmus will start doing 1s too probably (for ping at least)
in the extreme case, people may want to see 4 years of data on a full HD screen, so 1920 pix wide. we can sufficiently serve this with about 1 point every 10px, so 192pix or 7days per point. I like the idea of having full days as aggregation buckets, not just for scenarios like the above (which are probably not very common) but I think they would be useful for smartSummarize() and the like as well. @woodsaj seems more of a fan of max 6h buckets, which I'm fine with until we have better insights. using 6h buckets we should be able to easily compute per-day and per-7 days stats as well at runtime. (42 points, 192 times to be precise for the above scenario)
so raw is 1~120s. max bucket is 21600s (6h), it's going to be an interesting exercise finding the right layers in between and why. we should of course also take into account how long we would store each interval for, which affects the number.
for length-of-interval, i always like to do like 9 days, 6 weeks, 14 months. so that you can do "how does today/yesterday compare to same day last week", "how does last week compare to same week a month ago", etc.
also allow multiple input resolutions, good fit of rollups for each. (i think high resolution should also stay fairly high resolution throughout different bands, whereas if data comes in at low resolution, we can also be more aggressive in our rollups) @woodsaj you've thought about this in the past right?

woodsaj commented 8 years ago

the amount of wall time needed to perform aggregation isnt the problem. The problem is CPU time. A production deployment wont be able to dedicated a CPU core for the full duration of time needed to process large sets of data. This means lots of context switches and L1,L2 cache misses. Having to aggregate millions of points (ie, avg Ping latency across all probe locations over 7Days) down to a few hundred is very resource intensive. Factor in that there are 10-20 of these queries per dashboard, and you quickly realize that computing aggregations on the fly is not an option. Just the savings in network transfer will be substantial. 7days of raw data for "litmus.google_com.*.ping.avg" is about 45Mb (Kairosdb data, not metric-tank), with a 1gb network interface it would take at best ~300ms to transfer. Add in disk read latency and you are probably already above 1second before you event start aggregating the data.

also allow multiple input resolutions, good fit of rollups for each. (i think high resolution should also stay fairly high resolution throughout different bands, whereas if data comes in at low resolution, we can also be more aggressive in our rollups)

If rollup periods are not the same across all series, then there needs to be an index of the rollup periods that each series has available. At this stage we need to optimize for code simplicity.

Dieterbe commented 8 years ago

Factor in that...you quickly realize

that makes sense. so would you say each step in the rollup interval should be about 10x of the previous? For example if maxDataPoints is 800 and we set minDataPoints to 80 (it seems proper to allow points anywhere from 1 point per pixel to 1 point every 10 pixels), then if we had rollup intervals all about multiples of 10, then we would always be able to find a matching rollup interval that can be served without doing any further reduction at runtime. right now i'm playing with these settings in devstack: https://github.com/raintank/raintank-docker/commit/55e10422b912c81e7014f0c7dd54c5af284a9a41 needs some further thinking of course and instrumentation in the code of how proper our retention intervals correspond to requests, how much overhead there is, etc. and @woodsaj can you share the table of numbers of rollup intervals/retentions you had?

the main issue i have at this point with my settings now is that the raw stream can have a very broad range of resolutions. for litmus currently 10s to 120s but non litmus or litmus in the future i can see anywhere from 1s to 1h, this means the compression rate of the first rollup interval varies wildly and the first rollup level can be anywhere from basically useless to drastically insufficient, still requiring major runtime aggregation.

but like you say we can get the basic form working first and then worry about per-series/per-customer adjustments later once we have a better understanding of what we want.

woodsaj commented 8 years ago

@Dieterbe if the doc i shared with you doesnt have what you are looking for, then the information no longer exists.

Dieterbe commented 8 years ago

playing with rollups like a pro

1) update your dev stack 2) the nsq tools still need to be manually compiled, sorry. use consolidation-at-read-time branch for raintank-metric 3) disable alerting in grafana config, it creates too many queries that make looking at the log annoying 4) add --log-level 0 to the metricTank command in the screen file. 5) and replace the tail with tail -f /var/log/raintank/nsq_metrics_tank.log | grep -v 'pushing value to agg' in the screen file 6) launch stack 7) kill graphite watcher for same reason as alerting 8) add an endpoint as grafana, use all defaults 9) run ./delay_collector.sh, ./delay_collector.sh dev1 500 10 and similar commands to change the latency profile at specific points in time and give it some time to run at specific latency profiles. 10) open the new rollups tester dashboard so you can focus on one particular series (new since https://github.com/raintank/raintank-docker/commit/ab35c6e8e0c0eb4a96005aaeebd49aad0cec624e) 11) looking at the dash and the metricTank log you can now do experiments with display interval, and see what kind of data it loads, and you can verify whether things look allright or not. rollups-testing

Dieterbe commented 8 years ago

here's a video where i show it off: https://vimeo.com/147804095 i also merged this into master.

Dieterbe commented 8 years ago

this has been implemented for a while.

grafana / metrictank

rollups. #21

summary

todo

summary

playing with rollups like a pro