kamon-io / docker-grafana-graphite

Docker image with StatsD, Graphite, Grafana 2 and a Kamon Dashboard
Apache License 2.0

Production usage: memory usage under load #7

Open behrad opened 10 years ago

behrad commented 10 years ago

In a simple load test, I saw the graphite (carbon-cache) python process rapidly eating memory (a few GBs in 5 minutes), and the system went into heavy swapping until it became totally unresponsive.

1) I've submitted this to the graphite issue tracker; however, could it be related to the docker image settings? (see the carbon.conf sketch below)

2) Do you have any real experience with this under heavy load?

3) Could an InfluxDB image behave better in production?

and thank you for your really nice work :+1:
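
For question 1 above, here is a minimal sketch of the carbon.conf `[cache]` settings that usually bound carbon-cache memory growth. The values are illustrative only and the right numbers depend on the workload and disk speed:

```
[cache]
# The stock default of "inf" lets the in-memory cache grow without bound;
# capping it makes carbon-cache drop incoming datapoints instead of swapping.
MAX_CACHE_SIZE = 2000000

# How fast cached datapoints are flushed to whisper files on disk; raising
# this drains the cache faster at the cost of more disk I/O.
MAX_UPDATES_PER_SECOND = 500
MAX_CREATES_PER_MINUTE = 50
```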

ivantopo commented 10 years ago

@behrad thanks for bringing this up! My comments:

1) It might be related to some settings. Just as a little disclaimer, we didn't create this docker image as something to be used in production, but rather as a tool to speed up the process of getting metrics out of Akka/Spray/Play! in a development environment... Regardless of that, if there is something we can do to improve the image's behavior under real-world load, then we will certainly proceed! Did you get any advice on the graphite list? Please drop us the link to the question so we can follow up too!

2) Nope... we used it in production for a few days as an experiment, but it was a secondary thing and we never got a very detailed view of how it was behaving.

3) I don't think so, and we don't have an InfluxDB module yet! (working on it).

Hopefully with your help we can move this forward a bit and make it better for everyone. Regards!

behrad commented 10 years ago

> but it was a secondary thing

What is your production monitoring tool then? The Activator console? I also saw a kamon-dashboard in your GitHub, but it seems old and inactive.

I also need some clarification on the meaning of the Kamon metrics... Shall I open a new issue to discuss them?

Nice to hear from you again, @ivantopo

behrad commented 10 years ago

https://github.com/graphite-project/carbon/issues/318

ivantopo commented 10 years ago

We were using New Relic plus some internal custom stuff... The Activator console is meant to be used only in development; never use that in prod! Our kamon-dashboard has been there for a long time and we are very close to taking it to a decent state. Keep an eye on https://github.com/kamon-io/Kamon/issues/64; right now we will focus on getting the current release out, and then the next release will contain at least a basic version of our dashboard.

With regards to metric meanings, maybe this thread is of help; if you have any further questions, replying there is probably the best way to follow up. Regards!

behrad commented 9 years ago

@ivantopo 1) How can the number of processed messages per actor be obtained? I have three timers (mailBoxSize / TimeInMailbox / ProcessingTime) and one counter named errors; I want to see the rate at which messages are being handled by my actors.

2) Should I use mean_99_999 or upper_99_999 for mailbox-size if I want to monitor the number of messages in my mailbox? :)

ivantopo commented 9 years ago

Hello @behrad, you can get the processed message counts by using the processing-time.count metric. With regards to monitoring the number of messages in a mailbox, I would recommend displaying at least these three metrics together for a given mailbox: the lower, the mean and the upper.

If you want more, add a few more upper_xx metrics, but never use the mean_xx ones, as they reflect the mean of all values below the xx percentile, which might make you think that everything is better than it really is. If you need to take something out, take out the mean.

When monitoring mailboxes we find it more important to see the bounds between which the size moves (by plotting lower and upper), plus some metric in the middle (like the mean) to give an idea of where it sat most of the time during a certain period. If the lower starts going up, something is slowing down the actor and will likely drive the upper to a higher value as well; that's a pattern that, if present for some time without proper back-pressure control, will lead to an OOM. If the upper goes down, either whoever produces the work done by this actor is slowing down, or the actor is magically working faster; you can correlate with the processing-time metric to find out which of the two cases applies.

Finally, please note that these recommendations are specific to displaying the mailbox size and should not be applied to latency measurements such as processing-time and time-in-mailbox. We have had an issue open for quite some time to apply these changes to the docker image, but we couldn't do it yet; keep an eye on it if you are interested.
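
To make that concrete, assuming the usual StatsD-to-Graphite naming (the `stats.timers` prefix is standard StatsD; the app, host and actor segments below are placeholders, so check the actual paths in Graphite's metric tree), the three mailbox-size series to plot together would look roughly like:

```
stats.timers.<app>.<host>.actor.<actor-name>.mailbox-size.lower
stats.timers.<app>.<host>.actor.<actor-name>.mailbox-size.mean
stats.timers.<app>.<host>.actor.<actor-name>.mailbox-size.upper_99_999
```

Only the lower/mean/upper suffixes matter for the recommendation; the prefix depends on your StatsD and Kamon configuration.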

behrad commented 9 years ago

Perfect post @ivantopo, thank you for this nice clarification :heart:

Related to the main issue, I should say:

  1. I've excluded many of my actors in my Kamon metric filter. I constantly see the Grafana console stop showing metrics (Kamon keeps pushing to StatsD, but Grafana shows nothing) when an actor is generated for each request (the actor-per-request pattern), so I excluded them all; see the config sketch after this list.
  2. I've disabled LogReporter and SystemMetrics. Another two dumb questions: what is the exact purpose of LogReporter? And does SystemMetrics report global system resource utilization or only that of the JVM process?
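
For reference, the filter in item 1 might look roughly like the sketch below in application.conf. This assumes a Kamon 0.6-style `kamon.metric.filters` block, and the application and actor name patterns are placeholders, so check the exact keys for the Kamon version in use:

```
kamon.metric.filters {
  akka-actor {
    includes = [ "my-app/user/**" ]
    # Hypothetical pattern matching the short-lived actor-per-request actors
    # whose metrics flooded StatsD/Grafana.
    excludes = [ "my-app/user/request-handler-*" ]
  }
}
```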

I'll re-enable system metrics and run my stress tests, then I'll report back on whether the memory issue has gone away.

appreciate your patience :p

ivantopo commented 9 years ago

The LogReporter is just a quick tool to see metrics in your console, for development purposes. It's useful when you just want to see some quick numbers without setting up an external metrics backend.

The system metrics module reports both global and JVM-specific metrics: you can see that it reports "cpu" and "proc-cpu", representing global and JVM CPU usage respectively, along with some other metrics (heap, garbage collection, memory, network and context switches). If you need more details on the module, please drop us a line on the mailing list in order to keep this thread on-topic. Regards!
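
As a rough illustration of the toggles discussed above (assuming a Kamon 0.6-style `kamon.modules` block; the module keys and the `auto-start` setting may differ in other Kamon versions, so treat this as a sketch):

```
kamon.modules {
  # LogReporter: console-only output for development; safe to keep off under load.
  kamon-log-reporter {
    auto-start = no
  }
  # SystemMetrics: host-level metrics (cpu, network, context switches) plus
  # process-level metrics (proc-cpu, heap, garbage collection).
  kamon-system-metrics {
    auto-start = yes
  }
}
```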