livingsocial / rearview

Timeseries data monitoring framework

Interface slows browser to a crawl #40

Open shawnjgoff opened 10 years ago

shawnjgoff commented 10 years ago

When working on a dashboard with many monitors, the interface slows to a crawl. I suspect it is due to graphing. I would like to be able to show a list of the monitors without them trying to graph their data.

Right now, if I want to edit a monitor, I need to turn all the monitors on the dashboard off before expanding a monitor to edit it, otherwise, garbage collection happens every few seconds, which is maddening.

steveakers commented 10 years ago

Have you tried dividing your monitors into categories? This is one reason why we created the category functionality. Also, when I've seen this in the past the monitors were plotting way more data than was actually needed for a meaningful monitor (e.g. 1440 minutes graphed when the last 10 were used for an alert.) If this is the case, try graphing less data either by decreasing minutes back or by using the summarize feature (e.g. look at hourly instead of minutely.)
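To illustrate the effect of summarizing, here is a minimal Ruby sketch of how averaging per-minute values into hourly buckets shrinks a one-day series from 1440 points to 24. The helper `summarize_avg` and the sample values are hypothetical and only mimic the idea; in practice the reduction would probably happen in the Graphite target itself, with something like summarize(server.*.whatever.metric, "1hour", "avg").

```ruby
# Hypothetical helper (not part of Rearview): average values into
# fixed-size buckets, roughly what an hourly summarize would do.
def summarize_avg(values, bucket_size)
  values.each_slice(bucket_size).map do |bucket|
    present = bucket.compact # series may contain nil gaps
    present.empty? ? nil : present.sum / present.size.to_f
  end
end

# One day of made-up per-minute values (assumption, for illustration only).
minutely = Array.new(1440) { |i| (i % 60).to_f }
hourly = summarize_avg(minutely, 60)

puts "points before: #{minutely.size}, after: #{hourly.size}"
```

The monitor script then iterates over far fewer values per series, and the graph has far fewer points to draw.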

shawnjgoff commented 10 years ago

I don't see a way to move monitors to another dashboard or category.

I have 8 monitors; 4 get a minute of data, 4 get an hour of data, and each of them uses all that data for the monitor. However, each of those monitors currently looks at about 16 servers (it will be around 30 as soon as it's all set up, and will grow automatically as new servers start sending in metrics); some of them pull multiple metrics for a monitor (e.g. to calculate free memory, I need the free, cached, and buffered metrics). I can probably use summarize to reduce the number of datapoints.

Even if I get the graphs usable again, they are actually completely useless to me, so it would still be nice to disable them. Graphite and a bunch of other tools can show me graphs, but the real strength in Rearview is the monitoring and alerting.

steveakers commented 10 years ago

This blog post shows you how to create categories: https://techblog.livingsocial.com/blog/2014/01/24/rearview-on-rails/. Once they are created, you can move monitors to other categories in the monitor's settings tab.

Are you looking at each server individually in your monitors? Meaning, will you fire an alert that will list any and all offending servers so you have to look at them all individually?

shawnjgoff commented 10 years ago

Thanks. I'll try the categories.

I use a wildcard like server.*.whatever.metric; this means it picks up all servers that are chucking data at us - new servers can be added without modification of the monitoring setup. In the monitor, I look at each server individually and note which ones failed in the alert. Here is my most basic monitor script:

servers = []
@timeseries.each do |series|
  # df(1) output differs from the collectd df plugin output.
  # I want to alert at < 200MB free. The number below
  # was obtained by looking at the df number of a
  # system with 171MB free, and using ratios to
  # figure out what 200MB would be.
  if series.values.any? { |v| v < 460083798 }
    # The label looks like server.<name>.whatever.metric; report the name and the lowest value seen.
    servers << "#{series.label.split('.')[1]} (#{series.values.min})"
  end
end
if servers.any?
  raise "Low disk space on: %s" % servers.join(', ')
end

talbright commented 10 years ago

Did this fix your issue?

shawnjgoff commented 10 years ago

It alleviated the problem, but it's still annoying to use. I can't leave it open because even with just a single metric in a category, the browser pauses for several seconds every minute or so. There are a couple of other pain points: it pauses the browser for about 10 seconds when it switches to a category, and it takes several seconds to render the settings panel (the title bar one, not the gear one).

shawnjgoff commented 10 years ago

Some of the initial rendering time was due to a bad interaction with a plugin. I created a new Firefox profile and captured a performance profile in it. In the fresh Firefox profile, it looks like it's pausing for a bit over 3 seconds. If having the profile file will help, I can provide it; it's 20MB.

[Screenshot: 2014-06-04-092343_1914x1055_scrot]

steveakers commented 10 years ago

That would be helpful... thanks. Quick question: how many datapoints do your graphs plot collectively? Are you plotting a day's worth of data for 10+ servers, for example?

shawnjgoff commented 10 years ago

Posted the profile here: https://downloads.accns.com/rearview/rearview_profile.json

I have 7 graphs. Two that each have 60 data points per series and 50 series, four graphs that have about 2 data points by about 20 series, and one graph that has a single series with 60 data points.
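As a rough tally of the figures above, the following Ruby sketch multiplies out graphs, series, and points per series. The counts are taken from the description (several of them are approximate), so the total is an estimate rather than a measurement.

```ruby
# Each entry: [number of graphs, series per graph, data points per series],
# using the approximate figures from the comment above.
graphs = [
  [2, 50, 60], # two graphs, 50 series each, 60 points per series
  [4, 20, 2],  # four graphs, about 20 series each, about 2 points per series
  [1, 1, 60]   # one graph, a single series, 60 points
]

total = graphs.sum { |count, series, points| count * series * points }
puts "approximate datapoints on the dashboard: #{total}"
```

Nearly all of the volume comes from the two 50-series graphs, so those are the ones worth reducing first.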

steveakers commented 10 years ago

For the graph that has 60 datapoints for 50 servers, I wonder if using a filter might be useful. Something like averageBelow(server.*.whatever.metric,200). This will limit the number of datapoints by limiting the servers to only those that will fail. Worth a shot anyway.
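To make the filtering idea concrete, here is a minimal Ruby sketch of what an average-below filter does to a list of series: it keeps only the series whose mean is under the limit. The `Series` struct, the `average_below` helper, and the sample data are all hypothetical stand-ins, and the exact boundary handling in Graphite's own function may differ from this sketch.

```ruby
# Stand-in for a series object with a label and values (assumption).
Series = Struct.new(:label, :values)

# Hypothetical helper: keep only series whose average is below n.
def average_below(series_list, n)
  series_list.select do |series|
    present = series.values.compact # ignore nil gaps
    !present.empty? && present.sum / present.size.to_f < n
  end
end

all_series = [
  Series.new('server.a.whatever.metric', [150, 180, 170]),
  Series.new('server.b.whatever.metric', [900, 950, 870]),
  Series.new('server.c.whatever.metric', [nil, nil])
]

kept = average_below(all_series, 200)
puts kept.map(&:label).join(', ')
```

Only the series that survive the filter are returned, so both the graph and the monitor script see fewer datapoints. Note that a single limit applies to every series.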

shawnjgoff commented 10 years ago

I may be able to make it better, but that won't work for me because different servers have different thresholds, and some of them don't have a minimum threshold. For now, I've pulled it back to the last 10 minutes instead of the last hour. It doesn't seem to have affected the pauses.
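One way to handle differing thresholds inside the monitor script itself is a per-server lookup table. The sketch below is only an illustration: the `Series` struct stands in for the objects in @timeseries, and the server names, metric labels, and threshold values are all made up.

```ruby
# Stand-in for the series objects a monitor receives (assumption).
Series = Struct.new(:label, :values)

# Hypothetical per-server minimums; servers not listed have no minimum.
THRESHOLDS = {
  'web01' => 460_083_798,
  'db01'  => 920_167_596
}.freeze

# Returns "name (lowest value)" for each server below its own threshold.
def low_disk_servers(timeseries, thresholds)
  timeseries.each_with_object([]) do |series, failed|
    host = series.label.split('.')[1]
    limit = thresholds[host]
    next if limit.nil? # no threshold configured: never alert

    values = series.values.compact
    next if values.empty?

    failed << "#{host} (#{values.min})" if values.min < limit
  end
end

timeseries = [
  Series.new('server.web01.df.free', [500_000_000, 450_000_000]),
  Series.new('server.db01.df.free', [950_000_000, 930_000_000]),
  Series.new('server.cache01.df.free', [1_000, 2_000])
]

failed = low_disk_servers(timeseries, THRESHOLDS)
puts "Low disk space on: #{failed.join(', ')}" if failed.any?
```

In an actual monitor, the final line would raise instead of printing, as in the script earlier in this thread. This does not reduce the data fetched, so it addresses alerting correctness rather than the browser pauses.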

talbright commented 10 years ago

@shawnjgoff I like the idea of a monitor-only mode, though I'm not sure how that should look in the UI yet. There's definitely room for improvement in performance on both the front end and the back end. For the time being, I think what @steveakers is trying to do is get you to think about ways to lower the data volume coming back, while still remaining useful to you of course.