vimeo / graph-explorer

A graphite dashboard powered by structured metrics
http://vimeo.github.io/graph-explorer/
Apache License 2.0

Enable multi-workers, and make data loading faster #12

Closed sykp241095 closed 11 years ago

sykp241095 commented 11 years ago

I cache this data ("targets_all", "graphs_all") to disk files when update_metrics.py runs, and I deleted the urlopen call in update_metrics.py, because app.py will now reload the data by comparing the mtime of these files when responding to /index.

When I tested multiple workers with gunicorn, I found that refresh_data only ran in a SINGLE worker. So I made app.py compare the mtime of these data files on each request to /index, and reload them if the files were modified.

It also works with a single worker as before, but several minutes faster. :)

Dieterbe commented 11 years ago

The performance of computing the metrics data structures is currently the single biggest pain point when using graph-explorer with a large number of metrics. It is currently a single-threaded operation that could easily be parallelized (by splitting the metrics list into chunks; see the sketch below). Another fix I've been looking into is creating a tag database (in something like sqlite) instead of Python in-memory structures; this should be more efficient in a few areas (possibly in addition to multi-threading).

Ideally, long term, metrics should be submitted into graphite as a list of key/value pairs instead of a "dumb" string. I've started working towards this goal but it's going to take me a while (see for example https://answers.launchpad.net/graphite/+question/223956). Once that's in place we can get rid of structured_metrics and many things will be easier. (Tag-based metrics in graphite are the future, and everyone will benefit.)
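
For illustration, a minimal sketch of that chunking idea, not graph-explorer's actual code (the tag regex and function names are hypothetical stand-ins for the per-metric work):

    import re
    from multiprocessing import Pool

    # hypothetical stand-in for the expensive per-metric work (regex
    # matching, tagging); the real structured_metrics plugins do more
    TAG_RE = re.compile(r'(\w+)=(\w+)')

    def build_chunk(metrics_chunk):
        return [dict(TAG_RE.findall(m)) for m in metrics_chunk]

    def parallel_build(metrics, workers=4):
        # split the metrics list into roughly equal chunks, one per worker
        size = max(1, len(metrics) // workers)
        chunks = [metrics[i:i + size] for i in range(0, len(metrics), size)]
        with Pool(workers) as pool:
            results = pool.map(build_chunk, chunks)
        # flatten the per-chunk results back into a single list
        return [item for chunk in results for item in chunk]

    if __name__ == '__main__':
        print(parallel_build(['server=web1.cpu=0', 'server=web2.cpu=1'], workers=2))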

Anyway, looking at your PR:

1) I like the removal of the lambda functions (and all configure functions) when storing metrics: basically we should only store what's fundamental to the metric (graphite metric, target, tags); that suffices to use the metric. The configure functions were only kept so that you could see them when inspecting a metric (to debug what happened), but that's a separate, secondary concern. Since metrics have a plugin tag, it's fairly easy to look up this info after the fact anyway.

2) I don't understand how your commit does anything wrt parallelisation. (Multi-workers? How does that work? I've never used gunicorn. Would it be better to always run under gunicorn instead of paste as I do now?)

3) Can you explain in more detail how the speedups are achieved?

thanks

sykp241095 commented 11 years ago

Hi:

to 2): Gunicorn and paste serve the same role in this case: managing workers, whether single or multiple.

Without my commit, I can also start multiple workers with paste or gunicorn, but those workers cannot hold the same data (targets_all, graphs_all) at the same time.

For example, if I start 2 workers A and B with the paste server and then "GET /refresh_data" to rebuild the data, worker A will serve my request, and "targets_all" and "graphs_all" will be rebuilt in worker A, but they are still old in worker B, because B didn't serve my "GET /refresh_data" this time.

to 3): The speedup is achieved ONLY when rebuilding the data (targets_all, graphs_all), via these steps:

(1) Instead of rebuilding the data in a web app worker, I build it in the cron job (update_metrics.py). When this cron runs, I not only download the metrics but also rebuild the two big data structures "targets_all" and "graphs_all", then "pickle.dumps" them to disk files.

(2) Instead of calling urllib.urlopen('/refresh_data'), on each request the workers TRY TO reload the data (the targets_all and graphs_all dumped above) by comparing the timestamp of the dumped files with the timestamp stored at the last reload. There is very little overhead in comparing timestamps.

(3) Finally, workers can easily RELOAD the data by running "pickle.loads" on the disk files dumped above. In my case this costs about 3 seconds with 20,000 metrics. (A sketch of steps (2) and (3) follows below.)
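
A minimal sketch of steps (2) and (3), assuming a hypothetical file name and a module-level cache (the actual patch differs in detail):

    import os
    import pickle

    DATA_FILE = 'targets_all.pickle'  # written by the cron job (update_metrics.py)

    _loaded_mtime = 0
    targets_all = None

    def maybe_reload():
        # cheap per-request check: re-read the pickle only if the file changed
        global _loaded_mtime, targets_all
        mtime = os.path.getmtime(DATA_FILE)  # a single stat() call
        if mtime > _loaded_mtime:
            with open(DATA_FILE, 'rb') as f:
                targets_all = pickle.load(f)
            _loaded_mtime = mtime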

PS: As pickle cannot dump lambdas, I copied targets_all and deleted the lambdas from the copy.
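
A sketch of that work-around, assuming targets_all maps metric names to dicts whose configure entries are lambdas (the real structure may differ):

    import pickle

    def strip_callables(targets):
        # pickle refuses to serialize lambdas, so dump a copy of targets_all
        # with any callable values (the configure functions) removed
        return {name: {k: v for k, v in t.items() if not callable(v)}
                for name, t in targets.items()}

    def dump_data(targets_all, path='targets_all.pickle'):
        with open(path, 'wb') as f:
            pickle.dump(strip_callables(targets_all), f)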

sykp241095 commented 11 years ago

We have already used this patch in our production environment for a week, and it seems good.

Dieterbe commented 11 years ago

Hey, so like I mentioned earlier, I wanted to get rid of the lambda functions anyway. As it turns out this was pretty easy to implement and I pushed it to master: https://github.com/Dieterbe/graph-explorer/commit/c01a8e9844a5d81962bddbdb02374cfdb655911d. So now you can pickle the list directly.

Thanks for the explanation, that clears up a lot for me.

I'm personally very inexperienced with webservers such as paste and gunicorn. It probably makes sense to have a config option that specifies the number of workers, but that would probably need a different implementation for paste vs gunicorn. Maybe a note in the deployment section of the readme about multiple workers?

I'm suffering a lot from GE being unresponsive, and I wonder if maybe that's because of the old refresh_data implementation; I hope your patch fixes that, so I'll try it out.

Btw I think there's a race condition, I added a comment to the code line.
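
One classic race in this dump-then-reload pattern is a worker loading a half-written pickle; a common fix, which may or may not be the exact issue flagged above, is to write to a temp file and rename it into place:

    import os
    import pickle

    def atomic_dump(obj, path):
        # rename is atomic on POSIX, so readers see either the old file or
        # the new one, never a partially written pickle
        tmp = path + '.tmp'
        with open(tmp, 'wb') as f:
            pickle.dump(obj, f)
        os.rename(tmp, path)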

sykp241095 commented 11 years ago

Because the old 'refresh_data' needs a lot of computation (regex matching), it holds the GIL in Python and blocks the whole process, so the worker becomes unresponsive.

The pickle.dumps & pickle.loads approach, or saving this data to a database as you said, will help make this faster.

Good luck!

Dieterbe commented 11 years ago

I merged it and applied some fixes (please check the commit log and have a look), but there are still some bugs that have to be fixed:

1) when you run update_metrics, the log messages emitted from backend.update_data(s_metrics) aren't shown.

2) if you have no cache files present, then start GE, then run update_metrics.py (successfully), and go to a page like /debug, /debug/metrics, or /inspect, it will keep waiting for the data files to be loaded, but they will never load because only the index page does that.

3) (it would be nice to have some documentation on how to use multiple workers, preferably with paste)

Dieterbe commented 11 years ago

Wow.. GitHub closed this automatically and won't let me reopen it. @huoxy can you reopen?

sykp241095 commented 11 years ago

I cannot reopen it either...

to 3): I have used gunicorn.

In my case, I edited app.py and appended 'app = bottle.default_app()' to it, and then ran app.py with gunicorn like this, without graph-explorer.py:

    # /usr/bin/gunicorn -w 4 app:app -b 0.0.0.0:8080

or edit graph-explorer.py :

    run('app', host=config.listen_host, port=config.listen_port, server='gunicorn', workers=4)

Then it will start 4 processes.

see this: http://blog.yprez.com/running-a-bottle-app-with-gunicorn.html

Dieterbe commented 11 years ago

OK, thanks. I created separate issues (see above) for these.