graphite-project / graphite-web

A highly scalable real-time graphing system
http://graphite.readthedocs.org/
Apache License 2.0

Webapp cluster performance issue #1690

Closed Kenterfie closed 7 years ago

Kenterfie commented 8 years ago

Configuration
OS: Ubuntu 14.04
Webserver: nginx
WSGI: uwsgi
Graphite: graphite (latest git)

2 Graphite storage servers + 1 Graphite webapp render host (JSON output only)

Issue
We store our data on two different Graphite storage hosts via statsite + carbon-c-relay. Everything is working well, but one problem we have noticed is that every request takes very long on the webapp render host.

Rendering performance
A small request takes >1000ms on the webapp render host, but the actual rendering on the storage host takes only ~15ms. What does graphite-web do with the other ~980ms on the render host?

When we issue the same render request via wget from the render host, it takes only 180ms, not 1000ms. Has anyone else noticed render performance issues like this? I haven't found anything in Graphite so far that would explain it.
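For reference, a comparison like this can be scripted; the sketch below is illustrative only - the hostnames, port and target are placeholders, not the actual hosts from this setup:

import time
from urllib.request import urlopen
from urllib.parse import urlencode

# Placeholders - substitute real hosts and a real target.
TARGET = "sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg'))"
HOSTS = {
    "render frontend": "http://render.example.de",
    "storage backend": "http://storage1.example.de:1444",
}

for name, base in HOSTS.items():
    url = base + "/render/?" + urlencode({"target": TARGET, "format": "json"})
    start = time.monotonic()
    urlopen(url).read()
    print("%s: %.3fs" % (name, time.monotonic() - start))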

deniszh commented 8 years ago

When you say 'latest git', is it really the latest, i.e. 2aee2bc? And does it include 1f56f515 and 71cfdac?

Edit: included the right commits.

Kenterfie commented 8 years ago

I updated everything 3 hours ago, so yes. The latest commit is 2aee2bc, but the problem was the same with 0.9.15.

deniszh commented 8 years ago

Small request needs >1000ms on webapp render host
When we call the same render request over wget from the render host it needs only 180ms not 1000ms.

Sorry, this looks really confusing; I don't really understand which request takes how long. Could you maybe give exact queries?

deniszh commented 8 years ago

I updated everything 3 hours ago, so yes. The latest commit is 2aee2bc, but the problem was the same with 0.9.15.

Ah, then maybe it's something else. Did you have REMOTE_STORE_MERGE_RESULTS enabled on 0.9.15 ?

deniszh commented 8 years ago

Another (probably stupid) question - are you sure that you're really running the master code now, and not still 0.9.15?

Kenterfie commented 8 years ago

In my case it's not necessary. I don't use replication between the two Graphite hosts; both hosts hold different data. I use the render node to have one API endpoint.

And yes, I'm very sure I'm using the latest version. I have compared my webapp folder with the latest git master.
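For context, a federated setup like this is wired up via CLUSTER_SERVERS on the render host; a minimal local_settings.py sketch (hostnames and ports are placeholders) could look like:

# Sketch of local_settings.py on the render (frontend) host.
# Hostnames/ports are placeholders for the two storage backends.
CLUSTER_SERVERS = [
    "storage1.example.de:1444",  # holds one part of the metrics
    "storage2.example.de:1444",  # holds the rest
]
# The backends hold disjoint data, so merging remote results is not
# needed in this topology.
REMOTE_STORE_MERGE_RESULTS = False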

Kenterfie commented 8 years ago

On the render host:

Thu Sep 08 17:27:57 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.900179
Thu Sep 08 17:27:57 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.893999
Thu Sep 08 17:27:58 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.089880
Thu Sep 08 17:27:58 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.084184
Thu Sep 08 17:27:59 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.905862
Thu Sep 08 17:27:59 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.900468
Thu Sep 08 17:28:00 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.901901
Thu Sep 08 17:28:00 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.900201
Thu Sep 08 17:28:01 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.088676
Thu Sep 08 17:28:01 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.084956
Thu Sep 08 17:28:03 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.081588
Thu Sep 08 17:28:03 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.357989
Thu Sep 08 17:28:04 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.908793
Thu Sep 08 17:28:04 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.897626
Thu Sep 08 17:28:05 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.374054
Thu Sep 08 17:28:06 2016 :: Retrieval of sortByMaxima(summarize(..)) took 2.076580
Thu Sep 08 17:28:06 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.084878
Thu Sep 08 17:28:07 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.082674
Thu Sep 08 17:28:07 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.910757
Thu Sep 08 17:28:08 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.899178
Thu Sep 08 17:28:09 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.102875
Thu Sep 08 17:28:09 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.083390
Thu Sep 08 17:28:10 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.099460
Thu Sep 08 17:28:10 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.080182

And on the storage node, the forwarded request from the render host:

Thu Sep 08 17:30:07 2016 :: Retrieval of .. took 0.013766
Thu Sep 08 17:30:08 2016 :: Retrieval of .. took 0.015371
Thu Sep 08 17:30:08 2016 :: Retrieval of .. took 0.015578
Thu Sep 08 17:30:11 2016 :: Retrieval of .. took 0.015855
Thu Sep 08 17:30:11 2016 :: Retrieval of .. took 0.008226
Thu Sep 08 17:30:12 2016 :: Retrieval of .. took 0.014218
Thu Sep 08 17:30:12 2016 :: Retrieval of .. took 0.011763
Thu Sep 08 17:30:13 2016 :: Retrieval of .. took 0.013829
Thu Sep 08 17:30:13 2016 :: Retrieval of .. took 0.017468
Thu Sep 08 17:30:14 2016 :: Retrieval of .. took 0.014424
Thu Sep 08 17:30:14 2016 :: Retrieval of .. took 0.012316
Thu Sep 08 17:30:15 2016 :: Retrieval of .. took 0.010825
Thu Sep 08 17:30:15 2016 :: Retrieval of .. took 0.015276
Thu Sep 08 17:30:17 2016 :: Retrieval of .. took 0.014506
Thu Sep 08 17:30:18 2016 :: Retrieval of .. took 0.013076
Thu Sep 08 17:30:18 2016 :: Retrieval of .. took 0.010940
Thu Sep 08 17:30:19 2016 :: Retrieval of .. took 0.010102
Thu Sep 08 17:30:21 2016 :: Retrieval of .. took 0.009397
Thu Sep 08 17:30:21 2016 :: Retrieval of .. took 0.010097
Thu Sep 08 17:30:22 2016 :: Retrieval of .. took 0.012833
Thu Sep 08 17:30:22 2016 :: Retrieval of .. took 0.014324

When I run the same query as above with wget from the render host against the storage host (time wget "example.de:1444/render=sortByMaxima(..)"), I get this timing:

real    0m0.045s
user    0m0.007s
sys 0m0.000s

Something strange happens on the render host that I don't understand. Both storage hosts respond in under 100ms.

Kenterfie commented 8 years ago

Any other ideas on what causes the huge delay between data gathering and data delivery?

Kenterfie commented 8 years ago

Here is an exact query:

Thu Sep 08 20:15:33 2016 :: Retrieval of sortByMaxima(summarize(statsite.machines.{xy462}.gz1029059_1.{memoryUsage,cpuUsage},'1d','avg')) took 1.083491

Kenterfie commented 8 years ago

OK, I think I found the issue:

Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.194305
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.010299
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.010212
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.016453
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.192053
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.011103
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.016286
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 0.903806
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 1.907195
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 1.086445
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 0.903029
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 1.091595
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 0.924130
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 0.902272

sortByMaxima is 10 times slower, without any reason as far as I can see. The returned data is very small; it should not take 1 second.
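For context, sortByMaxima only reorders series that have already been retrieved - conceptually something like the sketch below (an illustration, not the actual graphite-web implementation) - so the sort itself should be cheap compared to retrieval:

# Conceptual illustration only - not the actual graphite-web code.
def sort_by_maxima(series_list):
    def series_max(series):
        values = [v for v in series if v is not None]
        return max(values) if values else float("-inf")
    # Highest-peaking series first; an O(n log n) sort over in-memory data.
    return sorted(series_list, key=series_max, reverse=True)

print(sort_by_maxima([[1, 5, None], [9, 2], [0, 0]]))  # [[9, 2], [1, 5, None], [0, 0]]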

Kenterfie commented 8 years ago

OK, the numbers above were distorted by caching. The problem still persists:

Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.942750
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.898640
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.903292
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 2.085402
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.900449
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 1.898584

deniszh commented 8 years ago

Graphite clustering is quite complicated. I never tried to measure timings and compare them between frontend and backend. 0.9.15 works quite well; I didn't try it on master, though other people are using clustering on master too. Is that regression visible only on summarize() or globs, or on any requests?

Kenterfie commented 8 years ago

I had tried 0.9.15, but it has too many bugs that are only fixed in later commits, and I don't like merging every necessary commit myself to get a "stable" version. The regression is visible on all kinds of requests. I have now tried different ones, and all are very slow compared to an HTTP request to a storage host directly. I have written a small script to bypass this behaviour for this specific case, but I hope a performance-improved version is released later and someone finds the reason for it. If I had more time I would do it myself.

Small note: I don't know why no stable version has been released for months to fix the most annoying bugs in 0.9.15.

deniszh commented 8 years ago

@Kenterfie - did you try the 0.9.x branch instead of 0.9.15? Did I understand you right that the cluster on current master is still much slower than on 0.9.15?

I dont know, why no stable version is released since month to fix the most annoying bugs in 0.9.15

IIRC 0.9.15 was supposed to be the last 0.9.x release. If we had proper semver semantics, 0.9.16 would be a bugfix release - but we do not have that. Maybe releasing 0.9.16 is a good idea. What do you think, @obfuscurity ?

deniszh commented 8 years ago

Just tested that on 0.9.15 (with a proper 10-minute wait between requests to mitigate the cache), for format=json and a quite complicated target with 3 globs - aliasByNode(nonNegativeDerivative(groupByNode(node*.*.blah*.value, 1, 'sumSeries')), 0). Results:

Backend 1 - 1.5 sec
Backend 2 - 1.6 sec
Frontend (with just the 2 backends above) - 15.3 sec

As I said, clustering is hard. If you can avoid it - do that.

With a second request (when results are still in cache), it's much better:

Backend 1 - 0.9 sec
Backend 2 - 0.8 sec
Frontend (with just 2 backends above) - 1.7 sec

mizhka commented 7 years ago

I've faced an issue with sortByMaxima() too; the server went down due to lack of RAM.

redbaron commented 7 years ago

How do you run Graphite? You have to make sure that the pool of workers Graphite talks to is different from the pool of workers that browsers/end clients talk to. Otherwise extensive wait times and deadlocks are possible.

deniszh commented 7 years ago

@redbaron: it's a nice thing to have dedicated caches for graphite-web, but that's not mandatory. You can run a cluster without problems, up to some extent, on shared caches.

redbaron commented 7 years ago

Unfortunately it is mandatory. Simple example:

2 nodes with 1 worker thread each, each configured to talk to the same port when making graphite -> graphite requests.

2 browser requests come in; each graphite-web tries to contact the other Graphite, and those requests wait for a thread to become free and ready to serve them. Everything just deadlocks until the timeouts start to kick in, and when they do, both requests return an error.

It doesn't really matter how many threads you run: as long as there is enough activity from browsers/clients, and given that each request causes 2*(N-1) requests to other Graphite servers (first to find a metric and then to fetch it), it's not long until deadlocks like the one described above start happening.
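The scenario can be sketched with a toy script (illustrative only, not graphite-web code): two single-threaded HTTP servers that each call the other while serving a client request stall until the intra-cluster timeout fires.

import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

def make_handler(peer_port):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/data":
                payload = b"datapoints"            # intra-cluster request: answer directly
            else:
                time.sleep(1)                      # let both servers accept their browser request
                try:                               # ask the peer for its share of the data
                    payload = urlopen("http://127.0.0.1:%d/data" % peer_port, timeout=5).read()
                except OSError:
                    payload = b"peer timed out"    # the deadlock is only broken by this timeout
            try:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(payload)
            except BrokenPipeError:
                pass                               # caller may have given up already
        def log_message(self, *_):                 # keep the demo quiet
            pass
    return Handler

# One single-threaded worker per "node", each pointing at the other.
servers = [HTTPServer(("127.0.0.1", 8001), make_handler(8002)),
           HTTPServer(("127.0.0.1", 8002), make_handler(8001))]
for srv in servers:
    threading.Thread(target=srv.serve_forever, daemon=True).start()

def browser(port):
    start = time.monotonic()
    body = urlopen("http://127.0.0.1:%d/" % port, timeout=15).read()
    print("port %d -> %r after %.1fs" % (port, body, time.monotonic() - start))

# Two browser requests arrive at once: both workers end up waiting on each
# other, and both answers only come back after the 5-second peer timeout.
clients = [threading.Thread(target=browser, args=(p,)) for p in (8001, 8002)]
for c in clients:
    c.start()
for c in clients:
    c.join()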

deniszh commented 7 years ago

That's a really nice thought experiment, but I have 0 issues on my Graphite clusters, at least with a load of 0.5 - 1.5 M metrics/min write and 2000-5000 K metrics/min read. So, no, that's not mandatory. :)

redbaron commented 7 years ago

We had deadlocks all over; the whole cluster was grinding to a halt until I separated the pools.

deniszh commented 7 years ago

So, that's my point - YMMV. It's nice to try separate pools in case of deadlocks, but 99% of users will never hit that problem. Also, IIRC you have quite a big load, around 6M/min, right? You're even hitting cache problems in go-carbon - maybe that's the case.

deniszh commented 7 years ago

I'm also wondering how you can hit deadlocks on a twisted-based daemon... Maybe because of some race conditions or other bugs in carbon or twisted under high load.

redbaron commented 7 years ago

My setup: 8 nodes, 12M metrics/minute received with a replication factor of 2, so 24M/minute persisted. Graphite is run with apache + mod_wsgi, but I can't see how twisted would be different here.

There are not that many requests coming from browsers: 1200-1500/minute, but the graphs are quite big, with JSON response sizes as high as 500KB (10% are >225KB).

I configure apache with 2 WSGI process groups on ports 8000 and 8010, both pointing to the same graphite install, with CLUSTER_SERVERS in local_settings.py pointing to port 8010. Browsers/clients go to port 8000 only.

This way, workers on port 8000 always wait for workers on port 8010 and never the other way around. It eliminated lockups and improved responsiveness (although I didn't measure the response-time effect, so I can't prove it with numbers).
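A sketch of the settings side of that split (the Apache side would define two WSGI process groups, one listening on 8000 for clients and one on 8010 for intra-cluster traffic; the hostnames below are placeholders):

# Sketch only - local_settings.py shared by both WSGI process groups.
# Browsers hit port 8000; graphite-web's own cluster requests go to the
# second pool on port 8010, so client-facing workers wait on cluster
# workers but never the other way around.
CLUSTER_SERVERS = [
    "graphite-node1.example.com:8010",
    "graphite-node2.example.com:8010",
    # ... one entry per cluster member, always the intra-cluster port
]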

obfuscurity commented 7 years ago

@redbaron Can you provide a diagram of your setup?

deniszh commented 7 years ago

8 nodes, 12M metrics/minute received with a replication factor of 2, so 24M/minute persisted

24M per 8 nodes, i.e. 3M per node, right?

There are not that many requests coming from browsers: 1200-1500/minute, but the graphs are quite big, with JSON response sizes as high as 500KB (10% are >225KB).

Do you have maxDataPoints implemented, btw? I just checked - even on a 6-month graph I see 100-200K responses max.

I configure apache with 2 WSGI process groups on ports 8000 and 8010, both pointing to the same graphite install, with CLUSTER_SERVERS in local_settings.py pointing to port 8010.

Ah, you mean separate pools of graphite-web in the WSGI config for client and intra-cluster traffic? Got it. I thought you meant a separate carbon daemon for carbonlink queries and writes. I never had this problem at my load, neither on Apache + WSGI nor on Nginx + Gunicorn - but I think you can simply increase the number of workers on port 8000.

reyjrar commented 7 years ago

I can confirm this with the upgrade to master; I pulled yesterday to ed04b23. The only difference is 0.9.15 to master. Render times are noticeably through the roof. I didn't change the config and had CLUSTER_SERVERS enabled in the past to merge results. I'll have a graph of the performance degradation soon, but it affects all queries, not just sortByMaxima. I can reproduce the results with a single metric with no transforms.

Actually, I lied: I didn't have request timing enabled in Apache, nor did I have render timing enabled in graphite-web.

Here's a gist with request timing on master

obfuscurity commented 7 years ago

I don't know if anyone noticed this the first time, but @cbowman0's https://github.com/graphite-project/graphite-web/issues/1063#issuecomment-240763266 appears to demonstrate a 2s latency in cluster requests in the best case and in one case (datastore_3), it looks like 40s.

@bmhatfield does that look correct to you?

cbowman0 commented 7 years ago

That cluster wasn't in the best of states when I ran those tests. It's my disaster recovery/standby cluster that needed/needs reconfiguration to be viable. It has full input load for carbon, but essentially no read load so I could do isolated traffic capture of graphite-web requests.

More than a couple of carbon caches on at least datastore_3 were at MAX_CACHE_SIZE, and that seems to cause the carbonlink lookups to time out frequently on my carbon build (an old megacarbon fork). Since local fetches to whisper files are done serially, and thus carbonlink lookups for each one, that request was expected to be extra long. Essentially, those numbers are probably closer to the worst case than the best.

I've been meaning to upgrade graphite-web to a version from after the merge-result-set changes were merged. I will attempt to get some timings of some queries to see if I can see any difference in performance in my setup.

obfuscurity commented 7 years ago

OK, that's fair. I think it's still fair to acknowledge we have some performance issues as far as parallel cluster requests go. I expect to be fairly occupied over the next couple of days, but I'll try to set up a test cluster scenario this weekend.

deniszh commented 7 years ago

Render times are noticeably through the roof. I didn't change the config and had CLUSTER_SERVERS enabled in the past to merge results.

@reyjrar, could you please test whether REMOTE_STORE_MERGE_RESULTS = False makes things better for master or not? I'm trying to find out whether that's the main culprit now, or whether it's the lack of parallel requests. I have plans to test rendering on master too on one of my clusters, but I'm not sure I'll have time this week...

reyjrar commented 7 years ago

@deniszh REMOTE_STORE_MERGE_RESULTS = False had zero effect

Here's render times with CLUSTER_SERVERS commented out:

Thu Sep 15 14:22:48 2016 :: Total rendering time 0.161674 seconds
Thu Sep 15 14:22:50 2016 :: Total rendering time 0.169428 seconds

Adding it back in with REMOTE_STORE_MERGE_RESULTS = False:

Thu Sep 15 14:23:39 2016 :: Total rendering time 4.806623 seconds
Thu Sep 15 14:23:48 2016 :: Total rendering time 4.845406 seconds

Also, for posterity, timing when REMOTE_STORE_MERGE_RESULTS = True:

Thu Sep 15 14:27:02 2016 :: Total rendering time 4.854576 seconds
Thu Sep 15 14:27:15 2016 :: Total rendering time 4.806077 seconds

reyjrar commented 7 years ago

FWIW, I just ran git reset --hard 6e6c77abd2fad247d97b6a0adce8b595d2f7daba to rule out @redbaron's patch. The bug exists prior to that commit.

reyjrar commented 7 years ago

My setup: 8 carbon-cache instances per node, 4 nodes. carbon-c-relays in front doing aggregation and sending to all 8 carbon-caches. Graphite-web is configured to talk to the same 8 backend carbon-caches.

Re-reading this thread, it sounds like I need to spin up extra cache instances for graphite-web?

What's weird is that, if the problem existed on 0.9.15, it was not as pronounced. I.e., I updated earlier this week and we instantly began to see timeouts on Grafana dashboards communicating with the graphite-web servers. This is why I chirped up here. There may have been deadlocks before, but they were rare in my 0.9.15 setup. They are now commonplace building off master.

deniszh commented 7 years ago

@reyjrar - could you please test this PR: https://github.com/graphite-project/graphite-web/pull/1699 ? I tried to tag you there but it seems I misread your GitHub handle...

reyjrar commented 7 years ago

Sure thing, I'll test tomorrow morning.

obfuscurity commented 7 years ago

Just a general note that this is a hard blocker for the next release.

deniszh commented 7 years ago

I have plans to use the newly open-sourced pyflame profiler http://eng.uber.com/pyflame/ to find where the culprit is.

obfuscurity commented 7 years ago

I've been looking at this for a while and I'm not able to reproduce anything that suggests a threading performance or deadlock issue. @Kenterfie Is it possible that the pickle rendering (or network transit) is taking longer than expected?

Here are some sample datapoints requesting one large query at a time. There's a single frontend with three backend storage nodes.

C0 - laptop client
B0 - backend storage
B1 - backend storage
B2 - backend storage
F0 - frontend render
(C0) $ curl -k "https://${F0}/render/?target=collectd.*.cpu-*.cpu-*&format=json" 2>&1 >/dev/null

(B0) Wed Oct 05 15:12:25 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.364455
(B0) Wed Oct 05 15:12:25 2016 :: Total pickle rendering time 0.424098
(B1) Wed Oct 05 15:12:25 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.556552
(B1) Wed Oct 05 15:12:25 2016 :: Total pickle rendering time 0.634059
(B2) Wed Oct 05 15:12:23 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.185500
(B2) Wed Oct 05 15:12:23 2016 :: Total pickle rendering time 0.206219
(F0) Wed Oct 05 15:12:25 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 1.124653
(F0) Wed Oct 05 15:12:27 2016 :: Total json rendering time 3.208207

(B0) Wed Oct 05 15:14:24 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.334817
(B0) Wed Oct 05 15:14:24 2016 :: Total pickle rendering time 0.392128
(B1) Wed Oct 05 15:14:24 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.467109
(B1) Wed Oct 05 15:14:24 2016 :: Total pickle rendering time 0.541729
(B2) Wed Oct 05 15:14:22 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.166958
(B2) Wed Oct 05 15:14:22 2016 :: Total pickle rendering time 0.186709
(F0) Wed Oct 05 15:14:24 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.802444
(F0) Wed Oct 05 15:14:26 2016 :: Total json rendering time 2.822548

(B0) Wed Oct 05 15:50:51 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.307690
(B0) Wed Oct 05 15:50:51 2016 :: Total pickle rendering time 0.366291
(B1) Wed Oct 05 15:50:51 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.498106
(B1) Wed Oct 05 15:50:51 2016 :: Total pickle rendering time 0.573213
(B2) Wed Oct 05 15:50:49 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.170795
(B2) Wed Oct 05 15:50:49 2016 :: Total pickle rendering time 0.190676
(F0) Wed Oct 05 15:50:51 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.888467
(F0) Wed Oct 05 15:50:53 2016 :: Total json rendering time 2.889416

Note that I believe that something is causing a performance regression for you guys, but it's been very difficult to reproduce the conditions in such a way that will point to precisely where we should begin looking.

obfuscurity commented 7 years ago

Apologies, I misunderstood the problem. I thought we were trying to track down a gap in time between when the backend retrieval finished and when rendering began. I now see that we're talking about how long retrieval takes on the frontend node for local metrics when clustering is enabled.

obfuscurity commented 7 years ago

Regardless, I'm still not seeing any "regression" between master and 0.9.15. Both systems behave identically for the tests I've performed.

Kenterfie commented 7 years ago

Thanks @obfuscurity for your tests. This shows exactly the same behavior I noticed. The regression is very small, but the problem itself exists. The time differences between backend and frontend are, in my opinion, too high - in my case mostly a factor of 10. I don't know if you want to fix this in the next version, but currently it is impossible to use the Graphite cluster as intended. I have to use the backends directly to bypass the performance issue.

obfuscurity commented 7 years ago

@Kenterfie My results do not correlate to your findings. The Retrieval timer reported by the frontend represents the total (or at least, the time required to merge the required union) of all CLUSTER_SERVERS (and local) retrievals. Hence, it's only natural that disabling CLUSTER_SERVERS on the frontend would result in drastically reduced query times (it's only retrieving locally).

This is what I find so frustrating... I'm unable to reproduce anything that resembles your or @reyjrar's results.

reyjrar commented 7 years ago

Further testing based on a Slack call with @obfuscurity:

https://gist.github.com/reyjrar/4c0ad075a528d4d68e14b8b1427a99be

Setup is 2 local nodes in each DC, 2 DCs.

reyjrar commented 7 years ago

Testing the effect of CLUSTER_SERVERS recursively

# Front with CLUSTER_SERVERS, Backend w/o
Thu Oct 06 14:39:07 2016 :: Retrieval of carbon.agents.*.updateOperations took 0.116185
Thu Oct 06 14:39:07 2016 :: Rendered PNG in 0.275612 seconds
Thu Oct 06 14:39:07 2016 :: Total rendering time 0.393449 seconds
# Front and back ends with CLUSTER_SERVERS set to each other's IPs
Thu Oct 06 14:37:25 2016 :: Retrieval of carbon.agents.*.updateOperations took 0.059602
Thu Oct 06 14:37:26 2016 :: Rendered PNG in 0.266713 seconds
Thu Oct 06 14:37:26 2016 :: Total rendering time 0.327662 seconds

The differences between these results weren't statistically significant.

reyjrar commented 7 years ago

I'm wondering if there's an optimization to be done here, i.e., keep-alive or connection pooling? I can only venture a guess that the whole TCP setup/teardown to the CLUSTER_SERVERS is happening before the results are returned and/or merged.

reyjrar commented 7 years ago

So, I noticed this performance hit is real in /metrics/find as well, so I diverted my attention there. I think, and I might be wrong, that the caching in remote_storage.py is broken: I am never seeing a cached result.

I don't have MEMCACHED_HOSTS set, so I noticed that my CACHES dict is empty. I am attempting to override that in local_settings.py with:

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.db.DatabaseCache',
        'LOCATION': 'caching',
    }
}

Then I ran: django-admin.py --pythonpath=<WEBAPP> --settings=graphite.settings createcachetable caching

This created the caching table in my database. But after several minutes of making requests, I'm not seeing anything in the caching table of my database.

UPDATE: Non-issue, please ignore - caching is irrelevant, and I had not imported the Django core settings. Once I enabled that cache setting, caching worked, but again this isn't the heart of the issue.

obfuscurity commented 7 years ago

It would appear that the remote fetches are not happening in parallel, but sequentially. Note the results in the gist below: the first few (one for each of my CLUSTER_SERVERS) represent the initial connection setup; subsequent fetches reuse these connections and are much faster, but all of them are still blocking in order.

@Kenterfie would you mind applying the datalib.py patch here and gisting your results?

https://gist.github.com/obfuscurity/9f5f6c59d700a6ec1ec41585bc8ecaf8
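For what it's worth, the kind of change being probed here - issuing the remote fetches concurrently instead of one after another - can be sketched as below. This is only an illustration of the idea, not the actual datalib.py patch; fetch_all and max_workers are hypothetical names, and the node objects stand in for what fetch_data() works with:

# Illustration of the idea only - not the actual datalib.py code.
from concurrent.futures import ThreadPoolExecutor

def fetch_all(matching_nodes, start_time, end_time, max_workers=8):
    leaves = [node for node in matching_nodes if node.is_leaf]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Start every fetch up front; total wall time is then bounded by the
        # slowest remote rather than the sum of all of them.
        futures = [(node, pool.submit(node.fetch, start_time, end_time))
                   for node in leaves]
        return [(node, future.result()) for node, future in futures]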

cbowman0 commented 7 years ago

I don't think that is how it works, @obfuscurity. My understanding is that the loop starting with for node, results in fetches: serializes the consumption of the results of the fetch requests. Those will always happen in sequential order, and the first one (or few) should take longer than the rest, since they are going to block on the remote store responding. To calculate how long each remote fetch takes (along with the start/stop times), the instrumentation would need to be added to node.fetch() or within the fetches = [(node, node.fetch(startTime, endTime)) for node in matching_nodes if node.is_leaf] line, as that's when the remote request starts.
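A self-contained sketch of where such per-fetch timing could sit (the stub node class below is only a stand-in for the real node objects; it is not graphite-web code):

import time

class StubNode:
    # Stand-in for a real graphite node object; only what the sketch needs.
    path = "collectd.*.cpu-*.cpu-*"
    is_leaf = True
    def fetch(self, start_time, end_time):
        time.sleep(0.2)  # pretend this is the remote request
        return ((start_time, end_time, 60), [1, 2, 3])

def timed_fetch(node, start_time, end_time):
    # Wrap the fetch call itself, so the measured time covers the remote request.
    t0 = time.time()
    result = node.fetch(start_time, end_time)
    print("fetch of %s took %.6f" % (node.path, time.time() - t0))
    return result

matching_nodes = [StubNode(), StubNode()]
fetches = [(node, timed_fetch(node, 0, 3600))
           for node in matching_nodes if node.is_leaf]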

The datalib.py code in master is rather different from the one in the 0.9.x tree, but the major differences I see are:

  1. In 0.9.x, STORE.find is only called for the local store. All of the remote nodes are sent the fetch request without sending a find() first. In master, we reduce the remote stores that we issue the fetch request to by first sending a find request in the line that assigns matching_nodes.
  2. In 0.9.x the carbonlink supports a bulk query option, but master does not.
  3. 0.9.x has a prefetch of remote data, but master does not.

With regard to #1, I think that's a wash, because the subsequent fetch() will result in a local find() call anyway on the remote nodes, and that one should be cached and/or faster because the data would be in the kernel's disk cache. We could test bypassing those find calls by converting STORE.remote_stores into the datatype that the matching_nodes list contains.

For #2, I'm not sure where in the master code to reimplement that feature. Maybe in datalib.py:fetch_data() we could loop over matching_nodes and extract all metrics found, do the bulk carbonlink query, and pass that cached data to every WhisperReader.fetch(), as an example?

For #3, I don't know enough about the implementation to know whether it would make a difference here.

obfuscurity commented 7 years ago

@cbowman0 I absolutely agree that the per-node fetches are serialized. However, if you look at my logs, it would appear that all of the fetches are serial, according to the timestamps (one does not start until the previous is finished). I have a hard time believing that none of these fetches would have staggered run times. Please correct me if you think I'm misreading the output.