Closed: @Kenterfie closed this issue 7 years ago.
When you say 'latest git', is it really the latest, i.e. 2aee2bc? And does it include 1f56f515 and 71cfdac?
Edit: included the right commits.
I updated everything 3 hours ago, so yes. The latest commit is 2aee2bc, but the problem was the same with 0.9.15.
A small request needs >1000ms on the webapp render host. When we call the same render request via wget from the render host, it takes only 180ms, not 1000ms.
Sorry, this looks really confusing; I don't really understand which request takes what. Could you maybe give exact queries?
I updated everything 3 hours ago, so yes. The latest commit is 2aee2bc, but the problem was the same with 0.9.15.
Ah, then maybe it's something else. Did you have REMOTE_STORE_MERGE_RESULTS enabled on 0.9.15?
Another (probably stupid) question: are you sure that you're really running the master code now, and not still 0.9.15?
In my case that's not necessary. I don't use replication between the two graphite hosts; both hosts hold different data. I use the render node to have one API endpoint.
And yes, I'm very sure I use the latest version. I have compared my webapp folder with the latest git master.
On the render host:
Thu Sep 08 17:27:57 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.900179
Thu Sep 08 17:27:57 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.893999
Thu Sep 08 17:27:58 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.089880
Thu Sep 08 17:27:58 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.084184
Thu Sep 08 17:27:59 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.905862
Thu Sep 08 17:27:59 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.900468
Thu Sep 08 17:28:00 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.901901
Thu Sep 08 17:28:00 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.900201
Thu Sep 08 17:28:01 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.088676
Thu Sep 08 17:28:01 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.084956
Thu Sep 08 17:28:03 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.081588
Thu Sep 08 17:28:03 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.357989
Thu Sep 08 17:28:04 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.908793
Thu Sep 08 17:28:04 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.897626
Thu Sep 08 17:28:05 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.374054
Thu Sep 08 17:28:06 2016 :: Retrieval of sortByMaxima(summarize(..)) took 2.076580
Thu Sep 08 17:28:06 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.084878
Thu Sep 08 17:28:07 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.082674
Thu Sep 08 17:28:07 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.910757
Thu Sep 08 17:28:08 2016 :: Retrieval of sortByMaxima(summarize(..)) took 0.899178
Thu Sep 08 17:28:09 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.102875
Thu Sep 08 17:28:09 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.083390
Thu Sep 08 17:28:10 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.099460
Thu Sep 08 17:28:10 2016 :: Retrieval of sortByMaxima(summarize(..)) took 1.080182
And on the storage node, the forwarded request from the render host:
Thu Sep 08 17:30:07 2016 :: Retrieval of .. took 0.013766
Thu Sep 08 17:30:08 2016 :: Retrieval of .. took 0.015371
Thu Sep 08 17:30:08 2016 :: Retrieval of .. took 0.015578
Thu Sep 08 17:30:11 2016 :: Retrieval of .. took 0.015855
Thu Sep 08 17:30:11 2016 :: Retrieval of .. took 0.008226
Thu Sep 08 17:30:12 2016 :: Retrieval of .. took 0.014218
Thu Sep 08 17:30:12 2016 :: Retrieval of .. took 0.011763
Thu Sep 08 17:30:13 2016 :: Retrieval of .. took 0.013829
Thu Sep 08 17:30:13 2016 :: Retrieval of .. took 0.017468
Thu Sep 08 17:30:14 2016 :: Retrieval of .. took 0.014424
Thu Sep 08 17:30:14 2016 :: Retrieval of .. took 0.012316
Thu Sep 08 17:30:15 2016 :: Retrieval of .. took 0.010825
Thu Sep 08 17:30:15 2016 :: Retrieval of .. took 0.015276
Thu Sep 08 17:30:17 2016 :: Retrieval of .. took 0.014506
Thu Sep 08 17:30:18 2016 :: Retrieval of .. took 0.013076
Thu Sep 08 17:30:18 2016 :: Retrieval of .. took 0.010940
Thu Sep 08 17:30:19 2016 :: Retrieval of .. took 0.010102
Thu Sep 08 17:30:21 2016 :: Retrieval of .. took 0.009397
Thu Sep 08 17:30:21 2016 :: Retrieval of .. took 0.010097
Thu Sep 08 17:30:22 2016 :: Retrieval of .. took 0.012833
Thu Sep 08 17:30:22 2016 :: Retrieval of .. took 0.014324
When I run the same query as above with wget from the render host against the storage host,
time wget "example.de:1444/render/?target=sortByMaxima(..)"
I get this timing:
real 0m0.045s
user 0m0.007s
sys 0m0.000s
Something strange happens on the render host which I don't understand; both storage hosts respond in under 100ms.
Any other ideas what causes the huge delay between data gathering and data delivery?
Here is an exact query:
Thu Sep 08 20:15:33 2016 :: Retrieval of sortByMaxima(summarize(statsite.machines.{xy462}.gz1029059_1.{memoryUsage,cpuUsage},'1d','avg')) took 1.083491
OK, I think I found the issue:
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.194305
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.010299
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.010212
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.016453
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.192053
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.011103
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.016286
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 0.903806
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 1.907195
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 1.086445
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 0.903029
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 1.091595
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 0.924130
Retrieval of sortByMaxima(summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg')) took 0.902272
sortByMaxima is 10 times slower, without a reason in my opinion; the returned data is very small and should not take 1 second.
OK, the above log was distorted by caching. The problem still persists:
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.942750
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.898640
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.903292
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 2.085402
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 0.900449
Retrieval of summarize(statsite.nodeXY.{memoryUsage,cpuUsage},'1d','avg') took 1.898584
Graphite clustering is quite complicated. I never tried to measure timings and compare them between frontend and backend. 0.9.15 works quite well; I didn't try it on master, though other people are using clustering on master too. Is the regression visible only on summarize() or globs, or on any request?
I had tried to use 0.9.15, but it has too many bugs which are only fixed in later commits, and I don't like having to cherry-pick every necessary commit myself to get a "stable" version. The regression is visible on all kinds of requests: I have now tried different ones, and all are very slow compared to an HTTP request to the storage host directly. I have written a small script to bypass this behaviour for this specific case, but I hope a performance-improved version is released later and someone finds the reason for it. If I had more time I would do it myself.
Small note: I don't know why no stable version has been released in months to fix the most annoying bugs in 0.9.15.
@Kenterfie - did you try the 0.9.x branch instead of 0.9.15? Did I understand you right that a cluster on current master is still much slower than on 0.9.15?
I don't know why no stable version has been released in months to fix the most annoying bugs in 0.9.15
IIRC 0.9.15 should be the last 0.9.x release. If we had proper semver semantics, 0.9.16 would be a bugfix release - but we do not have that. Maybe releasing 0.9.16 is a good idea. What do you think, @obfuscurity?
Just tested that on 0.9.15 (with proper waiting of 10 minutes between requests to mitigate the cache), for format=json
and a quite complicated target with 3 globs - aliasByNode(nonNegativeDerivative(groupByNode(node*.*.blah*.value, 1, 'sumSeries')), 0)
Results -
Backend 1 - 1.5 sec
Backend 2 - 1.6 sec
Frontend (with just 2 backends above) - 15.3 sec
As I said, clustering is hard. If you can avoid that - do that.
With a second request (when results are still in cache), it's much better -
Backend 1 - 0.9 sec
Backend 2 - 0.8 sec
Frontend (with just 2 backends above) - 1.7 sec
I've faced an issue with sortByMaxima() too; the server went down due to lack of RAM.
How do you run graphite? You have to make sure that the pool of workers graphite talks to is different from the pool of workers browsers/end clients talk to. Otherwise extensive wait times and deadlocks are possible.
@redbaron: it's a nice thing to have dedicated caches for graphite-web, but that's not mandatory. You can run cluster w/o any problems up to some extent on shared caches.
Unfortunately it is mandatory. Simple example:
2 nodes with 1 worker thread each, each configured to talk to the same port when making graphite -> graphite requests.
2 browser requests come in; each graphite-web tries to contact the other graphite, and these requests wait until a thread becomes free and ready to serve them. Everything just deadlocks until timeouts start to kick in, and when they do, both requests return errors.
It doesn't really matter how many threads you run: as long as there is enough activity from browsers/clients, and given that each request causes 2*(N-1) requests to the other graphite servers (first to find the metric and then to fetch it), it's not long until deadlocks like the one described above start happening.
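The fan-out arithmetic above can be made concrete with a tiny sketch (not graphite-web code, just the back-of-the-envelope math from this comment):

```python
def intra_cluster_requests(cluster_size, incoming_requests=1):
    """Each incoming render request triggers two intra-cluster calls
    (a metrics find plus a data fetch) to every *other* node,
    i.e. 2 * (N - 1) requests per incoming request."""
    return incoming_requests * 2 * (cluster_size - 1)

# With 2 nodes and 1 worker thread each, a single browser request
# hitting each node already generates 2 intra-cluster requests per
# node -- enough to occupy both workers with nobody left to answer.
print(intra_cluster_requests(2))                        # per-request fan-out on a 2-node cluster
print(intra_cluster_requests(8, incoming_requests=1500))  # fan-out at 1500 req/min on 8 nodes
```

This is why the deadlock risk grows with both cluster size and client traffic when the same worker pool serves browsers and peers.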
That's a really nice thought experiment, but I have zero issues on my graphite clusters, at least at a load of 0.5 - 1.5M metrics/min write and 2000-5000K metrics/min read. So, no, it's not mandatory. :)
We had deadlocks all over; the whole cluster was grinding to a halt until I separated the pools.
So, that's my point - YMMV. It's nice to try separate pools in case of deadlocks, but 99% of users will never hit that problem. Also, IIRC you have quite a big load, around 6M/min, right? You're even hitting cache problems in go-carbon - maybe that's the case here.
I'm also wondering how you can hit deadlocks on a twisted-based daemon... Maybe because of race conditions or other bugs in carbon or twisted under high load.
My setup: 8 nodes, 12M metrics/minute received with a replication factor of 2, so 24M/minute persisted. Graphite is run with apache + mod_wsgi, but I can't see how twisted would be different here.
There are not that many requests coming from browsers - 1200-1500/minute - but the graphs are quite big, with JSON response sizes as high as 500KB (10% are >225KB).
I configure apache with 2 WSGI process groups on ports 8000 and 8010, both pointing to the same graphite install, with CLUSTER_SERVERS in local_settings.py pointing to port 8010. Browsers/clients go to port 8000 only.
This way workers on port 8000 always wait on workers on port 8010, and never the other way around. It eliminated lockups and improved responsiveness (although I didn't measure the response-time effect, so I can't prove it with numbers).
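A minimal sketch of that two-pool mod_wsgi layout. This is illustrative only: the paths, process counts, and vhost names are assumptions, not the poster's actual config.

```apache
# Pool 1: serves browsers/end clients on port 8000.
Listen 8000
<VirtualHost *:8000>
    WSGIDaemonProcess graphite-frontend processes=4 threads=1
    WSGIScriptAlias / /opt/graphite/conf/graphite.wsgi process-group=graphite-frontend
</VirtualHost>

# Pool 2: serves only intra-cluster requests on port 8010.
# local_settings.py on every node lists its peers as host:8010 in
# CLUSTER_SERVERS, so frontend workers wait on this pool and never
# on each other.
Listen 8010
<VirtualHost *:8010>
    WSGIDaemonProcess graphite-cluster processes=4 threads=1
    WSGIScriptAlias / /opt/graphite/conf/graphite.wsgi process-group=graphite-cluster
</VirtualHost>
```

The key property is the strict one-way dependency: port 8000 workers can block on port 8010 workers, but nothing ever blocks in the reverse direction, so the mutual-wait cycle described earlier cannot form.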
@redbaron Can you provide a diagram of your setup?
8 nodes, 12M metrics/minute received with replication factor of 2, so 24M /minute persisted
24M per 8 nodes, i.e. 3M per node, right?
there are not that many requests coming from browsers: 1200-1500 / minute, but graphs are quite big, with json response sizes being as high as 500KB (10% are >225KB).
Do you have maxDataPoints implemented, btw? Just checked - even on a 6-month graph I see 100-200K responses max.
I configure apache with 2 WSGI process groups on ports 8000 and 8010, both pointing to the same graphite install with same with CLUSTER_SERVERS in local_settings.py pointing to port 8010.
Ah, you mean separate pools of graphite-web in the WSGI config for client and intra-cluster traffic? Got it. I thought you meant a separate carbon daemon for carbonlink queries and writes. I never had this problem at my load, neither on Apache + WSGI nor on Nginx + Gunicorn - but I think you can simply increase the number of workers on port 8000.
Can confirm this after the upgrade to master; I pulled yesterday at ed04b23, so the only difference is 0.9.15 vs master. Render times are noticeably through the roof. I didn't change the config and had CLUSTER_SERVERS enabled in the past to merge results. I'll have a graph of the performance degradation soon, but it affects all queries, not just sortByMaxima. I can reproduce the results with a single metric with no transforms.
Actually, I lied: I didn't have request timing enabled in Apache, nor did I have render timing enabled in graphite-web.
I don't know if anyone noticed this the first time, but @cbowman0's https://github.com/graphite-project/graphite-web/issues/1063#issuecomment-240763266 appears to demonstrate a 2s latency in cluster requests in the best case, and in one case (datastore_3) it looks like 40s.
@bmhatfield does that look correct to you?
That cluster wasn't in the best of states when I ran those tests. It's my disaster recovery/standby cluster that needed/needs reconfiguration to be viable. It has the full input load for carbon but essentially no read load, so I could do isolated traffic captures of graphite-web requests.
More than a couple of carbon caches on at least datastore_3 were at MAX_CACHE_SIZE, and that seems to cause the carbonlink lookups to time out frequently on my carbon build (an old megacarbon fork). Since local fetches to whisper files are done serially, and thus the carbonlink lookups for each one as well, that request was expected to be extra long. Essentially, those numbers are probably closer to worst case than best.
I've been meaning to upgrade graphite-web to a version after the merge-result-set changes were merged. I will attempt to get timings of some queries to see if I can spot any difference in performance in my setup.
OK, that's fair. I think it's still fair to acknowledge we have some performance issues as far as parallel cluster requests are concerned. I expect to be fairly occupied over the next couple of days, but I'll try to set up a test cluster scenario this weekend.
Render times are noticeably through the roof. I didn't change the config and had CLUSTER_SERVERS enabled in the past to merge results.
@reyjrar, could you please test whether REMOTE_STORE_MERGE_RESULTS = False makes things better on master or not? I'm trying to find out whether that's the main culprit or it's the lack of parallel requests. I have plans to test rendering on master on one of my clusters too, but I'm not sure I'll have time this week...
@deniszh REMOTE_STORE_MERGE_RESULTS = False had zero effect
Here are the render times with CLUSTER_SERVERS commented out:
Thu Sep 15 14:22:48 2016 :: Total rendering time 0.161674 seconds
Thu Sep 15 14:22:50 2016 :: Total rendering time 0.169428 seconds
Adding it back in with REMOTE_STORE_MERGE_RESULTS = False
Thu Sep 15 14:23:39 2016 :: Total rendering time 4.806623 seconds
Thu Sep 15 14:23:48 2016 :: Total rendering time 4.845406 seconds
Also, for posterity, timing when REMOTE_STORE_MERGE_RESULTS = True
Thu Sep 15 14:27:02 2016 :: Total rendering time 4.854576 seconds
Thu Sep 15 14:27:15 2016 :: Total rendering time 4.806077 seconds
FWIW, I just ran git reset --hard 6e6c77abd2fad247d97b6a0adce8b595d2f7daba to rule out @redbaron's patch. The bug exists prior to that commit.
My setup: 8 carbon-cache instances per node, 4 nodes. carbon-c-relay instances in front do the aggregation and send to all 8 carbon-caches. graphite-web is configured to talk to the same 8 backend carbon-caches.
Re-reading this thread, it sounds like I need to spin up extra cache instances for graphite-web?
What's weird is that if the problem existed on 0.9.15, it was not as pronounced. I.e., I updated earlier this week and we instantly began to see timeouts on grafana dashboards communicating with the graphite-web servers, which is why I chirped up here. There may be existing deadlocks, but they were rare in my 0.9.15 setup; they are now commonplace building off master.
@reyjrar - could you please test this PR: https://github.com/graphite-project/graphite-web/pull/1699? I tried to tag you there, but it seems I misread your github handle...
Sure thing, I'll test tomorrow morning.
Just a general note that this is a hard blocker for the next release.
I have plans to use the newly open-sourced pyflame profiler http://eng.uber.com/pyflame/ to find where the culprit is.
I've been looking at this for a while and I'm not able to reproduce anything that suggests a threading performance or deadlock issue. @Kenterfie Is it possible that the pickle rendering (or network transit) is taking longer than expected?
Here are some sample datapoints, requesting one large query at a time. There's a single frontend with three backend storage nodes.
C0 - laptop client
B0 - backend storage
B1 - backend storage
B2 - backend storage
F0 - frontend render
(C0) $ curl -k "https://${F0}/render/?target=collectd.*.cpu-*.cpu-*&format=json" 2>&1 >/dev/null
(B0) Wed Oct 05 15:12:25 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.364455
(B0) Wed Oct 05 15:12:25 2016 :: Total pickle rendering time 0.424098
(B1) Wed Oct 05 15:12:25 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.556552
(B1) Wed Oct 05 15:12:25 2016 :: Total pickle rendering time 0.634059
(B2) Wed Oct 05 15:12:23 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.185500
(B2) Wed Oct 05 15:12:23 2016 :: Total pickle rendering time 0.206219
(F0) Wed Oct 05 15:12:25 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 1.124653
(F0) Wed Oct 05 15:12:27 2016 :: Total json rendering time 3.208207
(B0) Wed Oct 05 15:14:24 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.334817
(B0) Wed Oct 05 15:14:24 2016 :: Total pickle rendering time 0.392128
(B1) Wed Oct 05 15:14:24 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.467109
(B1) Wed Oct 05 15:14:24 2016 :: Total pickle rendering time 0.541729
(B2) Wed Oct 05 15:14:22 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.166958
(B2) Wed Oct 05 15:14:22 2016 :: Total pickle rendering time 0.186709
(F0) Wed Oct 05 15:14:24 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.802444
(F0) Wed Oct 05 15:14:26 2016 :: Total json rendering time 2.822548
(B0) Wed Oct 05 15:50:51 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.307690
(B0) Wed Oct 05 15:50:51 2016 :: Total pickle rendering time 0.366291
(B1) Wed Oct 05 15:50:51 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.498106
(B1) Wed Oct 05 15:50:51 2016 :: Total pickle rendering time 0.573213
(B2) Wed Oct 05 15:50:49 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.170795
(B2) Wed Oct 05 15:50:49 2016 :: Total pickle rendering time 0.190676
(F0) Wed Oct 05 15:50:51 2016 :: Retrieval of collectd.*.cpu-*.cpu-* took 0.888467
(F0) Wed Oct 05 15:50:53 2016 :: Total json rendering time 2.889416
Note that I believe something is causing a performance regression for you guys, but it's been very difficult to reproduce the conditions in a way that points precisely to where we should begin looking.
Apologies, I misunderstood the problem. I thought we were trying to track down a gap in time between when the backend retrieval finished and the rendering began. I now see that we're talking about how long retrieval takes on the frontend node for local metrics when clustering is enabled.
Regardless, I'm still not seeing any "regression" between master and 0.9.15. Both systems behave identically for the tests I've performed.
Thanks @obfuscurity for your tests. This shows the exact same behavior I noticed. The regression is very small, but the problem itself exists. The time differences between backend and frontend are, in my opinion, too high - in my case mostly a factor of 10. I don't know if you want to fix this in the next version, but currently it is impossible to use the graphite cluster as intended. I have to use the backends directly to bypass the performance issue.
@Kenterfie My results do not correlate with your findings. The Retrieval timer reported by the frontend represents the total (or at least, the time required to merge the required union) of all CLUSTER_SERVERS (and local) retrievals. Hence, it's only natural that disabling CLUSTER_SERVERS on the frontend results in drastically reduced query times (it's only retrieving locally).
This is what I find so frustrating... I'm unable to reproduce anything that resembles your or @reyjrar's results.
Further testing based on slack call with @obfuscurity :
https://gist.github.com/reyjrar/4c0ad075a528d4d68e14b8b1427a99be
Setup is 2 local nodes in each DC, 2 DCs.
Testing the effect of CLUSTER_SERVERS recursively
# Front with CLUSTER_SERVERS, Backend w/o
Thu Oct 06 14:39:07 2016 :: Retrieval of carbon.agents.*.updateOperations took 0.116185
Thu Oct 06 14:39:07 2016 :: Rendered PNG in 0.275612 seconds
Thu Oct 06 14:39:07 2016 :: Total rendering time 0.393449 seconds
# Front and back ends with CLUSTER_SERVERS set to each other's IPs
Thu Oct 06 14:37:25 2016 :: Retrieval of carbon.agents.*.updateOperations took 0.059602
Thu Oct 06 14:37:26 2016 :: Rendered PNG in 0.266713 seconds
Thu Oct 06 14:37:26 2016 :: Total rendering time 0.327662 seconds
These results weren't statistically significant.
I'm wondering if there's an optimization to be done, i.e. keep-alive or connection pooling? I can only venture a guess that the whole TCP setup/teardown to the CLUSTER_SERVERS is happening before the results are returned and/or merged.
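One way to test the TCP setup/teardown theory is to keep a persistent connection per backend and reuse it across requests. A minimal stdlib sketch of that idea - the pool class and host names are illustrative, not graphite-web's actual remote_storage code:

```python
import http.client

class ConnectionPool:
    """Hand back one persistent HTTP connection per backend, so the
    TCP handshake happens once instead of on every intra-cluster
    find/fetch request."""

    def __init__(self):
        self._conns = {}

    def get(self, host, port):
        key = (host, port)
        if key not in self._conns:
            # HTTPConnection connects lazily, on the first request sent.
            self._conns[key] = http.client.HTTPConnection(host, port, timeout=5)
        return self._conns[key]

pool = ConnectionPool()
# Repeated requests to the same backend reuse the same connection
# object, so only the first one pays the setup cost.
a = pool.get("storage-1.example", 8010)
b = pool.get("storage-1.example", 8010)
assert a is b
```

If the ~1s frontend overhead shrank with connection reuse, that would point at connection setup rather than serialization as the culprit.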
So, I noticed this performance hit is real in /metrics/find as well, so I diverted my resources there. I think, and I might be wrong, that the caching in remote_storage.py is broken: I am never seeing a cached result.
I don't have MEMCACHED_HOSTS set, so my CACHES dict is empty. I am attempting to override that in local_settings.py with:
CACHES = {
'default': {
'BACKEND': 'django.core.cache.backends.db.DatabaseCache',
'LOCATION': 'caching',
}
}
Then I ran:
django-admin.py --pythonpath=<WEBAPP> --settings=graphite.settings createcachetable caching
This created the caching table in my database, but after several minutes of making requests, I'm not seeing anything in that table.
UPDATE: Non-issue, please ignore. Caching is irrelevant, and I had not imported the django core settings. Once I enabled that cache setting, caching worked, but again, this isn't the heart of the issue.
It would appear that the remote fetches are not happening in parallel, but sequentially. Note the results in the gist below: the first few (one for each of my CLUSTER_SERVERS) represent the initial connection setup; subsequent fetches reuse these connections and are much faster, but all of them still block in order.
@Kenterfie would you mind applying the datalib.py patch here and gisting your results?
https://gist.github.com/obfuscurity/9f5f6c59d700a6ec1ec41585bc8ecaf8
I don't think that is how it works, @obfuscurity. My understanding is that the loop starting with for node, results in fetches: serializes the consumption of the results of the fetch requests. Those will always happen in sequential order, and the first one (or few) should take longer than the rest, since they block on the remote store responding. To measure how long each remote fetch takes (along with the start/stop times), the instrumentation would need to be added to node.fetch() or within the fetches = [(node, node.fetch(startTime, endTime)) for node in matching_nodes if node.is_leaf] line, as that's where the remote request actually starts.
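A sketch of the per-node instrumentation being described - timing each node.fetch() call where it is issued, rather than in the loop that later consumes results. FakeNode is a stand-in for illustration, not graphite-web's real node class:

```python
import time

class FakeNode:
    """Stand-in for a graphite-web node; real nodes read from whisper
    files locally or from a remote store over HTTP."""
    def __init__(self, path, delay):
        self.path, self.delay, self.is_leaf = path, delay, True

    def fetch(self, start, end):
        time.sleep(self.delay)      # simulate I/O or network latency
        return (start, end, [])     # placeholder for (timeInfo, values)

def timed_fetches(matching_nodes, start, end):
    """Wrap each fetch in its own timer. The timer has to surround the
    fetch() call itself -- that is the moment the (possibly remote)
    request happens, so per-node timings belong here."""
    results = []
    for node in matching_nodes:
        if not node.is_leaf:
            continue
        t0 = time.time()
        data = node.fetch(start, end)
        results.append((node.path, time.time() - t0, data))
    return results

nodes = [FakeNode("a.b.c", 0.01), FakeNode("a.b.d", 0.02)]
for path, elapsed, _ in timed_fetches(nodes, 0, 60):
    print("Retrieval of %s took %.6f" % (path, elapsed))
```

With timers placed like this, staggered per-node run times (or the lack of them) would be directly visible in the log.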
The datalib.py code in master is rather different from the code in the 0.9.x tree, but the major parts I see are:
With regards to #1, I think that's a wash, because the subsequent fetch() will result in a local find() call anyway on the remote nodes, and that one should be cached and/or faster because the data would be in the kernel's disk cache. We could test it by bypassing those find calls, converting STORE.remote_stores into the datatype that the matching_nodes list contains.
For #2, I'm not sure where in the master code to reimplement that feature. Maybe in datalib.py:fetch_data() we could loop over matching_nodes, extract all metrics found, do the bulk carbonlink query, and pass that cached data to every WhisperReader.fetch(), as an example?
For #3, I don't know enough about the implementation to know whether it would make a difference here.
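The bulk-carbonlink idea could look roughly like this: collect every leaf metric path first, make one carbonlink round trip for all of them, and hand each reader its slice. Everything here is a sketch - carbonlink_query is an assumed bulk-query callable and FakeNode a stand-in, not the real CarbonLinkPool or node API:

```python
class FakeNode:
    """Stand-in for graphite-web's node objects."""
    def __init__(self, path, is_leaf=True):
        self.path, self.is_leaf = path, is_leaf

def bulk_cache_fetch(matching_nodes, carbonlink_query):
    """Collect all leaf metric paths up front, issue ONE cache query
    for the whole set, and return cached datapoints keyed by path --
    instead of one carbonlink round trip per metric."""
    metrics = [n.path for n in matching_nodes if n.is_leaf]
    cached = carbonlink_query(metrics)          # single round trip
    return {m: cached.get(m, []) for m in metrics}

# Fake bulk query: pretend the cache holds points for one metric only.
def fake_query(metrics):
    return {"a.cpu": [(60, 0.5)]}

nodes = [FakeNode("a.cpu"), FakeNode("a.mem"), FakeNode("a", is_leaf=False)]
cache_results = bulk_cache_fetch(nodes, fake_query)
# Each reader's fetch() could then be handed cache_results[path].
print(cache_results)   # {'a.cpu': [(60, 0.5)], 'a.mem': []}
```

The win is that slow or timing-out carbonlink lookups are paid once per request rather than once per whisper file.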
@cbowman0 I absolutely agree that the per-node fetches are serialized. However, if you look at my logs, it would appear that all of the fetches are serial according to the timestamps (one does not start until the previous has finished). I have a hard time believing that none of these fetches would have staggered run times. Please correct me if you think I'm misreading the output.
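If the serial behavior is confirmed, one obvious direction is issuing the remote fetches concurrently and only merging serially. A sketch with concurrent.futures - this is an illustration of the idea with a stand-in fetch function, not the patch under discussion:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_fetch(host, delay):
    """Stand-in for a remote fetch to one CLUSTER_SERVERS backend."""
    time.sleep(delay)
    return (host, delay)

backends = [("b0", 0.2), ("b1", 0.2), ("b2", 0.2)]

# Serial: total time is roughly the SUM of per-backend latencies.
t0 = time.time()
serial = [slow_fetch(h, d) for h, d in backends]
serial_elapsed = time.time() - t0

# Parallel: total time is roughly the SLOWEST single backend.
t0 = time.time()
with ThreadPoolExecutor(max_workers=len(backends)) as pool:
    parallel = list(pool.map(lambda b: slow_fetch(*b), backends))
parallel_elapsed = time.time() - t0

assert serial == parallel   # same results, different wall-clock cost
print("serial %.2fs vs parallel %.2fs" % (serial_elapsed, parallel_elapsed))
```

With three backends at ~0.2s each, the serial path costs roughly 0.6s while the parallel path costs roughly 0.2s - the same shape as the frontend-vs-backend gap reported in this thread.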
Configuration
OS: Ubuntu 14.04
Webserver: nginx
WSGI: uwsgi
Graphite: graphite (latest git)
2 Graphite storage servers + 1 Graphite webapp render host (JSON output only)
Issue
We save our data on two different graphite storage hosts via statsite + carbon-c-relay. Everything is working well, but one problem we notice is that every request takes very long on the webapp render host.
Rendering performance
A small request needs >1000ms on the webapp render host, but the actual rendering on the storage host needs only ~15ms. What does the graphite webapp do with the other ~980ms on the render host?
When we call the same render request via wget from the render host, it takes only 180ms, not 1000ms. Has anyone else noticed render performance issues like this? I haven't found anything in graphite so far that would cause it.