graphite-project / graphite-web

A highly scalable real-time graphing system
http://graphite.readthedocs.org/
Apache License 2.0

[Q] Performance is slower when using consolidateBy function and maxdatapoints #2704

Closed viknet365 closed 3 years ago

viknet365 commented 3 years ago

Describe the bug When querying a graphite metric over 150d, I was expecting the consolidateBy function and the maxDataPoints option to speed up the query. Both approaches failed.

To Reproduce

When decreasing the number of data points, the request takes longer. When trying to improve performance by using the consolidateBy function with 'last', the results are equivalent:

root@nuc:~$ time wget   --post-data="target=openhab.consommeHP_Fours&from=-150d&until=now&format=json&maxDataPoints=1000"  http://192.168.2.200/render
--2021-05-24 14:14:56--  http://192.168.2.200/render
Connecting to 192.168.2.200:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 25924 (25K) [application/json]
real    0m0,804s
user    0m0,001s
sys     0m0,009s

root@nuc:~$ time wget   --post-data="target=openhab.consommeHP_Fours&from=-150d&until=now&format=json&maxDataPoints=100"  http://192.168.2.200/render
--2021-05-24 14:15:01--  http://192.168.2.200/render
Connecting to 192.168.2.200:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 2820 (2.8K) [application/json]
real    0m2,017s
user    0m0,006s
sys     0m0,005s

root@nuc:~$ time wget   --post-data="target=openhab.consommeHP_Fours&from=-150d&until=now&format=json&maxDataPoints=10"  http://192.168.2.200/render
--2021-05-24 14:15:06--  http://192.168.2.200/render
Connecting to 192.168.2.200:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 404 [application/json]
real    0m9,091s
user    0m0,009s
sys     0m0,001s

root@nuc:~$ time wget   --post-data="target=consolidateBy(openhab.consommeHP_Fours%2C'last')&from=-150d&until=now&format=json&maxDataPoints=1000&"  http://192.168.2.200/render
--2021-05-24 14:15:34--  http://192.168.2.200/render
Connecting to 192.168.2.200:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 23727 (23K) [application/json]
real    0m0,921s
user    0m0,004s
sys     0m0,009s

root@nuc:~$ time wget   --post-data="target=consolidateBy(openhab.consommeHP_Fours%2C'last')&from=-150d&until=now&format=json&maxDataPoints=100&"  http://192.168.2.200/render
--2021-05-24 14:15:40--  http://192.168.2.200/render
Connecting to 192.168.2.200:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 2519 (2.5K) [application/json]
real    0m1,806s
user    0m0,001s
sys     0m0,010s

root@nuc:~$ time wget   --post-data="target=consolidateBy(openhab.consommeHP_Fours%2C'last')&from=-150d&until=now&format=json&maxDataPoints=10&"  http://192.168.2.200/render
--2021-05-24 14:15:46--  http://192.168.2.200/render
Connecting to 192.168.2.200:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 391 [application/json]
real    0m9,175s
user    0m0,001s
sys     0m0,010s

Expected behavior 1) I was expecting that reducing maxDataPoints would reduce the query time (I can understand this one, because Graphite has to calculate all the averages anyway, but it is still surprising). 2) I was expecting the consolidateBy function with 'last' or 'first' to be quicker than the default 'avg'. 3) I was expecting faster queries on this database.


Additional context I am using the following retention:

retentions =  30s:30d,60s:180d,120s:1095d,600s:3650d

and whisper-info on the wsp file returns this:

root@XXX:/var/lib/docker/volumes/graphite-storage/_data/whisper/openhab# whisper-info consommeHP_Fours.wsp
aggregationMethod: average
maxRetention: 315360000
xFilesFactor: 0.5
fileSize: 19915264

Archive 0
offset: 64
secondsPerPoint: 30
points: 86400
retention: 2592000
size: 1036800

Archive 1
offset: 1036864
secondsPerPoint: 60
points: 259200
retention: 15552000
size: 3110400

Archive 2
offset: 4147264
secondsPerPoint: 120
points: 788400
retention: 94608000
size: 9460800

Archive 3
offset: 13608064
secondsPerPoint: 600
points: 525600
retention: 315360000
size: 6307200
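
As a rough back-of-the-envelope check (my own arithmetic, not tool output): a 150d window is longer than the 30s:30d archive covers, so it should be served from the 60s:180d archive (Archive 1 above), which means the fetch has to read on the order of 216,000 points no matter what maxDataPoints is set to:

# rough point count for a -150d query against the retention above (sketch, not measured)
SECONDS_PER_DAY = 86400
window = 150 * SECONDS_PER_DAY   # 12,960,000 s
step = 60                        # Archive 1, since 30s:30d only covers 30 days
print(window // step)            # 216000 points fetched before any consolidation
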
deniszh commented 3 years ago

Hi @viknet365 ,

Unfortunately, neither consolidateBy nor maxDataPoints is designed to help with query performance. All functions and approximations are applied after all the data has been fetched, so they only add more work instead of reducing it.
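
Very roughly, the pipeline is "fetch everything first, consolidate afterwards". A simplified sketch of that idea (not the actual graphite-web code; fetch_all_points below is just a hypothetical stand-in returning dummy data):

# Simplified sketch of the fetch-then-consolidate flow described above.
# Not the real graphite-web code; fetch_all_points is a hypothetical stand-in.

def fetch_all_points(target, from_ts, until_ts, step=60):
    # stand-in for the backend fetch: it always reads the whole window,
    # one value per `step` seconds, regardless of maxDataPoints
    return [float(i) for i in range((until_ts - from_ts) // step)]

def consolidate(points, max_data_points, how='average'):
    # runs only after the full fetch, purely in memory
    values_per_point = max(1, len(points) // max_data_points)
    out = []
    for i in range(0, len(points), values_per_point):
        bucket = [p for p in points[i:i + values_per_point] if p is not None]
        if not bucket:
            out.append(None)
        elif how == 'last':
            out.append(bucket[-1])
        else:
            out.append(sum(bucket) / len(bucket))
    return out

points = fetch_all_points('openhab.consommeHP_Fours', 0, 150 * 86400)
print(len(points), len(consolidate(points, 10)))  # 216000 points read either way

So a smaller maxDataPoints never reduces the amount of data read; it only changes how much post-processing happens on top of it.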

viknet365 commented 3 years ago

Hi @deniszh, thank you for your answer. I understand that by design the full file is fetched into memory before doing any calculation, and I can see how in-memory calculation adds some overhead, but in this case it is a tenfold overhead (1s for 1000 points vs 10s for 10 points), and the time is the same when using 'avg' and 'first'/'last' (in the latter case it should only cherry-pick some values rather than compute anything).

Furthermore, if I use whisper-fetch (quite slow, it takes 2.2s) and awk to average the result (it takes 200ms to average all the lines on the same computer):

root@nuc:/var/lib/docker/volumes/graphite-storage/_data/whisper/openhab# time awk '{ total += $2; count++ } END { print total/count }' ./consommeHP_Fours.txt
691537

real    0m0,274s
user    0m0,256s
sys     0m0,017s

and we all agree that neither tool is designed for performance! There may be a good reason for such low performance in my use case (low CPU, the memory limits of Docker, or a combination of them), but I am quite puzzled...

viknet365 commented 3 years ago

The behaviour is the same with graphite 1.0.2-3 (4 years old), so it has not changed for many years (and is therefore probably not a bug). Still, I am very surprised by it. After trial and error I have set many of my Grafana maxDataPoints to 2500. I am removing the BUG tag, but if anyone has time to explain this behaviour I will be happy.

piotr1212 commented 3 years ago

I guess the consolidate function is not written with performance in mind. See https://github.com/graphite-project/graphite-web/blob/master/webapp/graphite/render/datalib.py#L100-L135 . The list operations could be done faster by just summing. Anyway, if you really care about performance you'd better check out the go-graphite stack.
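
Something along these lines (just a sketch of the idea, not a tested patch against datalib.py) would keep a running sum/count per bucket instead of buffering values in a list and re-scanning it:

# Sketch only: 'average' consolidation with a running sum/count per bucket,
# avoiding the intermediate buffer list. Not a drop-in replacement for datalib.py.
def consolidate_avg(values, values_per_point):
    total, count, used = 0.0, 0, 0
    for v in values:
        if v is not None:
            total += v
            count += 1
        used += 1
        if used == values_per_point:
            yield total / count if count else None
            total, count, used = 0.0, 0, 0
    if used:
        yield total / count if count else None

# example: 216000 raw points consolidated down to 10
print(list(consolidate_avg(range(216000), 21600)))

Whether that is actually the bottleneck here is another question, but it avoids building and slicing the temporary list.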

viknet365 commented 3 years ago

Thanks for the answer and the advice. I really like golang, but I was a bit skeptical about go-graphite, mainly because I was not sure it implements all the Graphite functions (there are tons, and keeping two projects like this synchronized is not easy). I will try it and report the performance results I obtain. In the meantime I tried hacking datalib.py to see where the bottleneck is. I first removed the use of the intermediate buffer buf[] and ended up changing:

def __iter__(self):
  if self.valuesPerPoint > 1:
    return self.__consolidatingGenerator( list.__iter__(self) )
  else:
    return list.__iter__(self)

by this

def __iter__(self):
  return list.__iter__(self)

completely removing the use of the __consolidatingGenerator function (we all understand it won't give meaningful data, it was just a test). The performance results are roughly the same (less than a 2s gain), so it seems the bottleneck is not in these two functions.
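
The next thing I may try is timing the raw Whisper read outside the webapp, to separate the file fetch itself from the Django/render overhead. Something like this (untested sketch, the path is just my own file):

# Untested sketch: time the raw Whisper read for the same 150d window,
# bypassing graphite-web entirely.
import time
import whisper  # the whisper module shipped with Graphite

path = '/var/lib/docker/volumes/graphite-storage/_data/whisper/openhab/consommeHP_Fours.wsp'
now = int(time.time())

t0 = time.time()
(start, end, step), values = whisper.fetch(path, now - 150 * 86400, now)
print('fetched %d points (step %ds) in %.3fs' % (len(values), step, time.time() - t0))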

viknet365 commented 3 years ago

Same hardware, same Docker setup, same wsp file, but using go-carbon instead of graphite-web:

root@nuc:~$ time wget   --post-data="target=openhab.consommeHP_Fours&from=-150d&until=now&format=json&maxDataPoints=10&"  http://192.168.2.200:81/render
--2021-05-25 16:24:50--  http://192.168.2.200:81/render
Connecting to 192.168.2.200:81… connected.
HTTP request sent, awaiting response… 200 OK
Length: 373 [application/json]
real    0m0,011s
user    0m0,007s
sys     0m0,003s
root@nuc:~$ time wget   --post-data="target=openhab.consommeHP_Fours&from=-150d&until=now&format=json&maxDataPoints=100&"  http://192.168.2.200:81/render
--2021-05-25 16:26:48--  http://192.168.2.200:81/render
Connecting to 192.168.2.200:81… connected.
HTTP request sent, awaiting response… 200 OK
Length: unspecified [application/json]
real    0m0,102s
user    0m0,001s
sys     0m0,009s
root@nuc:~$ time wget   --post-data="target=openhab.consommeHP_Fours&from=-150d&until=now&format=json&maxDataPoints=1000&"  http://192.168.2.200:81/render
--2021-05-25 16:26:55--  http://192.168.2.200:81/render
Connecting to 192.168.2.200:81… connected.
HTTP request sent, awaiting response… 200 OK
Length: unspecified [application/json]
real    0m0,101s
user    0m0,004s
sys     0m0,005s

I will see how it behaves in real-life use.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.