go-graphite / graphite-clickhouse

Graphite cluster backend with ClickHouse support
MIT License
213 stars 52 forks source link

Internal aggregation is using incorrect from and until unix timestamps when maxDataPoint is used #248

Closed iankosen closed 1 year ago

iankosen commented 1 year ago

I am noticing a potential problem with the values that are used for from and until when maxdatapoints is provided. Below is the query I am running and my analysis of what is happening.

I am sending the query from carbon-api. The query is: http://localhost:8081/render?format=json&target=metric_name&from=-30d&maxDataPoints=1 The From and Until values resolve to 1694837820(Sat Sep 16 2023 04:17:00 GMT+0000) and 1697429820(Mon Oct 16 2023 04:17:00 GMT+0000) respectively.

In this section we recompute from and until so that it's divisible by step https://github.com/go-graphite/graphite-clickhouse/blob/1db4b3d740c0f3e72ed8a3449603f4a016b5dda8/render/data/query.go#L349.

After the calculation is done for the graphite query above the from and until values are updated to 1695168000(Wed Sep 20 2023 00:00:00 GMT+0000) and 1697759999(Thu Oct 19 2023 23:59:59 GMT+0000) respectively. The value of step is 2592000(one month).

From the above we can see that the from and until values are off by about 4days. This leads to incorrect values being aggregated. The problem gets worse as the range between from and until increases.

Felixoid commented 1 year ago

The only reason for this I see it's the logical error with the storage retention. Data point precision must be 1day or less

precision – How precisely to define the age of the data in seconds. Should be a divisor for 86400 (seconds in a day).

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/graphitemergetree#patterns

ikosenn commented 1 year ago

The retention policy I am using is:

<clickhouse>
        <pattern>
            <regexp>regex_exp</regexp>
            <function>sum</function>
            <retention>
                <age>0</age>
                <precision>15</precision>
            </retention>
            <retention>
                <age>108000</age>
                <precision>30</precision>
            </retention>
            <retention>
                <age>3456000</age>
                <precision>300</precision>
            </retention>
            <retention>
                <age>8640000</age>
                <precision>1800</precision>
            </retention>
            <retention>
                <age>34128000</age>
                <precision>3600</precision>
            </retention>
        </pattern>
</clickhouse>

All the precision values are less than a day and are a divisor for 86400.

Felixoid commented 1 year ago

Then we need to investigate this place https://github.com/go-graphite/graphite-clickhouse/blob/1db4b3d740c0f3e72ed8a3449603f4a016b5dda8/render/data/query.go#L147C9-L147C9

It calculates the common step before the from/until is set, so let's add some trace logging there.

Also, what does the query in CH return? SELECT * FROM system.graphite_retentions

Do you use auto for rollup-config?

Felixoid commented 1 year ago

I've totally missed the initial query, sorry.

http://localhost:8081/render?format=json&target=metric_name&from=-30d&maxDataPoints=1

You get what you requested. Using maxDataPoints=100 will change it to another step.

ikosenn commented 1 year ago

Then we need to investigate this place https://github.com/go-graphite/graphite-clickhouse/blob/1db4b3d740c0f3e72ed8a3449603f4a016b5dda8/render/data/query.go#L147C9-L147C9

It calculates the common step before the from/until is set, so let's add some trace logging there.

Also, what does the query in CH return? SELECT * FROM system.graphite_retentions

Do you use auto for rollup-config?

I use rollup-config = auto. I dont think the rollup config has an issue. The rStep it gets here is always correct.

I've totally missed the initial query, sorry.

http://localhost:8081/render?format=json&target=metric_name&from=-30d&maxDataPoints=1

You get what you requested. Using maxDataPoints=100 will change it to another step.

That is correct. Calculating the step works as expected. The only issue I have is the update it does to the from and until dates when maxDataPoints is used. You can see from the first message the from and until values for the query are 4 days ahead after setFromUntil is executed. The issue is more obvious when Until - From is large (recommend at least 30days) and maxDataPoints is small e.g 1

c.From = 1694837820
c.Until = 1697429820
c.step = 2592000
# computed from and until in setFromUntil function will be:
c.from = 1695168000  # (four days after original c.From)
c.until = 1697759999 # (four days after original c.Until)

This causes the clickhouse to do an aggregation over the wrong time range.

Felixoid commented 1 year ago

It's aligned as expected. Totally correct result.

In [4]: 1694837820 // 2592000
Out[4]: 653  # the timestamp 1692576000 is out of the range originalFrom:originalUntil

In [5]: 2592000 * 654  # to get the ANY point after the original From
Out[5]: 1695168000

In [6]: 2592000 * 655
Out[6]: 1697760000  #  that's Until+1, to not get the second point

The purpose of the maxDataPoints is to limit the points. So the step, from, and until are calculated accordingly.

If you feel it's broken, it's not.

Felixoid commented 1 year ago

I can continue to describe how it works and why, but the behavior itself is correct and aligned between graphite-clickhouse and carbonapi

The calculation of step/from/until worked like this for 3 years and was reviewed by other maintainers. And it's tested by many people and companies. Changing it will break the world.