influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0
28.58k stars 3.53k forks

`range()` does not filter correctly depending on year #24669

Closed paulheg closed 1 month ago

paulheg commented 6 months ago

Steps to reproduce:

I have a dataset where consecutive points are years apart. The bucket was created in version 2.0.7 with a shard group duration of 1000w; I am now on version 2.7.5.

1. The following query works as expected:

```flux
rx = /(Apfel)$/

from(bucket: "common_two_grams")
    |> range(start: 1971-01-01, stop: 2020-01-01)
    |> filter(fn: (r) => r["_measurement"] == "frequency")
    |> filter(fn: (r) => r["_field"] == "match_count")
    |> filter(fn: (r) => r["w1"] =~ rx or r["w2"] =~ rx)
    |> group(columns: ["_time"])
    |> sum()
    |> group()
```

2. The following query does not work as expected, **notice that only the start year changed from 1971 to 1970**.
```flux
rx = /(Apfel)$/

from(bucket: "common_two_grams")
    |> range(start: 1970-01-01, stop: 2020-01-01)
    |> filter(fn: (r) => r["_measurement"] == "frequency")
    |> filter(fn: (r) => r["_field"] == "match_count")
    |> filter(fn: (r) => r["w1"] =~ rx or r["w2"] =~ rx)
    |> group(columns: ["_time"])
    |> sum()
    |> group()
```

Expected behaviour: This is the correct data I am expecting for the first query:

```
#group,false,false,false,false
#datatype,string,long,dateTime:RFC3339,unsignedLong
#default,_result,,,
,result,table,_time,_value
,,0,1971-11-29T23:00:00Z,9940
,,0,1972-11-29T23:00:00Z,9080
[....]
,,0,2018-11-29T23:00:00Z,69394
```

Actual behaviour: The second query does not filter by time correctly; the complete time series is now in the result. Notice how it begins in 1725:

```
#group,false,false,false,false
#datatype,string,long,dateTime:RFC3339,unsignedLong
#default,_result,,,
,result,table,_time,_value
,,0,1725-11-29T23:06:32Z,42
,,0,1726-11-29T23:06:32Z,28
,,0,1727-11-29T23:06:32Z,71
,,0,1728-11-29T23:06:32Z,49
,,0,1729-11-29T23:06:32Z,81
[...]
,,0,2015-11-29T23:00:00Z,87953
,,0,2016-11-29T23:00:00Z,90141
,,0,2017-11-29T23:00:00Z,74672
,,0,2018-11-29T23:00:00Z,69394
```

When adding `aggregateWindow()` to the query, the range seems to work again:

```flux
rx = /(Apfel)$/

from(bucket: "common_two_grams")
    |> range(start: 1970-01-01, stop: 2020-01-01)
    |> filter(fn: (r) => r["_measurement"] == "frequency")
    |> filter(fn: (r) => r["_field"] == "match_count")
    |> filter(fn: (r) => r["w1"] =~ rx or r["w2"] =~ rx)
    |> aggregateWindow(every: 1y, fn: last)
    |> group(columns: ["_time"])
    |> sum()
    |> group()
```

Environment info:

Running InfluxDB 2.7.5 inside a Docker container.

philjb commented 6 months ago

I can confirm the behavior in 2.7. I believe it is related to the fact that a Unix timestamp of zero corresponds to 1970-01-01T00:00:00Z. Timestamps before that are negative, and it appears those are not handled correctly in the Flux query (or its execution).

If it is an issue with flux, you might be able to work around it with influxql. (I checked - influxql does better.)

We would welcome a pr fixing the issue if you are willing.

paulheg commented 6 months ago

I just checked whether it's an issue with Flux by modifying `range_test.flux`, and it does not seem to be the case.

```flux
package universe_test

import "testing"
import "csv"

option now = () => 2030-01-01T00:00:00Z

inData =
    "
#datatype,string,long,dateTime:RFC3339,double,string,string,string,string
#group,false,false,false,false,true,true,true,true
#default,_result,,,,,,,
,result,table,_time,_value,_field,_measurement,cpu,host
,,0,1701-01-01T00:00:01Z,0,usage_guest,cpu,cpu-total,host.local
,,0,1969-01-01T00:00:01Z,0,usage_guest,cpu,cpu-total,host.local
,,0,1970-01-01T00:00:01Z,0,usage_guest,cpu,cpu-total,host.local
,,0,1971-01-01T00:00:01Z,0,usage_guest,cpu,cpu-total,host.local
"
outData =
    "
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,double,string,string,string,string
#group,false,false,true,true,false,false,true,true,true,true
#default,_result,,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,cpu,host
,,0,1970-01-01T00:00:00Z,2030-01-01T00:00:00Z,1970-01-01T00:00:01Z,0,usage_guest,cpu,cpu-total,host.local
,,0,1970-01-01T00:00:00Z,2030-01-01T00:00:00Z,1971-01-01T00:00:01Z,0,usage_guest,cpu,cpu-total,host.local
"

testcase range {
    got =
        csv.from(csv: inData)
            |> testing.load()
            |> range(start: 1970-01-01)
    want = csv.from(csv: outData)

    testing.diff(got, want)
}
```
paulheg commented 6 months ago

Hi @philjb, as far as I understand, Flux is doing the filtering; I found the Go code for that in the flux repository. The test from my last comment suggests something may be wrong in the database rather than in the query language.

Can you give me a general direction where to search for the possible issue in the influxdb repository?

Best regards Paul

philjb commented 6 months ago

I'd recommend reducing your flux query to the smallest query that shows the behavior - for example, is the aggregate window essential to reproducing it? Then I'd look at the implementations of the storage interface, `type Store interface`. All the requests include a `Range *TimestampRange`, which stores the query time range. I suspect there is special (and wrong) handling when the range start time is zero.

paulheg commented 6 months ago

Quick update: I just used FluxQL to select data between 1820 and 1830, and this works fine.

paulheg commented 6 months ago

Hi @philjb, I think I found the issue. My only problem now is that I'm unable to build the project because of dependencies from the flux repository, mainly because https://github.com/SAP/go-hdb v.0 does not seem to exist anymore.

I have a separate issue already opened: https://github.com/influxdata/flux/issues/5462. Sadly, I can't verify whether my fix is correct.

paulheg commented 6 months ago

I was able to verify the issue by running the binary on my local machine. I'm not sure why this check was done, but it seems to be the cause of the issue.

philjb commented 6 months ago

@paulheg - were you able to build a binary to test with locally? I can't tell if you were able to build based on your last comment.

Can you add a test for the scenario to your pr?

paulheg commented 6 months ago

@philjb I was able to build it locally, run it, and verify.

Just found a workaround:

```shell
GOPROXY=https://proxy.golang.org/cached-only go mod download github.com/SAP/go-hdb@v0.14.1
```

I'll have a look, but I can't think of a useful test for this scenario right now. Perhaps one that checks that the value I passed does not get modified?
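Such a test could simply assert that a range start of exactly zero survives unmodified. A minimal sketch (`passThrough` is a hypothetical stand-in for whatever code path applies the range):

```go
package main

import "fmt"

// passThrough stands in for the fixed code path: the range start must not be
// rewritten, even when it is exactly zero (1970-01-01T00:00:00Z).
func passThrough(start int64) int64 {
	return start
}

func main() {
	// Pre-epoch, epoch, and post-epoch starts must all survive unchanged.
	for _, start := range []int64{-1, 0, 1} {
		fmt.Println(passThrough(start) == start)
	}
}
```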

paulheg commented 6 months ago

> Can you add a test for the scenario to your pr?

There is now a test in the pr.