influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0
28.57k stars 3.53k forks source link

fill(previous) should find most recent value, even if outside query time range #6878

Open beckettsean opened 8 years ago

beckettsean commented 8 years ago

Feature Request

Proposal: [Description of the feature]

When executing fill(previous) the query should always have a value for previous, even if there is no point with that field in the query time range.

Current behavior: [What currently happens]

> select * from fp
name: fp
--------
time            value
2016-06-20T16:09:13Z    10
2016-06-20T16:19:13Z    100

> select max(value) from fp where time > now() - 20m group by time(5m) fill(previous)
name: fp
--------
time            max
2016-06-20T16:05:00Z    10
2016-06-20T16:10:00Z    10
2016-06-20T16:15:00Z    100
2016-06-20T16:20:00Z    100
2016-06-20T16:25:00Z    100

> select max(value) from fp where time > now() - 18m group by time(5m) fill(previous)
name: fp
--------
time            max
2016-06-20T16:10:00Z    
2016-06-20T16:15:00Z    100
2016-06-20T16:20:00Z    100
2016-06-20T16:25:00Z    100

Note the null value for the 16:10-16:15 bucket, despite there being a point at 16:09 with a value.

Desired behavior: [What you would like to happen]

> select * from fp
name: fp
--------
time            value
2016-06-20T16:09:13Z    10
2016-06-20T16:19:13Z    100

> select max(value) from fp where time > now() - 20m group by time(5m) fill(previous)
name: fp
--------
time            max
2016-06-20T16:05:00Z    10
2016-06-20T16:10:00Z    10
2016-06-20T16:15:00Z    100
2016-06-20T16:20:00Z    100
2016-06-20T16:25:00Z    100

> select max(value) from fp where time > now() - 18m group by time(5m) fill(previous)
name: fp
--------
time            max
2016-06-20T16:10:00Z    10
2016-06-20T16:15:00Z    100
2016-06-20T16:20:00Z    100
2016-06-20T16:25:00Z    100

Use case: [Why is this important (helps with prioritizing requests)]

Currently customers have to know when the last value was recorded in order to make sure that point is included in the time range. For irregular series that's a significant burden. If the system can always find the most recent value regardless of the lower time bound, then many state change queries become useful.

jwheeler-gs commented 8 years ago

Are there any good work-arounds for this right now? I'm collecting sparse data and trying to graph it using fill(previous), which results in the first several values being null because the previous value falls outside the desired query time range.

The only thing I can think to do right now is to execute a second query to get last(field) with the time ending at the start time of the above query, then use that result to fill in the null values.

beckettsean commented 8 years ago

@jwheeler-gs that's the best workaround for now

jsternberg commented 8 years ago

I have a crude solution to this in the branch for #5943. The proper solution to this requires #5943 to be implemented, but the crude solution will be good enough for 1.0 if I can't get the proper solution working in time.

beckettsean commented 8 years ago

what happens if fill(previous) is used but there was no data in any interval? Presumably we could pull the tag set for the previous point and use that? Specifically wondering about https://github.com/influxdata/influxdb/issues/6967

jsternberg commented 8 years ago

I... have no idea. I will have to get back to you on that.

jwheeler-gs commented 8 years ago

If there was no previous data, wouldn't it make sense to just return null until the first value is encountered? That would be the same as the current behavior but only in the case where there are no previous values.

jsternberg commented 8 years ago

@jwheeler-gs yes, that is the current behavior and would be for this too. The issue is what happens when there is no data in that interval at all. Right now, fill will not fill a series that doesn't exist in the interval. @beckettsean was asking what would happen if fill(previous) was used and there was no data in the interval, but there was data in the past at some point for that series.

jwheeler-gs commented 8 years ago

Aaah, I get it now. That's actually going to be a possible case for my use as well. I'd expect (hope for) fill to return that previous value for the entire interval. But then again, I'd also expect it to not return any data past the current point in time.

I'm using fill to provide data directly to a chart which can be adjusted to show any time window, past, present, or even future (where future data is filled in using prediction data from another dataset). What I really want is to see the previous value up to some specified point (the present timestamp) and then no data past that. This is probably a bit much to ask of influx and is outside the scope of what it really needs to provide. That can still be handled easily enough on the receiving end by nulling out all values past the current timestamp.

retorquere commented 8 years ago

Would it be possible to add fill(previous) for non-grouped queries? Something like

select GPS.latitude, GPS.longitude, Weather.temperature from car where time >=... and time <=... fill(previous)

Weather.temperature is written to the DB sparsely, and the reader must assume that as long as no new temperature is reported, the previous value holds.

I know I can do

select GPS.latitude, GPS.longitude, Weather.temperature from car where time >=... and time <=... group by time(1s) fill(previous)

but that slows down the request considerably, and I get loads of empty lines for times in the time range where there are no measurements.

beckettsean commented 8 years ago

@retorquere the issue with having fill(previous) with no GROUP BY time() clause is that the system doesn't know when to fill. Should it return one point per nanosecond? Per second? A group by interval is needed to create a regular time series, so that the "missing" points are clear.

retorquere commented 8 years ago

@beckettsean I will defer to your superior knowledge on the matter of course, but conceptually, I'd figure it would return exactly the same points as with a regular non-grouped selected, just with the nulls filled in by the value in the column in one of the rows already selected.

beckettsean commented 8 years ago

@retorquere InfluxDB does not store nulls. There are no nulls returned in a non-grouped query.

retorquere commented 8 years ago

If I submit this however:

curl -G 'http://localhost:8086/query?pretty=true' --data-urlencode "db=mydb" --data-urlencode "q=select GPS.latitude, Weather.temperature from car where time >= '2015-10-13T14:16:13Z' limit 10"

I get this (where I'd love for there to be a way to have those nulls be replaced by 5)

{
    "results": [
        {
            "series": [
                {
                    "name": "car",
                    "columns": [
                        "time",
                        "GPS.latitude",
                        "Weather.temperature"
                    ],
                    "values": [
                        [
                            "2015-10-13T14:16:14Z",
                            51.9893696,
                            5
                        ],
                        [
                            "2015-10-13T14:16:15Z",
                            51.9893696,
                            null
                        ],
                        [
                            "2015-10-13T14:16:16Z",
                            51.9893696,
                            null
                        ],
                        [
                            "2015-10-13T14:16:17Z",
                            51.9893696,
                            null
                        ],
                        [
                            "2015-10-13T14:16:18Z",
                            51.9893696,
                            null
                        ],
                        [
                            "2015-10-13T14:16:19Z",
                            51.9893696,
                            null
                        ],
                        [
                            "2015-10-13T14:16:20Z",
                            51.9893696,
                            null
                        ],
                        [
                            "2015-10-13T14:16:21Z",
                            51.9893696,
                            null
                        ],
                        [
                            "2015-10-13T14:16:22Z",
                            51.9893696,
                            null
                        ],
                        [
                            "2015-10-13T14:16:23Z",
                            51.9893696,
                            5
                        ]
                    ]
                }
            ]
        }
    ]
}
beckettsean commented 8 years ago

That's an interesting use case, where one field is more densely populated than another. It might make sense to have fill(previous) in that case. Can you open a feature request https://github.com/influxdata/influxdb/issues/new describing that use case?

On Thu, Aug 25, 2016 at 1:10 PM, retorquere notifications@github.com wrote:

If I submit this however:

curl -G 'http://localhost:8086/query?pretty=true' --data-urlencode "db=mydb" --data-urlencode "q=select GPS.latitude, Weather.temperature from car where time >= '2015-10-13T14:16:13Z' limit 10"

I get this (where I'd love for there to be a way to have those nulls be replaced by 5)

{ "results": [ { "series": [ { "name": "car", "columns": [ "time", "GPS.latitude", "Weather.temperature" ], "values": [ [ "2015-10-13T14:16:14Z", 51.9893696, 5 ], [ "2015-10-13T14:16:15Z", 51.9893696, null ], [ "2015-10-13T14:16:16Z", 51.9893696, null ], [ "2015-10-13T14:16:17Z", 51.9893696, null ], [ "2015-10-13T14:16:18Z", 51.9893696, null ], [ "2015-10-13T14:16:19Z", 51.9893696, null ], [ "2015-10-13T14:16:20Z", 51.9893696, null ], [ "2015-10-13T14:16:21Z", 51.9893696, null ], [ "2015-10-13T14:16:22Z", 51.9893696, null ], [ "2015-10-13T14:16:23Z", 51.9893696, 5 ] ] } ] } ] }

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/influxdata/influxdb/issues/6878#issuecomment-242504070, or mute the thread https://github.com/notifications/unsubscribe-auth/AGPAcegxWGsVX6UsH3lBkSNnty064F4Gks5qjei9gaJpZM4I54m9 .

Sean Beckett Director of Support and Professional Services InfluxDB

richardeaxon commented 7 years ago

+1 for this feature.

njbuch commented 7 years ago

+1 for this feature!

titilambert commented 7 years ago

Yes please ! This could be release useful !

jeremyleijssen commented 7 years ago

Any updates on this? or anyone know a workaround or something?

Currently needing this function

yellowpattern commented 7 years ago

+1 for this feature. Matter of fact, I thought "fill(previous)" and "fill(linear)" would do the job already.

solars commented 6 years ago

+1 this would be very convenient if you have sensors that report only on change

iotgopher commented 6 years ago

+1 very much needed

wonderful123 commented 6 years ago

It doesn't make sense to returns nulls with fill(previous) until it finds data within the first groupby range.

NateZimmer commented 6 years ago

Super useful for 'Change Of Value' systems. If the state rarely changes, and one looks at the query at any given time, odds are there is no datapoints within that query. Kinda cripples plotting this stuff.

solars commented 6 years ago

open since 20 Jun 2016 for such a ridiculously basic feature when talking about time series...

iotgopher commented 6 years ago

Any Time series database should have this, really a basic concept

milannnnn commented 6 years ago

Really a core function for and time series data logging. Should be implemented as soon as possible.

kevingarman commented 6 years ago

+1

philomatic commented 6 years ago

+1

This is really needed, it's a little bit ridiculously sad that there isn't even a proper workaround.

Is there a possibility to put a bounty on this?

Kortenbach commented 6 years ago

I think there are 2 cases to consider: 1- Logging is done at regular time intervals and the query covers >1 of these intervals 2- Logging is done at irregular time intervals

In case#1 if data is missing then the datasource was unavailable. Query's for that time period should probably return NIL or something similar. In case #2 if data is missing then you cannot know what is going on. Either the datasource was not available or there simply wasn't any data produced in the requested period of time.

I think that the system behaviour of case #2 should be well defined before solving this problem. How can you be sure the datasource was unavailable or not? Need a datasource health inducator (and link every series to a datasource)?

Kortenbach commented 6 years ago

I am logging data "on change", so this is a BIG issue for me. I'm directly interacting with InfluxDB, so I have some room for workarounds. Possible workarounds (I can think of): 1- Insert value into database just after starting point of query. Delete after query is done. (will this impact database performance in the long run because of repeated inserts/deletes?) 2- Shift start time of query backwards to a point where the data is available. Delete extra samples (new start upto start) from JSON result. (I personally make sure that every day at 0:00 ALL series are logged once) 3- Find and read the previous sample and use fill() to insert it (you will run into another fill issue with this approach.) 4- Like 3 but manipulate the JSON result manually to emulate the fill.

What do you use as a work-around??

yellowpattern commented 6 years ago

When you are directly interacting with the database, maybe some of the above workarounds might work but when your experience with the database is through another layer (such as grafana), some of the workarounds that you mention make no sense at all.

Kortenbach commented 6 years ago

@yellowpattern : I agree. I am directly interacting with the database. I will include that in my original post. The idea behind my post is to find out what people are using as a workaround cuz I'm really struggeling with this "bug".

macrosak commented 6 years ago

@Kortenbach we are using approach (4) which allows us to execute the last query (for finding the last value before the start of query) and the fill query at the same time.

Kortenbach commented 6 years ago

@macrosak Thank you for you reply. Good to hear that people are actually implementing workarounds! I'm currently working on approach (2). I have a checkbox that can switch the workaround on or off... I hope someone comes up with a permanent solution soon, cuz this doesn't feel right.

ronomal commented 6 years ago

I expected this to be the default behavior for fill(previous) and fill(linear). I'm querying a dozen fields where only changes are stored, the workarounds really aren't ideal, it would be great to see this feature implemented.

yellowpattern commented 6 years ago

Where I get bitten by this bug most is when I use a query that has "difference()" or "derivative()" in the query.

For queries beyond a certain time span, rather than get a delta that is ~1k, I get a first delta that is ~14 million (or rather, the query uses "0" from outside the time window, does a difference between that and the first value in the time window, gets 14,000,000.)

I'm getting so pissed off with this that I'm thinking of dropping influxdb for something else.

narciero commented 6 years ago

In reference to an answer i posted on #6967 - I think the proposed improvement to the fill() function could solve this issue with an API like:

fill(<fill_option>[,<missing_option>])

Where <missing_option> is one of:

So with fill(previous, 'all') - it would grab the last value for all tags (even if they are outside the query window).

SystemKeeper commented 5 years ago

+1

oJshua commented 5 years ago

+1

henfri commented 5 years ago

Hello,

i have the impression that this will not be fixed... -although it seems really basic. But is there a workaround?

Greetings, Hendrik

mudcat commented 5 years ago

Has there been any update to this? I am trying to display data that only gets written when it changes. Some of the values can go 12 hours without a change, so I am trying to avoid forcing it to write every minute or something since that would increase the volume of data recorded by 700 times.

wonderful123 commented 5 years ago

I had to make another query to get the last value.

On Tue, 9 Apr 2019, 10:19 pm mudcat, notifications@github.com wrote:

Has there been any update to this? I am trying to display data that only gets written when it changes. Some of the values can go 12 hours without a change, so I am trying to avoid forcing it to write every minute or something since that would increase the volume of data recorded by 700 times.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/influxdata/influxdb/issues/6878#issuecomment-481227340, or mute the thread https://github.com/notifications/unsubscribe-auth/AejRqazqw1RPyuINoiLEn7p3bw_Mzd6wks5vfIU1gaJpZM4I54m9 .

alprestedge commented 5 years ago

+1

What workarounds are folks using out of interest?

matejsp commented 5 years ago

We created a wrapper for series class that is solving a couple of quirks in influxdb. It is fixing the problematic naming last_last_last_last_measument when doing continues query. And combining the results of original and last value query. This wrapper is taking last point if series is missing data. Extra queries have of course some performance implications. On top of that we are already having performance issues with single node installation (writing 1k metrics per second).

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stafbulp commented 4 years ago

+1

Kortenbach commented 4 years ago

I still feel that this is a bug that should be fixed. Why the wontfix label?

retorquere commented 4 years ago

wontfix is just automatically applied when stalebot wakes up. Although it seems correct, descriptively, that it is applied? The issue has stood open over 3 years with no activity on the influx side.

yellowpattern commented 4 years ago

It kind of makes you wonder what sort of priority Influxdb actually give fixing significant bugs.

Kortenbach commented 4 years ago

I would like to know when this will be fixed. Anyone from influx care to join?