This also happened earlier today: the first query was stuck at 12:27 UTC, but it automatically started returning the missing data an hour or two later.
Now the "1d" query is returning results again for upto current (19:50 UTC). So it was stuck at 15:28 UTC for almost 4.5 hours... watching to see if it gets stuck again
Does GROUP BY perform any caching based on the time in the WHERE clause?
It started happening again at 21:56 UTC. I don't see anything in the logs that looks problematic...
@jhedlund just to be sure, could you provide the influxd logs?
Here you go: influxd.log.gz
It happened again a couple of times last night. Currently it is reading the data correctly to the minute, but I expect it to happen again...
It's happening again. Playing around with the queries a bit, it's interesting: time > now() - 1d stops with data at 14:16 UTC. If I do time > now() - 4h, it has data up to 16:18 UTC. If I do time > now() - 2h, I get up-to-the-minute data (16:57 UTC).
Looking again now, the previous comment still holds. 1d still stops at 14:16 UTC, 4h still stops at 16:18 UTC, and 2h is complete up to the latest minute (17:42 UTC). It seems like it's looking at some sort of cache that isn't being updated properly?
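(For reference, the comparison being described is between variants of the same GROUP BY query used elsewhere in this thread, where only the time range in the WHERE clause changes; the 1d form lags by hours while the 2h form is current:)

SELECT first(open) AS open, max(high) AS high, min(low) AS low, last(close) AS close FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)
SELECT first(open) AS open, max(high) AS high, min(low) AS low, last(close) AS close FROM prices."default".ohlc1m WHERE time > now() - 4h AND symbol = 'EURUSD' GROUP BY time(1m)
SELECT first(open) AS open, max(high) AS high, min(low) AS low, last(close) AS close FROM prices."default".ohlc1m WHERE time > now() - 2h AND symbol = 'EURUSD' GROUP BY time(1m)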
I have this problem too and as in the comment above it looks like a caching thing to me too. I run queries with different time based where clauses and get different results (data missing in all but the first one usually). If I go away for a while and come back the other where clauses sometimes work fine. Go away for a long time and they all work for older data but newer data sometimes has this problem again. I can't reliably reproduce it, but when it strikes it sticks for some time. I'll have one query that reliably shows that I have all the data (usually I'll do a sum() of everything by time(5m)), but any other query at all trying to show the same data will show nothing or more frequently just a tiny fraction of the expected data. Stopping and restarting influxd makes no difference, just have to wait and eventually it sorts itself out. Nothing unusual in the logs at all.
Agreed on stopping/starting; that doesn't help. It was my first attempt to fix the problem.
How long have you seen the problem? I had not seen this problem until upgrading to 0.10, but it took almost a week to show itself. It ran smoothly last week.
This might be upgrade issues similar to #5474.
@jhedlund @t-orb Can you both run show servers, show shards, and cat the contents of your meta/node.json file? If you could put it in a gist that would be great.
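(For anyone gathering the same diagnostics: the first two are InfluxQL statements run from the influx shell, and the node.json path below assumes a default package install, so adjust it for your layout.)

> show servers
> show shards
$ cat /var/lib/influxdb/meta/node.json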
https://gist.github.com/t-orb/134916765f84abf1885a
I can rule out an upgrade issue (at least for me), this is a very simple and clean "apt-get" install of 0.10.
I haven't had this problem today, so it may be related to loading a lot of data. Last week I was inserting about a week's worth of data as fast as I could go (a bit slower than I'd have liked); since then I'm only inserting about 200k datapoints every 5 minutes (in just 5 POSTs). I was seeing this problem after all the data had been loaded, though there may be a lot of internal work going on for quite a while after a large load.
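(For reference, batched writes of that kind are plain POSTs of line protocol to the HTTP write endpoint, roughly as below; the database, measurement, tags, and values are placeholders, not the poster's actual schema.)

curl -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu_load,host=server01 value=0.64 1455892500000000000
cpu_load,host=server02 value=0.72 1455892500000000000'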
Doesn't appear to be an upgrade issue from both of those gists.
I write quite a bit of data all the time, but this problem just started happening this week. I don't think it's writing any more data than usual...
Note: I have not yet upgraded my old data to tsm. I had previously tried twice, but I ran into an issue where a lot of data went missing (which I believe to be issue #5606).
I don't think the old database would be an issue here though since the missing "cached" (?) data is all stored in the new tsm anyway.
I may have found a clue. I do my own rollup queries from the tick table to the minute table (the minute table is where this problem is occurring). Normally I query the last 10 minutes from the tick table and push the result into the minute table. I forgot that I had temporarily set it to look at the last 120 minutes this week to backfill some missing data.
I've moved it back to 10 minutes and will check whether that recovers and stays recovered. The overwriting of the past 120 minutes of data could have been causing the data to not appear for some reason. (For my rollup query, the insert always overwrites the data for the past x minutes.)
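(For reference, a rollup of this kind is typically written as a SELECT ... INTO query; the tick measurement, field names, and retention policy below are placeholders based on the queries quoted elsewhere in this thread, not the poster's exact job.)

SELECT first(price) AS open, max(price) AS high, min(price) AS low, last(price) AS close INTO prices."default".ohlc1m FROM prices."default".tick WHERE time > now() - 10m GROUP BY time(1m), symbol

Re-running such a query for a window that was already rolled up rewrites the same series and timestamps, which is why the overwrite behavior mentioned above comes into play.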
Note: Previous comment did not do much, if anything.
This is continuing to cause production issues for me. Is there anything else I can do to "flush" the cache to force the data to be up to date? Reminder: Restarting influx does not help.
Really appreciate any help here, thanks.
@jhedlund Do you have any way for us to reproduce this locally? Can you provide some sample writes in the line protocol format that you're sending? #5751 is a fix for the cache related to writing duplicate points (same timestamp) for the same series, which may help here.
Is downloading the latest code and compiling it the only way for me to test whether #5751 fixes my issue?
My points eventually do show up, but I don't know what causes them to show up. I don't know whether that was the case in #5751, but the points are all there if I use a shorter time range in the WHERE clause.
Another potential difference here is that this only shows up as missing points when I use GROUP BY; the raw data points are there if I don't GROUP BY.
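(For reference, the raw-points check described above is just the same selection over the same range without aggregates or GROUP BY, e.g.:)

SELECT open, high, low, close FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD'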
@jhedlund Ok. #5751 does not seem relevant to your issue if you see the data without a group by.
When you see the points missing, does it make any difference if you run each aggregate as a separate query? So instead of:
SELECT first(open) AS open, max(high) AS high, min(low) AS low, last(close) AS close FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)
You run:
SELECT first(open) AS open FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)
SELECT max(high) AS high FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)
SELECT min(low) AS low FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)
SELECT last(close) AS close FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)
@jwilder I will try this the next time it's happening, which I suspect will not be until Monday at this point since the weekend is not submitting data.
I'll let you know, thanks.
I take that back, it is still happening.
Separate queries are not making a difference.
Right now with time > now() - 2d, the last data point is at 2016-02-18T12:05:00.
With time > now() - 1d, the data is current as of 2016-02-19T21:45:00.
So the now() - 2d query is missing about 2020 minutes of data compared to the now() - 1d query.
The exact same thing happens to me, and oddly enough it seems to be related to #5751. I have a couple of metrics this happens on that I intentionally duplicate because they are based on things happening in log files. Every 30s I grep (and a few 'other' things) some 2M lines of logs on ~16 servers for the last 2 minutes' worth of counts of things that happened, then push them into InfluxDB with rolled-up timestamps (i.e. at minute boundaries, so I can easily graph how many of a log message happen per minute without fighting with InfluxQL needlessly). I expect that every single metric is duplicated (occasionally actually updated) at least once. What is odd is that the data all 'seems' to be there; it's just that the most recent data is only viewable in a shorter time range. I would rather not poll InfluxDB to check whether a metric already exists at the correct value before posting it... but it does seem like that would fix it, because the only metrics this happens to are the ones that I know have duplicate data. I have only started testing these new expected-duplicate metrics, and so far I have just been DROP MEASUREMENT'ing the set to clear out this behavior.
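(For reference, a duplicate write of the kind described above is just a point re-posted to the same series with the same timestamp, which overwrites the earlier value; the measurement, tag, and field here are made up:)

log_errors,host=web01 count=12i 1456185600000000000
log_errors,host=web01 count=14i 1456185600000000000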
@randywallace Have you tested if #5751 fixes your issue?
I am overwriting data points, and probably quite often it's duplicate data, but it's only the last 10 points that I overwrite (about every minute).
I also noticed another clue this morning. When the data goes missing, it jumps backwards instantly; it's not a gradual process. For example, this morning at 7:25 ET the data was current, then at 7:28 ET the last couple of hours of data points disappeared, leaving me with a last data point of 5:39 ET.
Well I tried installing the nightly build but it failed to start with an unknown engine error:
[filestore] 2016/02/23 18:02:49 /var/lib/influxdb/data/_internal/monitor/16293/000000004-000000003.tsm (#0) opened in 646.672µs
[filestore] 2016/02/23 18:02:49 /var/lib/influxdb/data/_internal/monitor/16293/000000008-000000003.tsm (#1) opened in 446.411µs
[filestore] 2016/02/23 18:02:49 /var/lib/influxdb/data/_internal/monitor/16293/000000012-000000003.tsm (#2) opened in 479.614µs
[filestore] 2016/02/23 18:02:49 /var/lib/influxdb/data/_internal/monitor/16293/000000014-000000002.tsm (#3) opened in 439.237µs
[filestore] 2016/02/23 18:02:49 /var/lib/influxdb/data/_internal/monitor/16293/000000015-000000001.tsm (#4) opened in 393.18µs
[cacheloader] 2016/02/23 18:02:49 reading file /var/lib/influxdb/wal/_internal/monitor/16293/_00046.wal, size 2191601
[cacheloader] 2016/02/23 18:02:50 reading file /var/lib/influxdb/wal/_internal/monitor/16293/_00059.wal, size 0
run: open server: open tsdb store: failed to open shard 1: new engine: unknown engine type
I assume that's some other issue. So I reverted back to v0.10.1 (where I then ran into #5772, but just restored the node.json to get past that).
Should I open a new issue for the nightly failure above?
@jhedlund That is expected for #5723. 0.11 removes support for b1 and bz1 shards. If you want to run the nightly and 0.11, you would need to convert your shards to tsm using influx_tsm.
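(For reference, the conversion is run offline against the data directory, roughly as below; the backup and data paths are placeholders, and the exact flags should be confirmed with influx_tsm -h for your version.)

influx_tsm -backup /path/to/shard-backup /var/lib/influxdb/data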
@jwilder Ok, thanks, I was not aware of that.
I have upgraded to the nightly, and upon startup the problem appears solved. So perhaps #5751 has resolved it. I'll know more, I suppose, in the coming hours if it does not crop back up.
Not sure if this issue is still occurring or not but #5996 may fix this.
Assuming this is fixed via #5996. Let us know if this issue is still occurring with latest nightlies.
This problem just cropped up today. I am using 0.10.0 (upgraded last Tuesday).
I am missing data depending on my WHERE clause query, for example, if I do:
SELECT first(open) AS open, max(high) AS high, min(low) AS low, last(close) AS close FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)
I am currently missing data from 15:28 UTC to 16:52 UTC today (the last 84 minutes; the results have been stuck at 15:28 UTC ever since 15:28 UTC).
However, if I simply change the time in the WHERE clause from "1d" to "500m", I get all the data up to the current time (16:52 UTC), e.g.:
SELECT first(open) AS open, max(high) AS high, min(low) AS low, last(close) AS close FROM prices."default".ohlc1m WHERE time > now() - 500m AND symbol = 'EURUSD' GROUP BY time(1m)
What could cause that problem?