influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0
28.59k stars 3.54k forks source link

[0.9.1-2] sometimes values duplicated in query #3315

Closed ccutrer closed 9 years ago

ccutrer commented 9 years ago
> SELECT value FROM "energy" WHERE circuit='solaredge' AND time > 1436805424s and time < 1436810749s
name: energy
------------
time            value
2015-07-13T16:45:00Z    -2042.2585
2015-07-13T16:45:00Z    -2042.2585
2015-07-13T17:00:00Z    -1981.1044
2015-07-13T17:00:00Z    -1981.1044
2015-07-13T17:15:00Z    -2194.5276
2015-07-13T17:15:00Z    -2194.5276
2015-07-13T17:30:00Z    -2241.2217
2015-07-13T17:30:00Z    -2241.2217
2015-07-13T17:45:00Z    -2283.4297
2015-07-13T17:45:00Z    -2283.4297
2015-07-13T18:00:00Z    -2319.878
2015-07-13T18:00:00Z    -2319.878

I'm using 0.9.2pre as of 7b815fe8aa4f10aca7f7269c6c925c84bb6586c3. After I restart influx, the query is fixed:

> SELECT value FROM "energy" WHERE circuit='solaredge' AND time > 1436805424s and time < 1436810749s
name: energy
------------
time            value
2015-07-13T16:45:00Z    -2042.2585
2015-07-13T17:00:00Z    -1981.1044
2015-07-13T17:15:00Z    -2194.5276
2015-07-13T17:30:00Z    -2241.2217
2015-07-13T17:45:00Z    -2283.4297
2015-07-13T18:00:00Z    -2319.878

But after some amount of time and other queries, is starts doubling things again. I originally noticed this because a SUM(value) was returning twice as much as I expected.

beckettsean commented 9 years ago

@ccutrer Any other ideas for how to reproduce this? Are you running a single node or a cluster? Stock config or do you have things like authentication or custom port bindings? How are you writing your data to the system?

ccutrer commented 9 years ago

single node. stock config except for authentication is enabled

ccutrer commented 9 years ago

would a log file and/or a dump of the db be helpful? it shouldn't be too large - about 40 series, most with one value per minute, for a few weeks

beckettsean commented 9 years ago

@ccutrer db dump is probably not helpful yet, but the log responses around the queries might be interesting.

beckettsean commented 9 years ago

@ccutrer what is the result of SELECT value FROM "energy" WHERE circuit='solaredge' AND time > 1436805424s and time < 1436810749s GROUP BY *?

beckettsean commented 9 years ago

We're postulating that you have multiple points with different tag sets but the same timestamp, and the query is returning all of them. That would mean this is intended behavior, although a bit unintuitive.

ccutrer commented 9 years ago
> SELECT value FROM "energy" WHERE circuit='solaredge' AND time > 1436901871s and time < 1436903850s
name: energy
------------
time                    value
2015-07-14T19:30:00Z    -1865
2015-07-14T19:30:00Z    -1865
2015-07-14T19:45:00Z    -2421.8076
2015-07-14T19:45:00Z    -2421.8076

> SELECT value FROM "energy" WHERE circuit='solaredge' AND time > 1436901871s and time < 1436903850s GROUP BY *
name: energy
tags: circuit=solaredge
time                    value
----                    -----
2015-07-14T19:30:00Z    -1865
2015-07-14T19:30:00Z    -1865
2015-07-14T19:45:00Z    -2421.8076
2015-07-14T19:45:00Z    -2421.8076

> show series
name: energy
------------
_key                            circuit
energy,circuit=a_c              a_c
energy,circuit=computer         computer
energy,circuit=dryer            dryer
energy,circuit=freezer          freezer
energy,circuit=fridge           fridge
energy,circuit=furnace          furnace
energy,circuit=hot_tub          hot_tub
energy,circuit=outside          outside
energy,circuit=shed             shed
energy,circuit=solar            solar
energy,circuit=stove            stove
energy,circuit=solaredge        solaredge

name: power
-----------
_key                    circuit
power,circuit=a_c       a_c
power,circuit=computer  computer
power,circuit=dryer     dryer
power,circuit=freezer   freezer
power,circuit=fridge    fridge
power,circuit=furnace   furnace
power,circuit=hot_tub   hot_tub
power,circuit=outside   outside
power,circuit=shed      shed
power,circuit=solar     solar
power,circuit=stove     stove
power,circuit=solaredge solaredge

name: temperature
-----------------
_key                            probe
temperature,probe=inverter      inverter

name: voltage
-------------
_key                                    circuit         form
voltage,circuit=solaredge,form=dc       solaredge       dc

what's interesting at the moment is that using the timeframe from yesterday, I'm not getting duplication, but the timeframe from today I am. Also, I'm not seeing duplication in the power measurement, only in the energy measurement. My scripts for pushing data into influx are at https://github.com/ccutrer/energy_monitoring_scripts. They are run by a cron job every 5 minutes. The log (at /var/log/influxdb/influxd.log) does not have anything in between logging the http requests (which, btw, you should both use Basic auth for the influx CLI, and also not log the password in the log even if it's in a query string).

beckettsean commented 9 years ago

@ccutrer can you share a tgz of the influx directory, with the /data, /meta, and /hh directories? Let me know how you'd prefer to transmit that and we'll make it work.

Also, https://github.com/influxdb/influxdb/issues/3004#issuecomment-114578053 implies that you might have dropped and recreated this series one or more times. Is that true? Just trying to track down the possible repro parameters.

ccutrer commented 9 years ago

I've got the dump (it was taken while influxdb wasn't running) in my dropbox. If you e-mail me, I'll send you the link to it (it's only 15M compressed). I gave up on dropping the series, and ended up dropping the entire database and recreating from scratch. Given the behavior that it works after restarting influx, but at some point in the future (with no intervening queries) the duplicates start being returned, my suspicion is how I'm sending data (re-sending some of the same points with each run of my scripts, cause my energy monitoring utils aren't always up-to-the-minute, so I don't know if I missed any on the last run) causes a duplication in some in-memory cache, and on disk.

The only other weirdness that might be related is I had one point (in the power measurement, for circuit=hot_tub series) that my monitor returned a bogus (extremely high negative value) that I was manually overwriting to set to 0, I accidentally didn't include the .0 to make it a float, so it was written as an integer. But I also incorrectly set the precision, so it was interpreted as a very small value near the epoch. I then correct to a float, and corrected the precision, and successfully overwrote the bogus point. But given that it's for the power measurement, and I'm not seeing any duplications with power, I don't think it's related. The only other thing that changed at the time this issue started is I updated from a 0.9.1pre to a 0.9.2pre build.

ccutrer commented 9 years ago

A bit more information - I can confirm that after a restart of influxdb, there is no duplication. When I run push_ted_to_influx.rb, the energy,circuit=outside series starts showing duplication (but not energy,circuit=solaredge, which is not populated from push_ted_to_influx.rb). If I run it again immediately, Influx returns a 500 with a body of "timeout" for the first batch, and 204 for the second batch. The log line for that 500 is:

[http] 2015/07/15 21:17:28 ::1 - ted [15/Jul/2015:21:17:23 -0600] POST /write?db=ccutrer_home&precision=s HTTP/1.1 500 32 - Ruby 2cfb9606-2b69-11e5-801b-000000000000 5.05570578s

The only other log lines on each side is my query above, and a 204 for the next successful batch.

Then I run push_solaredge_to_influx.rb, and the solaredge series starts returning duplication.

ccutrer commented 9 years ago

Ahha! I just realized my power series are also getting duplicated. I just didn't see it before because I'm graphing the mean of the value, not the sum.

Oliv4945 commented 9 years ago

Hi, following advice of Sean, I post here because I have the same problem but with the mean() function. In my case restarting InfluxDB does not have any impact. Stock configuration but

  retention-check-period = "10m0s"
  retention-create-period = "45m0s"
> SHOW CONTINUOUS QUERIES
name: db_note
-------------
name            query
downSampling_10m    CREATE CONTINUOUS QUERY downSampling_10m ON db_note BEGIN SELECT mean(value) AS value INTO "db_note"."ret_7d_10m".sensorsData FROM "db_note"."ret_6h".sensorsData GROUP BY time(10m), * END
downSampling_1h     CREATE CONTINUOUS QUERY downSampling_1h ON db_note BEGIN SELECT mean(value) INTO "db_note"."ret_31d_1h".sensorsData FROM "db_note"."ret_7d_10m".sensorsData GROUP BY time(1h), * END
> SELECT * FROM ret_7d_10m.sensorsData WHERE data='Humidity' and room='Bedroom'
name: sensorsData
tags: data=Humidity, room=Bedroom
time            mean    value
----            ----    -----
2015-07-21T21:50:00Z        60.62222222222222
2015-07-21T21:50:00Z        60.62222222222222
2015-07-21T22:00:00Z        60.05
2015-07-21T22:00:00Z        60.05
2015-07-21T22:10:00Z        60.175
2015-07-21T22:10:00Z        59.849999999999994
2015-07-21T22:20:00Z        59.55

See the values at 2015-07-21T22:10:00Z, not the same. Might be just relate to float value ?

Do you need anything else ?

beckettsean commented 9 years ago

Another report of data duplication in #3419, same basic symptoms.

beckettsean commented 9 years ago

Also possible duplication in #3373

jhedlund commented 9 years ago

With my issue I reported in #3419, I've stumbled on a clue: I am overwriting the data with the same timestamps for down sampling (therefore CQ's would cause this issue as well), the duplicate data looks like it eventually removes itself (in about 15 minutes or so).

Does an overwrite just flag the old data to be deleted instead of overwriting it?

pauldix commented 9 years ago

Ah, I see what this is. It is because you're posting a data point twice and we have an in memory cache associated with the WAL. Basically, points get written to the WAL and cached in memory. Periodically, these will get flushed to the index on disk.

If you write in points with the same timestamp, those get written to the WAL and in-memory cache first. If you then query, you'll get the union of the on disk index and the WAL.

We just need to update the logic so that if there are any data points with the exact same timestamp in a series in both the index and the WAL, we only use the one from the WAL in the resulting return.

@benbjohnson, can you have a look at this one?

benbjohnson commented 9 years ago

@pauldix Sure, I'll get it fixed.

benbjohnson commented 9 years ago

Fixed: https://github.com/influxdb/influxdb/pull/3482.

Oliv4945 commented 9 years ago

Hi, I just updated to f994a97 and I still have the problem. In my case, I double checked, I am not posting my points twice.

Also, it seems that data are not doubled before having this kind of entries in the log : "[v1] 2015/07/28 14:55:38 flush 7 points in 0.017s". What does it means ?

benbjohnson commented 9 years ago

@Oliv4945 Flushing is the process of writing the points in the write ahead log (WAL) to an indexed format. Writing to the indexed format is slow so we batch up writes in the write ahead log first and keep a buffer of those in-memory.

I'm not sure how you'd see duplicates after a flush if they weren't being written twice. Do you have a script that can reproduce your issue?

otoolep commented 9 years ago

This may have been fixed by https://github.com/influxdb/influxdb/pull/3626.

Can you try a nightly build @ccutrer and see if you still have issues?

https://influxdb.com/download/index.html

ccutrer commented 9 years ago

I'm on armhf so can't use a nightly. I built 9b964f93cc8cf6914eb1c5a2ca8a044f393b419d (I use package.sh to generate my deb), and I can't verify this bug is fixed because once I start influx, any requests just hang, with nothing in the log. I've rolled back to my 0.9.2pre deb

otoolep commented 9 years ago

@benbjohnson -- I will close this. Re-open if you think it's a different issue.

Oliv4945 commented 8 years ago

Sorry @benbjohnson I did not had time to give it, but I just updated to 82c50bb and I do not have the issue anymore. Thank you !

ryan-williams commented 8 years ago

I was just able to insert some duplicates on a build from yesterday's HEAD with the tsm1 engine:

> select * from stage where start='false'
name: stage
-----------
time                    id      start   app     value
1445816603000000000     0.0     false   0465    1
1445816603000000000     0.0     false   0465    1
1445816937000000000     1.0     false   0465    1
1445816937000000000     1.0     false   0465    1
1445818026000000000     3.0     false   0465    1
1445818026000000000     3.0     false   0465    1
1445818318000000000     4.0     false   0465    1
1445818318000000000     4.0     false   0465    1
1445819466000000000     7.0     false   0465    1
1445819466000000000     7.0     false   0465    1
1445819765000000000     8.0     false   0465    1
1445819765000000000     8.0     false   0465    1

I sent basically the same data to Influx via a curl several times while testing a script that generates the data in question. This one measurement ended up with dupes but several others (all of them, afaict) did not.

beckettsean commented 8 years ago

@ryan-williams if you wait ~1 hour or bounce the process are the values still duplicated? It might be limited to the in-RAM representation and clear up once the points are fully flushed to disk.