
[0.9.2] CQ timeout caused http write timeouts. #3469

Closed jhorwit2 closed 9 years ago

jhorwit2 commented 9 years ago

I just noticed a CQ timed out randomly after running fine for the last 12 or so hours since I added it.

CQ:

CREATE CONTINUOUS QUERY lo_requests_revs ON database BEGIN select count(PARAMS.uid) into lo_requests_revs from "300lo" group by time(1m), releaseVersion END

The logs showed only this:

[continuous_querier] 2015/07/27 08:51:56 timeout
[continuous_querier] 2015/07/27 08:51:56 error: timeout. running: SELECT count(PARAMS.uid) INTO "database"."week_policy".lo_requests_revs FROM "database"."week_policy"."300lo" WHERE time >= '2015-07-27T12:51:00Z' AND time < '2015-07-27T12:52:00Z' GROUP BY time(1m), releaseVersion
[continuous_querier] 2015/07/27 08:51:56 error executing query: CREATE CONTINUOUS QUERY lo_requests_revs ON database BEGIN SELECT count(PARAMS.uid) INTO "database"."week_policy".lo_requests_revs FROM "database"."week_policy"."300lo" GROUP BY time(1m), releaseVersion END: err = timeout
[http] 2015/07/27 08:51:57 10.15.5.228 - - [27/Jul/2015:08:51:52 -0400] POST /write?db=database&precision=ms&rp= HTTP/1.1 500 32 - Apache-HttpClient/4.4.1 (Java/1.8.0_45) 40cd5308-345e-11e5-9c0b-000000000000 5.003602351s
[shard] 2015/07/27 08:51:57 flush 2 points in 0.027s

Everything after this was constant 5s timeouts on writes.

releaseVersion has relatively low cardinality (about 20 or so for this query)

mathurs commented 9 years ago

I too am getting a timeout in continuous queries

[continuous_querier] 2015/07/27 12:46:58 timeout
[continuous_querier] 2015/07/27 12:46:58 error during recompute previous: timeout. running: SELECT sum(value) INTO "connecto"."default".page_calls_pm FROM "connecto"."default".page_calls WHERE time >= '2015-07-27T12:45:00Z' AND time < '2015-07-27T12:46:00Z' GROUP BY time(1m)
[continuous_querier] 2015/07/27 12:46:58 error executing query: CREATE CONTINUOUS QUERY page_calls_per_minute ON connecto BEGIN SELECT sum(value) INTO "connecto"."default".page_calls_pm FROM "connecto"."default".page_calls GROUP BY time(1m) END: err = timeout

Although in my case, all the writes were timing out anyway.

[http] 2015/07/27 12:46:59 23.101.29.30 - root [27/Jul/2015:12:46:54 +0000] POST /write?u=root&p=root&database=connecto&precision=ms&db=connecto HTTP/1.1 500 8 - - 8f583854-345d-11e5-9330-000000000000 5.0011198s

jhorwit2 commented 9 years ago

Yeah, this is the 3rd time it has happened in 24 hrs. I finally noticed it in the logs, so I opted to remove that CQ. (I updated to 0.9.2 about 24 hrs ago now.)

Spikes are the times of the HTTP write timeouts :(

(graph screenshot omitted)

Restarts seem to help just fine for a little while, but as I mentioned in #3469, I get ungraceful restarts w/ panics.

jhorwit2 commented 9 years ago

@mathurs did your write timeouts start very closely to the CQ timeout?

mathurs commented 9 years ago

No, actually my writes start timing out gradually after server start. I am writing at about 1000 qps; initially the writes are fine, but after a few minutes the latency gradually increases and the writes start to time out. Once that happens, the continuous queries also start to time out, so I don't think the CQs cause the write timeouts.

jhorwit2 commented 9 years ago

Try adjusting your WAL settings.

I lowered my max-wal-size to 1048576 (1 MB) and I have been able to write over 3k points per second sustained. My points are relatively small in size, though, so my requests aren't huge. You might want to tweak accordingly. WAL flushes will block writes, so a write immediately after a flush will have slightly higher latency (nothing too crazy from what I have seen).
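
For reference, something along these lines in influxdb.conf; I'm assuming the max-wal-size key sits in the [data] section as in the 0.9.x sample config, so adjust for your version:

[data]
  # Keep the WAL small so it flushes more often; each flush then blocks writes for less time.
  # 1048576 bytes = 1 MB, the value that worked for me.
  max-wal-size = 1048576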

mathurs commented 9 years ago

I tried this and also tried reducing other load on the server; it doesn't seem to help. In fact, my reads are also frozen now.

beckettsean commented 9 years ago

@jhorwit2 @mathurs do you have similar restart experiences as I reported in https://github.com/influxdb/influxdb/issues/3488?

jhorwit2 commented 9 years ago

@beckettsean yes. I get the same "port already in use" error, and I just have to wait for it to finally die before I can restart to clear the timeouts. I have gotten this error at least 6 times since Sunday.

I did notice that if I removed my CQ that was doing a WHERE tag =~ /regex/, it doesn't fail. Your CQ wouldn't happen to have a regex, would it?

beckettsean commented 9 years ago

My CQ does not have a regex. It's quite simple:

CREATE CONTINUOUS QUERY foo ON telegraf BEGIN 
   SELECT mean(value) INTO cpu_idle_5m FROM cpu_idle GROUP BY time(5m), cpu 
END

I am selecting from and writing to the same retention policy, which means reads and writes to the same shards. Does your CQ also write to the same RP from which it pulls data?
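
For comparison, a CQ that reads from the default RP but writes into a separate one would look something like this (the "downsampled" RP name is just illustrative):

CREATE CONTINUOUS QUERY foo ON telegraf BEGIN
   SELECT mean(value) INTO "telegraf"."downsampled".cpu_idle_5m FROM "telegraf"."default".cpu_idle GROUP BY time(5m), cpu
END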

jhorwit2 commented 9 years ago

Yes, all my CQs and writes use my new default retention policy (everything uses the same RP).

beckettsean commented 9 years ago

@mathurs are you reading from and writing to the same retention policy in your CQ that times out?

It could be that we have write/read lock contention at the shard level, exacerbated by constrained resource availability.

jhorwit2 commented 9 years ago

Is there any way to report metrics on such things currently (I'm gonna guess no after quickly looking through the shard write code), or would it require profiling the app?

beckettsean commented 9 years ago

I think most of the profiling the core team does involves attaching a profiler to the running process, nothing baked in and user-facing. Unfortunately SHOW STATS and SHOW DIAGNOSTICS are both still waiting to be wired back up following the cluster redesign. Your improved CQ logging PR was merged and should be available in the recent nightly builds.

@dgnorton @corylanou @pauldix any recommendations for how @jhorwit2 could investigate the CQ locking & timeouts a bit more closely?

beckettsean commented 9 years ago

#3488 has another case of this happening. Same experience last night: my CQ timed out just after a WAL flush, and all writes were 500s until two restarts later:

CQ timeout, then 500s on writes. The database is fine for queries, but all telegraf writes return 500.

[http] 2015/07/28 23:47:02 127.0.0.1 - - [28/Jul/2015:23:47:02 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf 9e6b2b72-35bd-11e5-8fb8-000000000000 2.217053ms
[http] 2015/07/28 23:47:09 127.0.0.1 - - [28/Jul/2015:23:47:09 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf a296e94b-35bd-11e5-8fb9-000000000000 2.104355ms
[http] 2015/07/28 23:47:16 127.0.0.1 - - [28/Jul/2015:23:47:16 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf a6c3fd99-35bd-11e5-8fba-000000000000 2.052723ms
[shard] 2015/07/28 23:47:20 flush 1511 points in 0.008s
[continuous_querier] 2015/07/28 23:47:25 timeout
[continuous_querier] 2015/07/28 23:47:25 error: timeout. running: SELECT mean(value) INTO "telegraf"."default".cpu_idle_5m FROM "telegraf"."default".cpu_idle WHERE time >= '2015-07-29T06:45:00Z' AND time < '2015-07-29T06:50:00Z' GROUP BY time(5m), cpu
[continuous_querier] 2015/07/28 23:47:25 error executing query: CREATE CONTINUOUS QUERY foo ON telegraf BEGIN SELECT mean(value) INTO "telegraf"."default".cpu_idle_5m FROM "telegraf"."default".cpu_idle GROUP BY time(5m), cpu END: err = timeout
[http] 2015/07/28 23:47:28 127.0.0.1 - - [28/Jul/2015:23:47:23 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf aaefed1c-35bd-11e5-8fbb-000000000000 5.007557306s
[http] 2015/07/28 23:47:35 127.0.0.1 - - [28/Jul/2015:23:47:30 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf af1b58e9-35bd-11e5-8fbc-000000000000 5.002282771s

Running pkill influxd and restarting with influxd -config /usr/local/etc/influxdb.conf 2>> ~/workspace/logs/influxd.log & leads to a hard shutdown:

[http] 2015/07/29 11:37:09 127.0.0.1 - - [29/Jul/2015:11:37:04 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf cf30c9e7-3620-11e5-92ab-000000000000 5.005835919s
[run] 2015/07/29 11:37:10 signal received, initializing clean shutdown...
[run] 2015/07/29 11:37:10 waiting for clean shutdown...
[snapshot] 2015/07/29 11:37:10 snapshot listener closed
[tcp] 2015/07/29 11:37:10 cluster service accept error: network connection closed
[metastore] 2015/07/29 11:37:13 [INFO] raft: Node at 127.0.0.1:8088 [Follower] entering Follower state
[metastore] 2015/07/29 11:37:13 read local node id: 1
[metastore] 2015/07/29 11:37:14 [WARN] raft: Heartbeat timeout reached, starting election
[metastore] 2015/07/29 11:37:14 [INFO] raft: Node at 127.0.0.1:8088 [Candidate] entering Candidate state
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Votes needed: 1
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Vote granted. Tally: 1
[metastore] 2015/07/29 11:37:14 [INFO] raft: Election won. Tally: 1
[metastore] 2015/07/29 11:37:14 [INFO] raft: Node at 127.0.0.1:8088 [Leader] entering Leader state
[metastore] 2015/07/29 11:37:14 [INFO] raft: Disabling EnableSingleNode (bootstrap)
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[http] 2015/07/29 11:37:16 127.0.0.1 - - [29/Jul/2015:11:37:11 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf d35cea29-3620-11e5-92ac-000000000000 5.000702979s
[http] 2015/07/29 11:37:23 127.0.0.1 - - [29/Jul/2015:11:37:18 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf d788f44f-3620-11e5-92ad-000000000000 5.000658076s
[snapshot] 2015/07/29 11:37:24 snapshot listener closed
[tcp] 2015/07/29 11:37:24 cluster service accept error: network connection closed
[shard-precreation] 2015/07/29 11:37:24 precreation service terminating
run: open server: open service: listen tcp :8083: bind: address already in use
[http] 2015/07/29 11:37:30 127.0.0.1 - - [29/Jul/2015:11:37:25 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf dbb4716f-3620-11e5-92ae-000000000000 5.004789863s
[http] 2015/07/29 11:37:37 127.0.0.1 - - [29/Jul/2015:11:37:32 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf dfe14191-3620-11e5-92af-000000000000 5.004511327s
[run] 2015/07/29 11:37:40 time limit reached, initializing hard shutdown

After the hard shutdown, a second restart works just fine, and writes are back.

[metastore] 2015/07/29 11:37:57 [INFO] raft: Node at 127.0.0.1:8088 [Follower] entering Follower state
[metastore] 2015/07/29 11:37:57 read local node id: 1
[shard] 2015/07/29 11:37:57 flush 1247 points in 0.019s
[shard] 2015/07/29 11:37:57 flush 886 points in 0.014s
[shard] 2015/07/29 11:37:58 flush 1494 points in 0.022s
[shard] 2015/07/29 11:37:58 flush 1407 points in 0.020s
[shard] 2015/07/29 11:37:58 flush 1233 points in 0.018s
[shard] 2015/07/29 11:37:58 flush 1133 points in 0.016s
[shard] 2015/07/29 11:37:58 flush 1305 points in 0.020s
[admin] 2015/07/29 11:37:58 listening on HTTP: [::]:8083
[httpd] 2015/07/29 11:37:58 authentication enabled: false
[httpd] 2015/07/29 11:37:58 listening on HTTP: [::]:8086
2015/07/29 11:37:58 InfluxDB starting, version 0.9.2, commit 6c0a91f775f9fc0e625d17ffa04a3fe86945ba09
2015/07/29 11:37:58 GOMAXPROCS set to 8
[run] 2015/07/29 11:37:58 listening for signals
[metastore] 2015/07/29 11:37:59 [WARN] raft: Heartbeat timeout reached, starting election
[metastore] 2015/07/29 11:37:59 [INFO] raft: Node at 127.0.0.1:8088 [Candidate] entering Candidate state
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Votes needed: 1
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Vote granted. Tally: 1
[metastore] 2015/07/29 11:37:59 [INFO] raft: Election won. Tally: 1
[metastore] 2015/07/29 11:37:59 [INFO] raft: Node at 127.0.0.1:8088 [Leader] entering Leader state
[metastore] 2015/07/29 11:37:59 [INFO] raft: Disabling EnableSingleNode (bootstrap)
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/29 11:37:59 Sending anonymous usage statistics to m.influxdb.com
[http] 2015/07/29 11:38:00 127.0.0.1 - - [29/Jul/2015:11:38:00 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf f091c3b2-3620-11e5-8001-000000000000 1.811326ms
[http] 2015/07/29 11:38:07 127.0.0.1 - - [29/Jul/2015:11:38:07 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf f4bded57-3620-11e5-8002-000000000000 2.539653ms
mathurs commented 9 years ago

Sorry for the delay in my response. Yes, the restart log looks very similar. I have to cleanly shut down first, wait for it to reach the hard time limit, and then restart. Except in my case it's very predictable: I restart it, and within 5-10 minutes it reaches the timeout stage and everything freezes. @beckettsean, it's not just my CQs; all my writes are also timing out.

jhorwit2 commented 9 years ago

Does anyone else notice panics during this restart process? I see this: #3468

pauldix commented 9 years ago

OK, I'm guessing this has to do with how the WAL flushes currently work. We're in the middle of working on a solution for this; it will hopefully make it into master early next week.

jhorwit2 commented 9 years ago

Is there a ticket on that? I'm curious to see the problem/fix.

pauldix commented 9 years ago

@jhorwit2 There are probably a dozen different ones open on it right now. I've opened #3514 as an issue to track the work on the WAL specifically.

jhorwit2 commented 9 years ago

:+1:

nicksellen commented 9 years ago

Not sure if I'm experiencing the same issue or a variant (with 0.9.2). I am not using continuous queries (after https://github.com/influxdb/influxdb/issues/3362), but I seem to be having the same symptoms after running a query (occasionally).

normal writes, the sun is shining, all is good:

[http] 2015/08/06 21:25:31 10.6.28.3 - - [06/Aug/2015:21:25:31 +0000] POST /write?db=metrics_production&precision=ms HTTP/1.1 204 0 - - aa4ed19c-3c81-11e5-bf00-000000000000 24.31452ms

Then the killer query (with some vars replaced with x/y). It's not even slow; it takes only ~4ms:

[http] 2015/08/06 21:25:31 10.12.6.246 - - [06/Aug/2015:21:25:31 +0000] GET /query?db=metrics_production&q=select+mean(value)+from+%22cpu.load.value%22+where+%22zonename%22+%3d+%27y%27+and+%22account%22+%3d+%27x%27+and+%22period%22+%3d+%275min%27+and+time+%3e+now()+-+1h+GROUP+BY+time(1m)%3b%0d%0aselect+mean(value)+from+%22net.bytes.rate%22+where+%22zonename%22+%3d+%27y%27+and+%22account%22+%3d+%27x%27+and+%22op%22+%3d+%27w%27+and+time+%3e+now()+-+1h+GROUP+BY+time(1m)%3b%0d%0aselect+mean(value)+from+%22net.bytes.rate%22+where+%22zonename%22+%3d+%27y%27+and+%22account%22+%3d+%27x%27+and+%22op%22+%3d+%27r%27+and+time+%3e+now()+-+1h+GROUP+BY+time(1m)%3b%0d%0aselect+mean(value)+from+%22memory.usage.value%22+where+%22zonename%22+%3d+%27y%27+and+%22account%22+%3d+%27x%27+and+%22type%22+%3d+%27rss%27+and+time+%3e+now()+-+1h+GROUP+BY+time(1m)%3b HTTP/1.1 200 25 - - aaaef87a-3c81-11e5-bf08-000000000000 4.16173ms

Everything 500s after this point, with ~5s response times (presumably a timeout of some kind):

[http] 2015/08/06 21:25:36 10.6.21.4 - - [06/Aug/2015:21:25:31 +0000] POST /write?db=metrics_production&precision=ms HTTP/1.1 500 32 - - aa51ac2e-3c81-11e5-bf01-000000000000 5.002281318s

and finally the hard shutdown part:

[http] 2015/08/06 21:26:49 10.6.20.3 - - [06/Aug/2015:21:26:44 +0000] POST /write?db=metrics_production&precision=ms HTTP/1.1 500 32 - - d5cd0dcb-3c81-11e5-8009-000000000000 5.018538153s
[run] 2015/08/06 21:26:49 time limit reached, initializing hard shutdown
beckettsean commented 9 years ago

URL decoded version of the query for easier reading:

select mean(value) from "cpu.load.value" where "zonename" = 'y' and "account" = 'x' and "period" = '5min' and time > now() - 1h GROUP BY time(1m);
select mean(value) from "net.bytes.rate" where "zonename" = 'y' and "account" = 'x' and "op" = 'w' and time > now() - 1h GROUP BY time(1m);
select mean(value) from "net.bytes.rate" where "zonename" = 'y' and "account" = 'x' and "op" = 'r' and time > now() - 1h GROUP BY time(1m);
select mean(value) from "memory.usage.value" where "zonename" = 'y' and "account" = 'x' and "type" = 'rss' and time > now() - 1h GROUP BY time(1m);

@nicksellen if you issue those queries individually does it still lead to the 500s and the shutdown?

jhorwit2 commented 9 years ago

I ended up having to turn off all my CQs because of this :(. Now writes never time out.

beckettsean commented 9 years ago

@jhorwit2 CQs will be a significant focus of the 0.9.4 release. There are still a lot of rough edges there.

nicksellen commented 9 years ago

@beckettsean ah, this could make sense if having a bunch of queries causes a timeout and triggers the same behaviour as a CQ causing a timeout. I never had the problem with queries made from Grafana (which are all individual).

It's happening less frequently now (or maybe not at all), as I am running the queries themselves less frequently (for other reasons), so the sample size would be too small for me to tell whether switching to separate queries made a difference. I'll try that if it increases in frequency again, though. Good suggestion.

And to clarify, the shutdown was always manually issued via service influxdb stop; it just takes a while and eventually does the hard shutdown stuff.

beckettsean commented 9 years ago

@nicksellen thanks for the clarification on the hard shutdown, I'm glad InfluxDB isn't committing suicide. Something about the locking behavior that leads to the slow queries and stalled writes also seems to stall the shutdown process.

I think it's less that the number of queries causes the problem, and more that the deadlocking issue becomes more likely as the write volume and the number of points queried rise. With a high enough write volume, a single query might cause the locking. Query enough points, and even a low write volume might still start timing out. Running four queries on different measurements covers more points than a single one, and maybe issuing them in the same statement makes the locking more likely.

In any event, this performance issue is the major focus of the remaining 0.9.3 effort, and the changes to the WAL have dramatically reduced the risk in our testing. Looking forward to your results with the 0.9.3 final late next week.

eminden commented 9 years ago

I would like to share my experience on this thread. I am writing very little data: 1~5 writes per second, each with 1~5 points.

I tested with the count function in the CQ query and it did not succeed; I received 500s after some time. I tested without the count function in the CQ query and it succeeded; no 500s.

So is it likely that one of the reasons for the "500s related to the CQs" issue is the count(field) aggregate function?

beckettsean commented 9 years ago

@eminden thanks for the report. It is possible that certain aggregations, like COUNT, are more likely to lead to 500 errors. What were your exact continuous queries? Can you show the one with the COUNT aggregate that leads to errors, and the one without COUNT that doesn't lead to 500 errors?

eminden commented 9 years ago

@beckettsean sorry for not getting back to you about what you asked. I am not able to give you concrete steps to reproduce it, but I did some tests that I would like to share. Here is what I found:

  1. Do not use the same aggregation function multiple times without an "AS" alias; this results in only one field in the destination measurement. For example, do not create a CQ like create continuous query test on testdb begin select sum(field1), sum(field2) from measurement end; instead use create continuous query test on testdb begin select sum(field1) as field1sum, sum(field2) as field2sum from measurement end (see the sketch after this list).
  2. If some fields have null values, use fill(); otherwise this causes 500s on writes.
  3. Chained CQs (raw to 5min, 5min to 1hour) have some problem; I have seen the same 500s-on-writes issue when two CQs are chained.
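
A minimal sketch of findings 1 and 2 combined, with made-up field and retention policy names and a 5m GROUP BY interval:

-- hypothetical example: alias each sum() so both survive as separate fields,
-- and use fill(0) so intervals with null aggregations don't produce a failing write
CREATE CONTINUOUS QUERY test ON testdb BEGIN
  SELECT sum(field1) AS field1sum, sum(field2) AS field2sum
  INTO "testdb"."default".measurement_5m
  FROM measurement
  GROUP BY time(5m) fill(0)
END
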
jhorwit2 commented 9 years ago

I think this might be solved by a couple of the PRs that are open: #3522, #3517.

We might want to wait for (I'm guessing) 0.9.4, or a nightly once those get merged.

beckettsean commented 9 years ago

@eminden

  1. This is fixed by this PR and will be in 0.9.3: https://github.com/influxdb/influxdb/pull/3564
  2. I'm not sure I understand. Are you saying that when a CQ runs and the aggregations come up null for the GROUP BY interval, the write fails? That makes sense, and I suppose it would require a fill() statement. I'm not sure there's any other workaround, since interpolating or guessing at the proper value would be wrong at least half the time. Maybe the CQ should not write that point?
  3. It's best not to chain CQs, but to run all the aggregations from the raw data, e.g. raw to 5min, raw to 1hour, etc. (see the sketch below). The end results are the same, but it prevents data race issues with the CQs and also eliminates the errors from summarizing a summary.
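
For example, with illustrative retention policy names, both CQs read the raw data in parallel rather than one reading the other's output:

CREATE CONTINUOUS QUERY cpu_idle_5m ON telegraf BEGIN
  SELECT mean(value) INTO "telegraf"."five_min".cpu_idle_5m FROM "telegraf"."default".cpu_idle GROUP BY time(5m), cpu
END

CREATE CONTINUOUS QUERY cpu_idle_1h ON telegraf BEGIN
  SELECT mean(value) INTO "telegraf"."one_hour".cpu_idle_1h FROM "telegraf"."default".cpu_idle GROUP BY time(1h), cpu
END
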
beckettsean commented 9 years ago

@jhorwit2 that's correct, 0.9.4 is slated for major CQ work, but 0.9.3 is still focused on clustering.

eminden commented 9 years ago

@beckettsean

  1. I was testing with yesterday's nightly build. It seems what I summarised above has to be taken care of by the user, or Influx has to stop accepting the faulty CREATE CONTINUOUS QUERY command.
  2. My case was that raw data was inserted with null values and the CQ was summing up these fields; it is problematic without fill().
  3. I suppose performance will be degraded if you always query the raw data, right? I am trying to understand how CQs work internally. I believe I have seen that every CQ runs at the same interval (defined in the config file; I guess the default is 2m0s) and inserts aggregated data into the destination measurement regardless of the duration defined in GROUP BY time(). Is this correct?
beckettsean commented 9 years ago

@eminden thanks for the clarification on #1, I see what you're saying now. It's not that AS wasn't working, it's that if the user doesn't supply aliases for multiple instances of the same function, the inserted results are wrong.

Can you open a new issue for that? I don't see one currently, and we should address that in the parser or some other way in the CQs.

brettdh commented 9 years ago

@beckettsean

It's best not to chain CQs

This is a surprising statement, given prior conversations here, here, and here.

Are you only referring to chaining CQs into the same retention policy being a bad idea? We rely heavily on chaining CQs from short retention policies to longer ones.
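
Concretely, the pattern I mean looks something like this (retention policy and measurement names are made up; I alias the aggregate back to value so the next step can select the same field name):

CREATE CONTINUOUS QUERY cq_raw_to_1m ON mydb BEGIN
  SELECT mean(value) AS value INTO "mydb"."keep_1d".cpu_1m FROM "mydb"."keep_1h".cpu_raw GROUP BY time(1m)
END

CREATE CONTINUOUS QUERY cq_1m_to_1h ON mydb BEGIN
  SELECT mean(value) AS value INTO "mydb"."keep_1y".cpu_1h FROM "mydb"."keep_1d".cpu_1m GROUP BY time(1h)
END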

beckettsean commented 9 years ago

Brett, absolutely InfluxDB recommends using multiple CQs to downsample at multiple intervals to multiple retention policies, as appropriate for your retention plans, but I personally wouldn't call that chaining, since they run concurrently on the data, not consecutively.

By "chaining CQs" I mean aggregating raw into 1m, and then aggregating 1m into 15m, and then 15m into 1h, etc. That introduces uncertainties because each successive aggregation is summarizing a summary.

Much better to aggregate raw to 1m, raw to 15m, raw to 1h, etc., all running in parallel and using raw data as the source.

If you aren't using an aggregation function, but instead a selector like FIRST, LAST, TOP, MAX, etc., then there's no real loss of fidelity from downsampling already downsampled data, although the timestamps will get muddled until #1577 is fixed.

beckettsean commented 9 years ago

Created https://github.com/influxdb/influxdb/issues/3643 for the aliasing issue reported by @eminden

brettdh commented 9 years ago

By your definition, then, we are chaining CQs; we are always downsampling "one step" from one retention policy into a longer one with larger bins. I was under the impression that this was a well-supported use case, based on @pauldix's previous post:

From there you can see that downsampling can happen from the raw space or from other already downsampled spaces.

The remaining confusion for me is that, without doing this, we'd have to keep raw data around for the second-longest of all our retention policies. In our case, I think it's 1-week bins for a year, so we would have to keep raw data (10-second bins) for a week, rather than for an hour as we currently do. This defeats a chunk of the space savings of having cascading downsampling.

Basically I guess I'm making a request that the race conditions get fixed. :-) I don't really think you're planning to punt that to the user, but this seems like a perfectly reasonable use case given the abilities of (continuous) queries to pull from specific retention policies.

beckettsean commented 9 years ago

@brettdh It is a supported use case. I didn't say it was forbidden, just that it had side effects best avoided if possible.

Your second point is very valid, you have to ensure that the sampled data persists for at least as long as the GROUP BY interval for the downsampled aggregations. In your case that will necessitate downsampling from already aggregated data.

To be perfectly clear, I'm not at all saying you cannot run a CQ on already downsampled data, I'm just saying don't do that if you don't have to.

More CQ docs are coming as part of the 0.9.4 release, stay tuned.

brettdh commented 9 years ago

Awesome, thanks for clarifying. 0.9.3 and 0.9.4 look to have a bunch of things that I'm really looking forward to.

beckettsean commented 9 years ago

Should be solved with 0.9.3. Please reopen if that isn't the case.

jhorwit2 commented 9 years ago

I'm probably going to wait for 0.9.4 before I try CQs again, but will do.