Closed. jhorwit2 closed this issue 9 years ago.
I too am getting timeouts in continuous queries:
[continuous_querier] 2015/07/27 12:46:58 timeout
[continuous_querier] 2015/07/27 12:46:58 error during recompute previous: timeout. running: SELECT sum(value) INTO "connecto"."default".page_calls_pm FROM "connecto"."default".page_calls WHERE time >= '2015-07-27T12:45:00Z' AND time < '2015-07-27T12:46:00Z' GROUP BY time(1m)
[continuous_querier] 2015/07/27 12:46:58 error executing query: CREATE CONTINUOUS QUERY page_calls_per_minute ON connecto BEGIN SELECT sum(value) INTO "connecto"."default".page_calls_pm FROM "connecto"."default".page_calls GROUP BY time(1m) END: err = timeout
Although in my case, all the writes were timing out anyway:
[http] 2015/07/27 12:46:59 23.101.29.30 - root [27/Jul/2015:12:46:54 +0000] POST /write?u=root&p=root&database=connecto&precision=ms&db=connecto HTTP/1.1 500 8 - - 8f583854-345d-11e5-9330-000000000000 5.0011198s
Yea, this is the 3rd time in 24hrs it has happened. I finally noticed it in the logs, so I opted to remove that CQ. (I updated to 0.9.2 about 24hrs ago now)
The spikes are the times of the HTTP write timeouts :(
Restarts seem to help just fine for a little while, but as I mentioned in #3469, I get ungraceful restarts w/ panics.
@mathurs did your write timeouts start very closely to the CQ timeout?
No, actually my writes start timing out gradually after server start. I am writing at about 1000 qps, and initially the writes are fine, but after a few minutes the latency gradually increases and the writes start to time out. Once that happens, the continuous queries also start to time out, so I don't think the CQs cause the write timeouts.
Try adjusting your WAL settings.
I lowered my max-wal-size to 1048576 (1 MB) and I have been able to write over 3k points per second sustained. My points are relatively small in size, though, so my requests aren't huge. You might want to tweak accordingly. WAL flushes will block writes, so a write immediately after a flush will have slightly higher latency (nothing too crazy from what I have seen).
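For reference, a guess at the corresponding influxdb.conf fragment; the section and key name here follow the 0.9.x config layout, so verify against your own file:

```toml
# Hypothetical 0.9.x config fragment; check your influxdb.conf for the exact
# section, since WAL settings moved around between releases.
[data]
  max-wal-size = 1048576   # 1 MB: flush the WAL more often, in smaller chunks
```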
Tried this also, and tried reducing other load on the server; it doesn't seem to help. In fact, my reads are also frozen now.
@jhorwit2 @mathurs do you have similar restart experiences as I reported in https://github.com/influxdb/influxdb/issues/3488?
@beckettsean yes. I get the same "port already in use" error, and I just have to wait for it to finally die before I can restart to clear the timeouts. I have gotten this error at least 6 times since Sunday.
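The "wait for it to finally die before I can restart" step can be scripted as a small sketch; the influxd name and config path are the ones used elsewhere in this thread, and the actual restart is left commented out:

```shell
# Sketch: wait for a named process to fully exit before restarting it,
# avoiding the "address already in use" error from a premature restart.
wait_for_exit() {
  name="$1"
  while pgrep -x "$name" > /dev/null; do
    sleep 1
  done
}

# Usage (assumes the influxd invocation from this thread):
# pkill -TERM influxd
# wait_for_exit influxd
# influxd -config /usr/local/etc/influxdb.conf 2>> ~/workspace/logs/influxd.log &
```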
I did notice that if I removed my CQ that was doing a WHERE tag =~ /regex/, it doesn't fail. Your CQ wouldn't happen to have a regex, would it?
My CQ does not have a regex. It's quite simple:
CREATE CONTINUOUS QUERY foo ON telegraf BEGIN
SELECT mean(value) INTO cpu_idle_5m FROM cpu_idle GROUP BY time(5m), cpu
END
I am selecting from and writing to the same retention policy, which means reads and writes to the same shards. Does your CQ also write to the same RP from which it pulls data?
Yes, all my CQ and writes use my new default retention policy. (everything uses the same RP)
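If same-RP contention does turn out to be the culprit, one hedged sketch is to point the CQ's INTO clause at a separate retention policy, so CQ writes and source reads hit different shards. The "archive" RP below is hypothetical and would need to be created first:

```sql
-- Hypothetical: write the rollup into a separate RP so the CQ's writes
-- land on different shards than the raw data it reads.
CREATE RETENTION POLICY archive ON telegraf DURATION 52w REPLICATION 1
CREATE CONTINUOUS QUERY cpu_idle_5m_cq ON telegraf BEGIN
  SELECT mean(value) INTO "telegraf"."archive".cpu_idle_5m
  FROM "telegraf"."default".cpu_idle
  GROUP BY time(5m), cpu
END
```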
@mathurs are you reading from and writing to the same retention policy in your CQ that times out?
Could be we have write/read lock contention at the shard level, exacerbated by constrained resource availability.
Is there any way to report metrics on such things currently (gonna guess no after quickly looking through the shard write code), or would it require profiling the app?
I think most of the profiling the core team does involves attaching a profiler to the running process; nothing baked in and user-facing. Unfortunately SHOW STATS and SHOW DIAGNOSTICS are both still waiting to be wired back up following the cluster redesign. Your improved CQ logging PR was merged and should be available in the recent nightly builds.
@dgnorton @corylanou @pauldix any recommendations for how @jhorwit2 could investigate the CQ locking & timeouts a bit more closely?
CQ timeout, then 500s on writes. The database is fine for queries, but all telegraf writes return 500.
[http] 2015/07/28 23:47:02 127.0.0.1 - - [28/Jul/2015:23:47:02 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf 9e6b2b72-35bd-11e5-8fb8-000000000000 2.217053ms
[http] 2015/07/28 23:47:09 127.0.0.1 - - [28/Jul/2015:23:47:09 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf a296e94b-35bd-11e5-8fb9-000000000000 2.104355ms
[http] 2015/07/28 23:47:16 127.0.0.1 - - [28/Jul/2015:23:47:16 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf a6c3fd99-35bd-11e5-8fba-000000000000 2.052723ms
[shard] 2015/07/28 23:47:20 flush 1511 points in 0.008s
[continuous_querier] 2015/07/28 23:47:25 timeout
[continuous_querier] 2015/07/28 23:47:25 error: timeout. running: SELECT mean(value) INTO "telegraf"."default".cpu_idle_5m FROM "telegraf"."default".cpu_idle WHERE time >= '2015-07-29T06:45:00Z' AND time < '2015-07-29T06:50:00Z' GROUP BY time(5m), cpu
[continuous_querier] 2015/07/28 23:47:25 error executing query: CREATE CONTINUOUS QUERY foo ON telegraf BEGIN SELECT mean(value) INTO "telegraf"."default".cpu_idle_5m FROM "telegraf"."default".cpu_idle GROUP BY time(5m), cpu END: err = timeout
[http] 2015/07/28 23:47:28 127.0.0.1 - - [28/Jul/2015:23:47:23 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf aaefed1c-35bd-11e5-8fbb-000000000000 5.007557306s
[http] 2015/07/28 23:47:35 127.0.0.1 - - [28/Jul/2015:23:47:30 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf af1b58e9-35bd-11e5-8fbc-000000000000 5.002282771s
pkill influxd and a restart with
influxd -config /usr/local/etc/influxdb.conf 2>> ~/workspace/logs/influxd.log &
leads to a hard shutdown:
[http] 2015/07/29 11:37:09 127.0.0.1 - - [29/Jul/2015:11:37:04 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf cf30c9e7-3620-11e5-92ab-000000000000 5.005835919s
[run] 2015/07/29 11:37:10 signal received, initializing clean shutdown...
[run] 2015/07/29 11:37:10 waiting for clean shutdown...
[snapshot] 2015/07/29 11:37:10 snapshot listener closed
[tcp] 2015/07/29 11:37:10 cluster service accept error: network connection closed
[metastore] 2015/07/29 11:37:13 [INFO] raft: Node at 127.0.0.1:8088 [Follower] entering Follower state
[metastore] 2015/07/29 11:37:13 read local node id: 1
[metastore] 2015/07/29 11:37:14 [WARN] raft: Heartbeat timeout reached, starting election
[metastore] 2015/07/29 11:37:14 [INFO] raft: Node at 127.0.0.1:8088 [Candidate] entering Candidate state
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Votes needed: 1
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Vote granted. Tally: 1
[metastore] 2015/07/29 11:37:14 [INFO] raft: Election won. Tally: 1
[metastore] 2015/07/29 11:37:14 [INFO] raft: Node at 127.0.0.1:8088 [Leader] entering Leader state
[metastore] 2015/07/29 11:37:14 [INFO] raft: Disabling EnableSingleNode (bootstrap)
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:14 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[http] 2015/07/29 11:37:16 127.0.0.1 - - [29/Jul/2015:11:37:11 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf d35cea29-3620-11e5-92ac-000000000000 5.000702979s
[http] 2015/07/29 11:37:23 127.0.0.1 - - [29/Jul/2015:11:37:18 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf d788f44f-3620-11e5-92ad-000000000000 5.000658076s
[snapshot] 2015/07/29 11:37:24 snapshot listener closed
[tcp] 2015/07/29 11:37:24 cluster service accept error: network connection closed
[shard-precreation] 2015/07/29 11:37:24 precreation service terminating
run: open server: open service: listen tcp :8083: bind: address already in use
[http] 2015/07/29 11:37:30 127.0.0.1 - - [29/Jul/2015:11:37:25 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf dbb4716f-3620-11e5-92ae-000000000000 5.004789863s
[http] 2015/07/29 11:37:37 127.0.0.1 - - [29/Jul/2015:11:37:32 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf dfe14191-3620-11e5-92af-000000000000 5.004511327s
[run] 2015/07/29 11:37:40 time limit reached, initializing hard shutdown
After the hard shutdown, a second restart works just fine, and writes are back.
[metastore] 2015/07/29 11:37:57 [INFO] raft: Node at 127.0.0.1:8088 [Follower] entering Follower state
[metastore] 2015/07/29 11:37:57 read local node id: 1
[shard] 2015/07/29 11:37:57 flush 1247 points in 0.019s
[shard] 2015/07/29 11:37:57 flush 886 points in 0.014s
[shard] 2015/07/29 11:37:58 flush 1494 points in 0.022s
[shard] 2015/07/29 11:37:58 flush 1407 points in 0.020s
[shard] 2015/07/29 11:37:58 flush 1233 points in 0.018s
[shard] 2015/07/29 11:37:58 flush 1133 points in 0.016s
[shard] 2015/07/29 11:37:58 flush 1305 points in 0.020s
[admin] 2015/07/29 11:37:58 listening on HTTP: [::]:8083
[httpd] 2015/07/29 11:37:58 authentication enabled: false
[httpd] 2015/07/29 11:37:58 listening on HTTP: [::]:8086
2015/07/29 11:37:58 InfluxDB starting, version 0.9.2, commit 6c0a91f775f9fc0e625d17ffa04a3fe86945ba09
2015/07/29 11:37:58 GOMAXPROCS set to 8
[run] 2015/07/29 11:37:58 listening for signals
[metastore] 2015/07/29 11:37:59 [WARN] raft: Heartbeat timeout reached, starting election
[metastore] 2015/07/29 11:37:59 [INFO] raft: Node at 127.0.0.1:8088 [Candidate] entering Candidate state
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Votes needed: 1
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Vote granted. Tally: 1
[metastore] 2015/07/29 11:37:59 [INFO] raft: Election won. Tally: 1
[metastore] 2015/07/29 11:37:59 [INFO] raft: Node at 127.0.0.1:8088 [Leader] entering Leader state
[metastore] 2015/07/29 11:37:59 [INFO] raft: Disabling EnableSingleNode (bootstrap)
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/29 11:37:59 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/29 11:37:59 Sending anonymous usage statistics to m.influxdb.com
[http] 2015/07/29 11:38:00 127.0.0.1 - - [29/Jul/2015:11:38:00 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf f091c3b2-3620-11e5-8001-000000000000 1.811326ms
[http] 2015/07/29 11:38:07 127.0.0.1 - - [29/Jul/2015:11:38:07 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf f4bded57-3620-11e5-8002-000000000000 2.539653ms
Sorry for the delay in my response. Yes, the restart log looks very similar. I have to issue a clean shutdown first, wait for it to reach the hard time limit, and then restart. Except in my case it's very predictable: I restart it, and within 5-10 minutes it reaches the timeout stage and everything freezes. @beckettsean, it's not just my CQs; all my writes are also timing out.
Does anyone notice panics during this restart process? I see this: #3468
ok, I'm guessing this has to do with how the WAL flushes currently work. We're in the middle of working on a solution for this. Will hopefully make it into master early next week.
Is there a ticket on that? I'm curious to see the problem / fix.
@jhorwit2 There are probably a dozen different ones open on it right now. I've opened #3514 as an issue to track the work on the WAL specifically.
:+1:
Not sure if I'm experiencing the same issue or a variant (with 0.9.2). I am not using continuous queries (after https://github.com/influxdb/influxdb/issues/3362), but seem to be having the same symptoms after running a query (occasionally).
normal writes, the sun is shining, all is good:
[http] 2015/08/06 21:25:31 10.6.28.3 - - [06/Aug/2015:21:25:31 +0000] POST /write?db=metrics_production&precision=ms HTTP/1.1 204 0 - - aa4ed19c-3c81-11e5-bf00-000000000000 24.31452ms
then, the killer query (with some vars replaced with x/y) - although it's not slow, it takes only ~4ms:
[http] 2015/08/06 21:25:31 10.12.6.246 - - [06/Aug/2015:21:25:31 +0000] GET /query?db=metrics_production&q=select+mean(value)+from+%22cpu.load.value%22+where+%22zonename%22+%3d+%27y%27+and+%22account%22+%3d+%27x%27+and+%22period%22+%3d+%275min%27+and+time+%3e+now()+-+1h+GROUP+BY+time(1m)%3b%0d%0aselect+mean(value)+from+%22net.bytes.rate%22+where+%22zonename%22+%3d+%27y%27+and+%22account%22+%3d+%27x%27+and+%22op%22+%3d+%27w%27+and+time+%3e+now()+-+1h+GROUP+BY+time(1m)%3b%0d%0aselect+mean(value)+from+%22net.bytes.rate%22+where+%22zonename%22+%3d+%27y%27+and+%22account%22+%3d+%27x%27+and+%22op%22+%3d+%27r%27+and+time+%3e+now()+-+1h+GROUP+BY+time(1m)%3b%0d%0aselect+mean(value)+from+%22memory.usage.value%22+where+%22zonename%22+%3d+%27y%27+and+%22account%22+%3d+%27x%27+and+%22type%22+%3d+%27rss%27+and+time+%3e+now()+-+1h+GROUP+BY+time(1m)%3b HTTP/1.1 200 25 - - aaaef87a-3c81-11e5-bf08-000000000000 4.16173ms
everything 500's after this point, with ~5s response times (presumably a timeout of some kind):
[http] 2015/08/06 21:25:36 10.6.21.4 - - [06/Aug/2015:21:25:31 +0000] POST /write?db=metrics_production&precision=ms HTTP/1.1 500 32 - - aa51ac2e-3c81-11e5-bf01-000000000000 5.002281318s
and finally the hard shutdown part:
[http] 2015/08/06 21:26:49 10.6.20.3 - - [06/Aug/2015:21:26:44 +0000] POST /write?db=metrics_production&precision=ms HTTP/1.1 500 32 - - d5cd0dcb-3c81-11e5-8009-000000000000 5.018538153s
[run] 2015/08/06 21:26:49 time limit reached, initializing hard shutdown
URL decoded version of the query for easier reading:
select mean(value) from "cpu.load.value" where "zonename" = 'y' and "account" = 'x' and "period" = '5min' and time > now() - 1h GROUP BY time(1m);
select mean(value) from "net.bytes.rate" where "zonename" = 'y' and "account" = 'x' and "op" = 'w' and time > now() - 1h GROUP BY time(1m);
select mean(value) from "net.bytes.rate" where "zonename" = 'y' and "account" = 'x' and "op" = 'r' and time > now() - 1h GROUP BY time(1m);
select mean(value) from "memory.usage.value" where "zonename" = 'y' and "account" = 'x' and "type" = 'rss' and time > now() - 1h GROUP BY time(1m);
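The decoding above can be reproduced with Python's standard urllib; the fragment below is a shortened version of the first SELECT from the logged request:

```python
# Decode a percent-encoded InfluxDB /query string like the one in the log.
# This fragment is the first SELECT from the request above, shortened.
from urllib.parse import unquote_plus

encoded = (
    "select+mean(value)+from+%22cpu.load.value%22+where+%22zonename%22"
    "+%3d+%27y%27+and+time+%3e+now()+-+1h+GROUP+BY+time(1m)"
)
print(unquote_plus(encoded))
```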
@nicksellen if you issue those queries individually does it still lead to the 500s and the shutdown?
I ended up having to turn off all my CQs because of this :(. Now writes never time out.
@jhorwit2 CQs will be a significant focus of the 0.9.4 release. There are still a lot of rough edges there.
@beckettsean ah, this could make sense if having a bunch of queries causes a timeout and triggers the same behaviour as a CQ causing a timeout. I never had the problem with queries made from Grafana (which are all individual).
It's happening less frequently now (or maybe not at all), as I am running the queries themselves less frequently (for other reasons); the sample size would be so small I wouldn't know if the change to separate queries had made a difference. I'll try that if it increases in frequency again, though. Good suggestion.
And to clarify, the shutdown was always manually issued via service influxdb stop - it just takes a while and eventually does the hard shutdown stuff.
@nicksellen thanks for the clarification on the hard shutdown, I'm glad InfluxDB isn't committing suicide. Something about the locking behavior that leads to the slow queries and stalled writes also seems to stall the shutdown process.
I think it's less that the number of queries causes the problem, and more that the deadlocking issue becomes more likely as write volume and number of points queried rise. High enough write volume and a single query might cause the locking. Query enough points and a low write volume might still start timing out. Running four queries on different measurements covers more points than a single one, and maybe issuing them in the same statement makes the locking more likely.
In any event, this performance issue is the major focus of the remaining 0.9.3 effort, and the changes to the WAL have dramatically reduced the risk in our testing. Looking forward to your results with the 0.9.3 final late next week.
I would like to share my experience on this thread. I am writing very little data, 1-5 writes per second, each with 1-5 points.
I tested with the count function in the CQ query and it did not succeed: I received 500s after some time. Testing without the count function in the CQ query succeeded; no 500s.
Is it likely that one of the reasons for the "500s related to the CQs" issue is the count(field) aggregate function?
@eminden thanks for the report. It is possible that certain aggregations, like COUNT, are more likely to lead to 500 errors. What were your exact continuous queries? Can you show the one with the COUNT aggregate that leads to errors, and the one without COUNT that doesn't lead to 500 errors?
@beckettsean sorry for not getting back to you about what you asked. I am not able to give you concrete steps to reproduce it, but I did some tests that I would like to share; here is what I found:
I think this might be solved by a couple of the PRs that are open: #3522 #3517.
We might want to wait for, I'm guessing, 0.9.4 (or a nightly once those get merged).
@eminden are you saying that when the aggregate produces null for a GROUP BY interval, the write fails? That makes sense, and I suppose it would require a fill() statement. I'm not sure there's any other workaround, since interpolating or guessing at the proper value would be wrong at least half the time. Maybe the CQ should not write that point?
@jhorwit2 that's correct, 0.9.4 is slated for major CQ work, but 0.9.3 is still focused on clustering.
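A hedged sketch of the fill() idea mentioned above, reusing the cpu_idle measurement from earlier in the thread; whether fill() is honored inside a CQ may depend on the 0.9.x version, so treat this as the interactive-query form:

```sql
-- Sketch: substitute 0 for intervals with no data so the query never
-- produces a null point for an empty GROUP BY interval.
SELECT mean(value) INTO cpu_idle_5m FROM cpu_idle
WHERE time > now() - 1h
GROUP BY time(5m), cpu fill(0)
```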
@beckettsean
@eminden thanks for the clarification on #1, I see what you're saying now. It's not that AS wasn't working, it's that if the user doesn't supply aliases for multiple instances of the same function, the inserted results are wrong.
Can you open a new issue for that? I don't see one currently, and we should address that in the parser or some other way in the CQs.
Brett, absolutely InfluxDB recommends using multiple CQs to downsample at multiple intervals to multiple retention policies, as appropriate for your retention plans, but I personally wouldn't call that chaining, since they run concurrently on the data, not consecutively.
By "chaining CQs" I mean aggregating raw into 1m, and then aggregating 1m into 15m, and then 15m into 1h, etc. That introduces uncertainties because each successive aggregation is summarizing a summary.
Much better to aggregate raw to 1m, raw to 15m, raw to 1h, etc., all running in parallel and using raw data as the source.
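A sketch of that parallel layout, reusing the cpu_idle example from earlier in the thread; the RP names (rp_1m, rp_15m) are hypothetical:

```sql
-- Parallel downsampling: each CQ reads the raw series directly,
-- and none reads another CQ's output.
CREATE CONTINUOUS QUERY cq_cpu_1m ON telegraf BEGIN
  SELECT mean(value) INTO "telegraf"."rp_1m".cpu_idle_1m
  FROM "telegraf"."default".cpu_idle GROUP BY time(1m), cpu
END
CREATE CONTINUOUS QUERY cq_cpu_15m ON telegraf BEGIN
  SELECT mean(value) INTO "telegraf"."rp_15m".cpu_idle_15m
  FROM "telegraf"."default".cpu_idle GROUP BY time(15m), cpu
END
```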
If you aren't using an aggregation function, but instead a selector like FIRST, LAST, TOP, MAX, etc., then there's no real loss of fidelity from downsampling already downsampled data, although the timestamps will get muddled until #1577 is fixed.
Created https://github.com/influxdb/influxdb/issues/3643 for the aliasing issue reported by @eminden
By your definition, then, we are chaining CQs; we are always downsampling "one step" from one retention policy into a longer one with larger bins. I was under the impression that this was a well-supported use case, based on @pauldix's previous post:
From there you can see that downsampling can happen from the raw space or from other already downsampled spaces.
The remaining confusion for me is that, without doing this, we'd have to keep raw data around for the second-longest of all our retention policies. In our case, I think it's 1-week bins for a year, so we would have to keep raw data (10-second bins) for a week, rather than for an hour as we currently do. This defeats a chunk of the space savings of having cascading downsampling.
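Rough arithmetic on that trade-off, using the 10-second raw bins mentioned above; the per-series framing is an assumption:

```python
# Cost of keeping 10-second raw points for a week vs. for an hour,
# per series, as described in the comment above.
RAW_BIN_SECONDS = 10

points_per_hour = 3600 // RAW_BIN_SECONDS           # raw kept for 1 hour
points_per_week = 7 * 24 * 3600 // RAW_BIN_SECONDS  # raw kept for 1 week

print(points_per_hour)                      # 360
print(points_per_week)                      # 60480
print(points_per_week // points_per_hour)   # 168x more raw points per series
```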
Basically I guess I'm making a request that the race conditions get fixed. :-) I don't really think you're planning to punt that to the user, but this seems like a perfectly reasonable use case given the abilities of (continuous) queries to pull from specific retention policies.
@brettdh It is a supported use case. I didn't say it was forbidden, just that it had side effects best avoided if possible.
Your second point is very valid, you have to ensure that the sampled data persists for at least as long as the GROUP BY interval for the downsampled aggregations. In your case that will necessitate downsampling from already aggregated data.
To be perfectly clear, I'm not at all saying you cannot run a CQ on already downsampled data, I'm just saying don't do that if you don't have to.
More CQ docs are coming as part of the 0.9.4 release, stay tuned.
Awesome, thanks for clarifying. 0.9.3 and 0.9.4 look to have a bunch of things that I'm really looking forward to.
Should be solved with 0.9.3. Please reopen if that isn't the case.
I'm waiting, probably for 0.9.4, before I try CQs again, but will do.
I just noticed a CQ timed out randomly after running fine for the last 12 or so hours since I added it.
CQ:
The logs showed only this
Everything after this was constant 5s timeouts on writes.
releaseVersion has relatively low cardinality (about 20 or so for this query).