pyr / cyanite

cyanite stores your metrics
http://cyanite.io

segment search regex leads to timeouts due to large return #246

Open tehlers320 opened 7 years ago

tehlers320 commented 7 years ago

To reproduce, run this Grafana or graphite-api search: prod.us-west-2.collectd_metrics.*.*.disk-*.disk_ops.write

Cyanite will query this: SELECT * from segment WHERE pos = 8 AND segment LIKE 'prod.us-west-2.collectd_metrics.%' ALLOW FILTERING;
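For illustration only, here is a hypothetical Python sketch of how such a glob could be reduced to the pos/prefix query above (this is not Cyanite's actual code, which is Clojure — just an assumed reconstruction of the behavior):

```python
# Hypothetical reduction of a Graphite glob to a segment-table query:
# pos is the number of dot-separated components, and the LIKE prefix is
# everything before the first component containing a wildcard.
def glob_to_segment_query(pattern):
    parts = pattern.split('.')
    prefix = []
    for part in parts:
        if any(c in part for c in '*?[{'):
            break
        prefix.append(part)
    return ("SELECT * from segment WHERE pos = %d AND segment LIKE '%s.%%' "
            "ALLOW FILTERING;" % (len(parts), '.'.join(prefix)))

print(glob_to_segment_query(
    'prod.us-west-2.collectd_metrics.*.*.disk-*.disk_ops.write'))
# SELECT * from segment WHERE pos = 8
#   AND segment LIKE 'prod.us-west-2.collectd_metrics.%' ALLOW FILTERING;
```

Note that everything after the first wildcard is lost from the prefix, which is why the query returns far more rows than needed.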

This data contains many metrics that we do not need for this query, for example:

prod.us-west-2.collectd_metrics.caps-competition-api-web.9b95f0-ip-10-1-1-247.interface-eth0.if_octets

cqlsh -k metric -e "SELECT * from segment WHERE pos = 8 AND segment LIKE 'prod.us-west-2.collectd_metrics.%'  ALLOW FILTERING;"

<stdin>:1:errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1

I tried to query for this, but it doesn't seem to be supported in Cassandra: SELECT * from segment WHERE pos = 8 AND segment LIKE 'prod.us-west-2.collectd_metrics.%.disk_ops.write' ALLOW FILTERING;

Should Cyanite consider using tags? I wonder how much this would grow the segment table.

tags { "prod", "us-west-2", "collectd_metrics", "cyanite-cassandra", "ip-10-1-1-23", "disk-xvdf", "disk_ops", "write" }

Then the query could match on any '.'-separated component that is not a wildcard.

I did not test this... SELECT * FROM segment WHERE tags CONTAINS 'prod' AND tags CONTAINS 'us-west-2' AND tags CONTAINS 'collectd_metrics' AND tags CONTAINS 'disk_ops' AND tags CONTAINS 'write';
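The idea above could be sketched in Python (hypothetical, not part of Cyanite): take the literal, non-wildcard components of the glob and turn each into a CONTAINS clause.

```python
# Hypothetical: build a CONTAINS query from the literal (non-wildcard)
# components of a Graphite glob.
def glob_to_tags_query(pattern):
    literal = [p for p in pattern.split('.')
               if not any(c in p for c in '*?[{')]
    clauses = ' AND '.join("tags CONTAINS '%s'" % p for p in literal)
    return 'SELECT * FROM segment WHERE %s;' % clauses

print(glob_to_tags_query(
    'prod.us-west-2.collectd_metrics.*.*.disk-*.disk_ops.write'))
# SELECT * FROM segment WHERE tags CONTAINS 'prod' AND ...
#   AND tags CONTAINS 'write';
```

Note that this drops partial-wildcard components such as disk-* entirely, so they would still need post-filtering.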

ifesdjeen commented 7 years ago

I'm afraid the query you mention will perform even worse than the current implementation. Cassandra picks the most selective index, queries it, and filters out the rest of the results, even if more indexes are available.

Although after checking your query I've noticed two things:

  1. We can use a different tokenizer (currently we're using one that splits results letter by letter). Since we do not require that level of detail, we can use a tokenizer that splits on words, which may result in smaller trees and better traversal.
  2. We can do a CONTAINS query and do less post-filtering in the case you indicated.
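To illustrate point (1), here is a toy Python comparison of the two tokenization strategies (assumed behavior for illustration, not the actual index tokenizers):

```python
# Toy comparison: character-level vs word-level tokenization of a segment.
def char_tokens(segment):
    return list(segment)        # one token per letter: many tiny tokens

def word_tokens(segment):
    return segment.split('.')   # one token per path component

segment = 'prod.us-west-2.disk_ops'
print(len(char_tokens(segment)), len(word_tokens(segment)))  # 23 3
```

Fewer, larger tokens mean a much smaller index tree to traverse per lookup.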

I'm going to start with (2) right after I'm done with #244

ifesdjeen commented 7 years ago

I've tested an alternative tokeniser and I have good news: we most likely will be able to (yet again) significantly improve the performance.

I'll still have to modify it to support disk-* style queries; right now it only supports full-segment wildcards (like *), but in general it turns out we can still improve.
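A disk-* component match can also be done client-side with a per-component glob check. A minimal Python sketch of such a post-filter (hypothetical, not the Clojure implementation):

```python
import fnmatch

# Hypothetical post-filter: a segment matches when it has the same number
# of components and each component matches the corresponding glob part.
def segment_matches(pattern, segment):
    p_parts, s_parts = pattern.split('.'), segment.split('.')
    return (len(p_parts) == len(s_parts) and
            all(fnmatch.fnmatchcase(s, p)
                for p, s in zip(p_parts, s_parts)))

print(segment_matches('prod.*.disk-*.disk_ops.write',
                      'prod.host-a.disk-xvdf.disk_ops.write'))  # True
```

Splitting on '.' before matching keeps a * from spilling across component boundaries, mirroring Graphite wildcard semantics.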

ifesdjeen commented 7 years ago

Tokeniser impl (prototype) can be found in #249.

tehlers320 commented 7 years ago

I pulled #248 and #249 into our test environment and truncated my segment and metric tables. Searching the tree seems snappy (but it always does when I truncate). I will let it re-populate over a day or so. I am not able to retrieve metrics, though.

Our "query" host is spamming this:

ERROR [2016-10-03 21:26:35,953] epollEventLoopGroup-3-1 - io.cyanite.api could not process request
clojure.lang.ArityException: Wrong number of args (2) passed to: index/fn--6086/G--6068--6095

ifesdjeen commented 7 years ago

@tehlers320 do you have a full stack trace?..

ifesdjeen commented 7 years ago

@tehlers320 I've found the reason and pushed the fix to #248. It was incorrect arity usage on my side...

jacobrichard commented 7 years ago

I pulled in the #248 fix and pushed it into our environment. As @tehlers320 reported, the tree is snappy but now I'm seeing a different error when retrieving metrics.

ERROR [2016-10-07 18:12:30,789] epollEventLoopGroup-3-1 - io.cyanite.api could not process request
java.lang.IllegalArgumentException: Don't know how to create ISeq from: clojure.core$partial$fn__4759
    at clojure.lang.RT.seqFrom(RT.java:542) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.RT.seq(RT.java:523) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$map$fn__4785.invoke(core.clj:2637) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.RT.seq(RT.java:521) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:641) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$mapcat.invokeStatic(core.clj:2674) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$maybe_multiplex.invokeStatic(api.clj:109) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$fn__7657.invokeStatic(api.clj:151) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$fn__7657.invoke(api.clj:145) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.MultiFn.invoke(MultiFn.java:229) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$process.invokeStatic(api.clj:89) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$make_handler$fn__7666.invoke(api.clj:167) [cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.http$request_handler$fn__1910.invoke(http.clj:110) [cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.http$netty_handler$fn__1918.invoke(http.clj:125) [cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.http.proxy$io.netty.channel.ChannelInboundHandlerAdapter$ff19274a.channelRead(Unknown Source) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:428) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:276) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:263) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:243) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:840) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:830) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:348) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) [cyanite-0.5.1-standalone.jar:na]

ifesdjeen commented 7 years ago

@jacobrichard thanks for catching this one. Could you tell me what kind of queries you are running?

jacobrichard commented 7 years ago

This is just a metric retrieval from graphite-web. Specifically for one of the internal reporter metrics for cyanite:

cyanite.us-west-2.$hostname.cyanite.ingestq.events.count

I redacted the hostname (since it was an IP), but that's the path to the metric from graphite-web.

ifesdjeen commented 7 years ago

@jacobrichard @tehlers320 right it was my bad. My usual testing path did not include graphite-web (until now). I'll test more thoroughly with graphite-web from now on. It's fixed and force-pushed to #248.

For #248 I would not expect changes in performance yet (that is a job for #249), but I hope to finish both of them over this weekend. #248 only exposes _min, _max and other metrics (to close #244).

tehlers320 commented 7 years ago

Out of curiosity, would the ElasticSearch index have had this problem as well?

ifesdjeen commented 7 years ago

Yes, this happened only because we included those "fake" metric names. In Grafana you would expect such metrics to pop up in autocomplete; with name expansion via graphite-web you don't, and that leads to the trouble..

tehlers320 commented 7 years ago

Are we talking about the same issue? I mean the timeout issue caused by the table being too big.

ifesdjeen commented 7 years ago

You're right. It all boils down to what kind of wildcard is supported: if we can only query by prefix and/or suffix, it'll still be the same...

Technically, we could have a Cyanite node-local index, but then we'd have synchronisation and / or update problems...

tehlers320 commented 7 years ago

It's really Docker that makes this unmanageable, I think. Our tree just grows without bound, since container names change so often and create new subtrees over and over.

stats.gauges.foo-app.ads1239adsfz
stats.gauges.foo-app.shadsf1239ad
stats.gauges.foo-app.89asdf39adsf

The hash comes from the servername inside the container. We are kicking around an idea to have the host come online and "register" with something, then rename it to "server001, server002" based on how many of the same nodes are up. I wonder if anybody has solved this problem already. Even with InfluxDB you would accumulate millions of hostname tags over time.
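One way to sketch the "register" idea, assuming a central registry that hands out stable slot names (the class and names here are hypothetical, not an existing tool):

```python
# Hypothetical registry: maps ephemeral container IDs to stable slot names,
# so metric paths stay bounded (server001, server002, ...) instead of
# growing with every new container hash.
class SlotRegistry:
    def __init__(self, prefix='server'):
        self.prefix = prefix
        self.slots = {}

    def register(self, container_id):
        # Re-registering the same container yields the same stable name.
        if container_id not in self.slots:
            self.slots[container_id] = '%s%03d' % (self.prefix,
                                                   len(self.slots) + 1)
        return self.slots[container_id]

reg = SlotRegistry()
print(reg.register('ads1239adsfz'))  # server001
print(reg.register('shadsf1239ad'))  # server002
print(reg.register('ads1239adsfz'))  # server001
```

A real version would also have to reclaim slots when containers die and survive registry restarts, which is where the synchronisation problems mentioned above come back in.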

ifesdjeen commented 7 years ago

After a lot of back-and-forth I've figured out how to use the tokeniser for better and faster queries and more lightweight trees in #256.