thinkaurelius / titan

Distributed Graph Database
http://titandb.io
Apache License 2.0

Add Metrics monitoring #268

Closed dalaro closed 11 years ago

dalaro commented 11 years ago

We would like to instrument Titan using Metrics. Let's consider what to measure with Metrics and what application(s) we plan to initially use atop Metrics for charting (e.g. ganglia or graphite).

espeed commented 11 years ago

From a previous email...

Here are the three leading application performance management systems (in order) -- see http://www.compuware.com/application-performance-management/compuware-leadership-in-the-APM-market.html.

dynaTrace and AppDynamics specifically support Cassandra, and AppDynamics has partnered with DataStax (see the Martin Stone video below).

Opnet

dynaTrace

AppDynamics (used by Netflix, DataStax partner)

---//---

Killing two birds with one stone...

There is also Twitter's Zipkin monitoring system, which hooks into Finagle, Twitter's open-source RPC system for the JVM used to construct high-performance and high-concurrency servers:

Both are Apache 2.0 licensed.

Finagle might be a good candidate for Rexster 3.0 or TitanServer. It's built on Netty and implements uniform client and server APIs for several protocols. Most of Finagle’s code is protocol agnostic, simplifying the implementation of new protocols.

See the number of protocols it already supports: https://github.com/twitter/finagle

Matt Ho gives a great 35 min presentation on Finagle and the considerations/challenges to think about when designing an RPC system/protocol:

http://marakana.com/s/post/1416/twitter_finagle_for_the_asynchronous_programmer_matt_ho_video

One of his key points is that clients are harder to design than servers, and he advocates building on existing protocols.

Making it easy to build robust clients is a key consideration since you need a robust client in a language before people in that language community can start using the system.

Using an existing protocol that has existing battle-tested libraries would make developing robust RexPro clients easier and help ensure more consistent quality across implementations, without requiring driver authors to roll their own support for pooling, reconnects, etc.

Finagle has built in support for SPDY, which HTTP 2.0 is based on, and SPDY has nice properties to make it ideal for the backend as well.

See https://github.com/tinkerpop/rexster/issues/297

Finagle is written in Scala, but provides both Scala and Java idiomatic APIs.

There are more presentations/discussions at the bottom of this page: http://twitter.github.io/finagle/

Going the Finagle/Zipkin route would provide an Apache 2.0 RPC and monitoring platform with a large team behind it, and it would make developing client drivers easier.

spmallette commented 11 years ago

@dalaro let me know if you need any help on this one. it's probably harder to figure out "what" to track than it is to implement, but given the current state of Titan Server with embedded Rexster it would be nice if we could share a single MetricRegistry instance (I've already set up Rexster to allow for that).

dalaro commented 11 years ago

@espeed thanks for the pointers.

I really like Dapper/Zipkin style distributed tracing. The Zipkin UI's nested breakdown of function runtimes is powerful. I think it would also take substantially more engineering to get Titan instrumented for Zipkin than for Metrics. We don't use Finagle and probably can't wrap existing communication inside Finagle RPC in a sane way. It looks like we might need to write a new Titan component to track Zipkin trace and span IDs and to send trace messages using Thrift. The Zipkin project documents that format pretty well -- it's just that implementing it would take some thought. I think we'd additionally need to run Scribe and Zookeeper to support Zipkin. This is probably worthy of its own feature issue. I'm glad you mentioned it.

dalaro commented 11 years ago

@spmallette I've been thinking about initially tracking Timers and Counters around the diskstorage layer in Titan where we communicate with Cassandra/HBase/etc. Timers seem powerful since they deliver a histogram of durations. I've also been thinking about how to chart Timer's histogram output. This is an interesting approach to histogramming in graphite:

http://dieter.plaetinck.be/histogram-statsd-graphing-over-time-with-graphite.html

It's not exactly what I had in mind, since he's looking at frequencies rather than runtimes, but I think the approach may be adaptable. What are you using to collect and render Metrics output?
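To make the Timer idea concrete, here is a minimal stdlib-only sketch of timing a storage call and bucketing the durations into a histogram. `DurationHistogram` and its bucket bounds are hypothetical stand-ins chosen for illustration; this is not the Metrics `Timer` API or any Titan class.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.Supplier;

// Hypothetical stand-in for a Metrics-style timer; not the real Metrics API.
final class DurationHistogram {
    // Upper bucket bounds in milliseconds (assumed values); the last bucket is open-ended.
    private final long[] upperBoundsMs;
    private final AtomicLongArray counts;

    DurationHistogram(long... upperBoundsMs) {
        this.upperBoundsMs = upperBoundsMs.clone();
        this.counts = new AtomicLongArray(upperBoundsMs.length + 1);
    }

    // Record one duration in the first bucket whose bound is >= the duration.
    void record(long durationMs) {
        int i = 0;
        while (i < upperBoundsMs.length && durationMs > upperBoundsMs[i]) {
            i++;
        }
        counts.incrementAndGet(i);
    }

    // Time a call (for example a read against the storage backend) and record its duration.
    <T> T time(Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            record((System.nanoTime() - start) / 1_000_000L);
        }
    }

    // Number of recorded durations that fell into the given bucket.
    long count(int bucket) {
        return counts.get(bucket);
    }

    // Total number of recorded durations across all buckets.
    long total() {
        long sum = 0;
        for (int i = 0; i < counts.length(); i++) {
            sum += counts.get(i);
        }
        return sum;
    }
}
```

Wrapping each backend call in `time(...)` would give one bucket count per latency range, which is the kind of series that could be charted over time.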

Regarding sharing a MetricRegistry instance: we surely won't have metric name collisions regardless of the registry or registries used, since our classes are in different packages. I'd like to share a MetricRegistry across Titan and Rexster if it streamlines data collection and the implementation is clean. For deployments where Titan runs alone, is there a way to ask rexster-core for a MetricRegistry instance without first instantiating a Rexster service?

zachkinstner commented 11 years ago

Notes on dependencies from the issue referenced above:

spmallette commented 11 years ago

@dalaro as i mentioned over in zach's issue, i wish i'd seen the metrics version discrepancies before i got too deep with stuff. there were definitely some things in the stable beta that i wanted to make use of, but that's now left some dependency issues. perhaps there is some fancy pom work that can be done here?

I am using Timers in most cases. There are a small number of counters/gauges where it made sense. I also proxy JMX objects from underlying server pieces like Grizzly so that they are available in a single unified way.

Getting Titan to use the same registry is pretty easy and it will work even when Titan runs alone. Check out this class:

https://github.com/thinkaurelius/titan/blob/master/titan-core/src/main/java/com/thinkaurelius/titan/tinkerpop/rexster/TitanRexsterApplication.java#L20

Note that it extends this class:

https://github.com/tinkerpop/rexster/blob/master/rexster-core/src/main/java/com/tinkerpop/rexster/server/AbstractMapRexsterApplication.java

You just need to override the getMetricRegistry method in TitanRexsterApplication to return the Titan instance of the MetricRegistry. Rexster will then use that instance when registering its metrics. The metrics will no longer be prefixed with "rexster" but with whatever you name the registry which i think makes sense. Assuming you name the registry "titan", we'll get things like "titan.rexpro.sessions". Should work nicely if we can solve the dependency issues.
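The override described above can be sketched roughly as follows. Every class here (`NamedRegistry`, `BaseApplication`, `TitanApplication`) is a hypothetical stdlib-only stand-in, not the actual Rexster, Titan, or Metrics types; only the shape of the override and the resulting name prefix are meant to carry over.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a named metric registry; not the Metrics API.
final class NamedRegistry {
    private final String name;
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

    NamedRegistry(String name) {
        this.name = name;
    }

    // Metric names are prefixed with the registry name, e.g. "titan.rexpro.sessions".
    AtomicLong counter(String metric) {
        return counters.computeIfAbsent(name + "." + metric, k -> new AtomicLong());
    }

    boolean contains(String fullName) {
        return counters.containsKey(fullName);
    }
}

// Hypothetical stand-in for the Rexster-side base application class.
class BaseApplication {
    private final NamedRegistry own = new NamedRegistry("rexster");

    // Subclasses may override this to supply a shared registry.
    NamedRegistry getMetricRegistry() {
        return own;
    }

    // The server registers its metrics against whatever registry is returned.
    void openSession() {
        getMetricRegistry().counter("rexpro.sessions").incrementAndGet();
    }
}

// Hypothetical stand-in for the Titan-side subclass that hands over Titan's registry.
final class TitanApplication extends BaseApplication {
    private final NamedRegistry titanRegistry;

    TitanApplication(NamedRegistry titanRegistry) {
        this.titanRegistry = titanRegistry;
    }

    @Override
    NamedRegistry getMetricRegistry() {
        return titanRegistry;
    }
}
```

With a registry named "titan" passed in, the session counter ends up under "titan.rexpro.sessions", while the unmodified base class would keep using "rexster.rexpro.sessions".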

dalaro commented 11 years ago

@zachkinstner thanks for linking to the Fabric issue and pointing out HBase's Metrics dependency. I just found the Cassandra dep on 2.0.3 but hadn't seen the HBase one yet.

@spmallette what fancy pom work did you have in mind?

I'm considering the case where Rexster, Titan, and Cassandra are all running inside a shared JVM and each one wants to load Metrics. It's probably technically possible to resort to some unholy mess of classloader customization, but I'd like to avoid that in favor of something cleaner, if it exists. There might also be a way to exploit the fact that the Metrics package prefix changed from com.yammer to org.codahale around 3.0.0-BETA2.

I assume that any technical approach to using multiple versions of Metrics is going to cause some degree of ugliness and maintenance hazard, but I can't really judge the degree until I get into the details. I'm going to experiment...

Are you pretty much sold on using a newer Metrics (i.e. 3.0.0+) in Rexster? I don't know the functional differences between 2.0.3 and the latest versions of Metrics.

dalaro commented 11 years ago

I mistakenly wrote org.codahale above. The new package prefix is com.codahale.

zachkinstner commented 11 years ago

It may not be useful, but I was curious about it -- Cassandra's trunk branch uses Metrics 2.2.0: https://github.com/apache/cassandra/blob/trunk/build.xml#L388

spmallette commented 11 years ago

@dalaro if i remember correctly, some of the api elements were more straightforward. it also came ready out of the box with the admin servlets and the jersey integration which wasn't there beforehand. all-in-all, it was enough for me to look past the beta label and go for it.

dalaro commented 11 years ago

@spmallette One approach: use the latest Metrics beta (3.0.0-BETA3) in Rexster and Titan and leave the old Metrics versions in Cassandra and HBase unchanged. Metrics 3.0.0-BETA2 and -BETA3 use the new package prefix com.codahale.metrics. Cassandra and HBase would each retain their older Metrics versions under the com.yammer.metrics package prefix. This is a hack, but it would allow embedding Cassandra and Rexster in the same JVM without classloader chicanery for Metrics or reverting/reimplementing Rexster's Metrics features.

What do you think?

I built and tested titan-cassandra against the current rexster master (2.4.0-SNAPSHOT). As expected, the test RexsterServerClientCassandraEmbeddedTest failed with an exception similar to the one @zachkinstner identified and referenced:

java.lang.NoClassDefFoundError: com/yammer/metrics/Metric
        at com.thinkaurelius.titan.tinkerpop.rexster.RexsterTitanServer.<init>(RexsterTitanServer.java:55)
        at com.thinkaurelius.titan.tinkerpop.rexster.RexsterServerClientTest.setUp(RexsterServerClientTest.java:49)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        ...
Caused by: java.lang.ClassNotFoundException: com.yammer.metrics.Metric
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        ...

I then bumped my local Rexster Metrics version forward to 3.0.0-BETA3 and changed all the com.yammer.metrics references to com.codahale.metrics. Now the test passes. When I build titan-all, the resulting distribution zipfile contains both lib/metrics-core-3.0.0-BETA3.jar and lib/metrics-core-2.0.3.jar. There could still be a pitfall that I haven't found...
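One way to look for that kind of pitfall is to probe which Metrics package lines are actually loadable at runtime. The sketch below is a stdlib-only helper; the class name `MetricsClasspathCheck` is made up for illustration, and the two probed class names (`com.yammer.metrics.Metrics` for 2.x, `com.codahale.metrics.MetricRegistry` for 3.x) are assumptions about representative classes in each jar.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for checking which Metrics packages are visible on the classpath.
final class MetricsClasspathCheck {
    // True if the named class can be loaded, without running its static initializers.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, MetricsClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Report which Metrics package lines are loadable (probe class names are assumptions).
    static Map<String, Boolean> probe() {
        Map<String, Boolean> result = new LinkedHashMap<>();
        result.put("com.yammer.metrics (2.x)", isPresent("com.yammer.metrics.Metrics"));
        result.put("com.codahale.metrics (3.x)", isPresent("com.codahale.metrics.MetricRegistry"));
        return result;
    }

    public static void main(String[] args) {
        probe().forEach((k, v) -> System.out.println(k + ": " + (v ? "present" : "missing")));
    }
}
```

Run against the unpacked distribution's lib directory on the classpath, both entries would be expected to report present if the two jars coexist as described.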

zachkinstner commented 11 years ago

@dalaro, I followed your setup, and I can confirm that RexsterTitanServer starts up without the error I was experiencing before. I haven't done any further investigation, however.

Also: bonus vocab points for "chicanery" :+1:

jwtodd commented 11 years ago

i'm a huge fan of yammer-metrics [ http://metrics.codahale.com/ ]

i've instrumented apps with the library before which includes one-line hooks to publish said metrics to graphite/etc.

not sure if this is too low level or not, but can't speak highly enough about this metrics lib.

been meaning to work more with titan ... if a decision is to go w/ yammer-metrics, i'd most assuredly lend a hand :)

best,

zachkinstner commented 11 years ago

With the following configuration, metrics are being displayed in the console. I get an empty result with the HTTP /metrics page, and nothing seems to be sent to Graphite.

<metrics>
    <reporter>
        <type>console</type>
        <properties>
            <properties>
                <rates-time-unit>SECONDS</rates-time-unit>
                <duration-time-unit>SECONDS</duration-time-unit>
                <report-period>10</report-period>
                <report-time-unit>SECONDS</report-time-unit>
                <excludes>http.rest.*</excludes>
            </properties>
        </properties>
    </reporter>
    <reporter>
        <type>http</type>
        <properties>
            <properties>
                [...same...]
            </properties>
        </properties>
    </reporter>
    <reporter>
        <type>graphite</type>
        <properties>
            <hosts>{my_graphite_server_internal_ip}:2003</hosts>
            <prefix>fabric</prefix>
            <properties>
                [...same...]
            </properties>
        </properties>
    </reporter>
</metrics>

The log says the HTTP and Graphite modes are being configured, at least.

13/05/18 03:38:00 INFO metrics.HttpReporterConfig: Configured HTTP Metric Reporter.
13/05/18 03:38:00 INFO metrics.ConsoleReporterConfig: Configured Console Metric Reporter.
13/05/18 03:38:00 INFO metrics.GraphiteReporterConfig: Configured Graphite Metric Reporter [<my_internal_graphite_ip>:2003].
13/05/18 03:38:00 INFO metrics.GraphiteReporterConfig: Enabling GraphiteReporter to <my_internal_graphite_ip>:2003

zachkinstner commented 11 years ago

My Titan metrics branch replicates the ReporterConfig setup from com.tinkerpop.rexster.Application. I'm not sure if additional setup is required on the Titan side to make metrics work.

The console metrics output doesn't respect my <excludes>http.rest.*</excludes> setting.
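For reference, this is the behavior I would expect from the excludes setting, sketched with plain `java.util.regex`. It assumes the value is treated as a Java regular expression matched against the full metric name, which I have not confirmed; `MetricNameFilter` is a hypothetical class, not part of Rexster or Metrics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical filter illustrating the expected effect of an excludes pattern.
final class MetricNameFilter {
    private final Pattern excludes;

    // Assumption: the excludes value is a Java regex applied to the whole metric name.
    MetricNameFilter(String excludesRegex) {
        this.excludes = Pattern.compile(excludesRegex);
    }

    // A metric is reported only if its full name does not match the excludes pattern.
    boolean accept(String metricName) {
        return !excludes.matcher(metricName).matches();
    }

    // Keep only the names that pass the filter, preserving order.
    List<String> filter(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String n : names) {
            if (accept(n)) {
                kept.add(n);
            }
        }
        return kept;
    }
}
```

Under that assumption, a filter built from `http.rest.*` would drop a name like http.rest.vertices.extension.get and keep a name like rexpro.sessions.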

zachkinstner commented 11 years ago

In the commit referenced above, I was able to get the HTTP metrics to work by making ReporterConfig setup occur before RexProRexsterServer and HttpRexsterServer creation. My http.rest.* excludes are still ignored (HTTP shows items like http.rest.vertices.extension.get), and my Graphite server isn't showing any data. It is entirely possible my Graphite server/setup is flawed.
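One way to check the Graphite side in isolation is to send a single data point over Graphite's plaintext protocol (one "path value timestamp" line per metric, port 2003 as in the config above). The sketch below is stdlib-only; `GraphiteProbe`, the metric path, and the timeout are hypothetical choices for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical connectivity probe; not part of Rexster, Titan, or Metrics.
final class GraphiteProbe {
    // Build one line of the plaintext protocol: "<path> <value> <epoch seconds>\n".
    static String formatLine(String path, double value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds + "\n";
    }

    // Try to send a single line; returns false if the connection or the write fails.
    static boolean send(String host, int port, String line, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            OutputStream out = socket.getOutputStream();
            out.write(line.getBytes(StandardCharsets.UTF_8));
            out.flush();
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

If `send` returns false for the configured host and port, the problem is likely on the Graphite or network side rather than in the reporter configuration.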

zachkinstner commented 11 years ago

It is entirely possible my Graphite server/setup is flawed.

"Flawed" was an understatement :flushed:. Once I started Graphite, things are working well. In my own defense, I've run this command before, but must have rebooted the server... or something...

spmallette commented 11 years ago

@zachkinstner thanks for proving out the theory for integration.

@jwtodd i liked working with metrics. definitely found it intuitive and easy to introduce.

zachkinstner commented 11 years ago

No problem... I'm happy to help, and even happier to have my Lego pieces fitting together.

I was a little concerned about the changes in the latest commit. I had to move RexProRexsterServer and HttpRexsterServer instantiation out of the RexsterTitanServer constructor and into the start method instead. It seems to work fine, and I suppose it shouldn't cause an issue as long as start is only called once. The root cause was that ReporterConfig setup needed to occur first.
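If repeated start calls ever become a concern, a guard along these lines would make the second call a no-op. This is only a sketch under that assumption: `OnceStartedServer` and its step names are hypothetical and do not reflect the actual RexsterTitanServer code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a server wrapper whose startup must run at most once.
final class OnceStartedServer {
    private final AtomicBoolean started = new AtomicBoolean(false);
    private final List<String> steps = new ArrayList<>();

    // Returns true if this call performed the startup, false if it had already happened.
    boolean start() {
        if (!started.compareAndSet(false, true)) {
            return false;
        }
        steps.add("configure reporters");   // reporter setup has to come first
        steps.add("create RexPro server");  // servers are created in start, not in the constructor
        steps.add("create HTTP server");
        return true;
    }

    // Copy of the startup steps performed so far, in order.
    List<String> steps() {
        return new ArrayList<>(steps);
    }
}
```

The `compareAndSet` check lets only the first caller run the startup sequence, so the reporter setup and server creation cannot be repeated by a later call.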

dalaro commented 11 years ago

This is going into 0.4.0 and documented here: https://github.com/thinkaurelius/titan/wiki/Titan-Performance-and-Monitoring

Additional work could be done, like instrumenting more of Titan's internals or adding new reporters, but I think the current master and wiki doc are a good breaking point for this issue.