Define DSL for analysis and query span data

pavolloffay commented 4 years ago

Created based on https://github.com/jaegertracing/jaeger/issues/1639#issuecomment-534097232.

Define domain-specific language (DSL) for analysis and query Span data. An example from Facebook's canopy system:

65429369-5a787000-de16-11e9-8ace-b9b25bfb1f61

The library should be able to connect to any Span source - jaeger query, json file, storage.

DSL in Canopy https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/

cc @jaegertracing/data-analytics

jpkrohling commented 4 years ago

Would it make sense to start by implementing GraphQL (#169)?

pavolloffay commented 4 years ago

@yurishkuro I am wondering do we want to make the library to work in a distributed way (like spark RDD)?

In the previous discussions, we have also mentioned that we could reuse existing graph traversal frameworks e.g. Gremlin. I am not sure if GraphQL #169 provides the same capabilities or it is used only for UI integrations.

pavolloffay commented 4 years ago

We should also think about use cases this feature would solve:

users could ask the system more complicated questions e.g. - test a hypothesis
the query could be periodically run and invoke an action (alert) if conditions are met

pavolloffay commented 4 years ago

There are two popular graph query languages - Gremlin (https://tinkerpop.apache.org/gremlin.html) and Cypher (https://neo4j.com/developer/cypher-basics-i/)

Gremlin

supports multiple languages: groovy, java, python
integrated with Apache Spark - http://tinkerpop.apache.org/docs/current/reference/#spark-plugin (java), python?
Apache Flink integration with Gelly https://flink.apache.org/news/2015/08/24/introducing-flink-gelly.html, https://medium.com/@matgug/part-2-twitter-graph-analysis-with-gremlin-31550f9d134, I could not find more resources

Cypher

supports multiple languages: java, python (pypher https://neo4j.com/blog/express-cypher-queries-pure-python-pypher/), https://github.com/Wolfgang-Schuetzelhofer/jcypher/wiki/Domain%20Mapping
integrated with Apache Spark https://github.com/opencypher/morpheus (scala). It will be included in Spark 3.0 https://neo4j.com/blog/spark-developers-include-cypher-spark-3-0/
Apache Flink integration https://community.neo4j.com/t/about-the-stream-processing-category/33

pavolloffay commented 4 years ago

I am not sure how we could use this query language without backed supporting it. To use gremlin we would have provide a gremlin compatible layer to allow query execution. @jaegertracing/data-analytics @yurishkuro any ideas?

Maybe running the query on the subset of data directly in-memory would work.

annanay25 commented 4 years ago

Would it be easier if we could curate traces from a (relatively) complex system that someone in the community is running in production and would volunteer to publish? It would move focus from data collection to actual analysis and it would also help different teams collate and confirm results while working on the same data-set.

Didn't dig very deep but seem relevant - https://github.com/google/cluster-data

yurishkuro commented 4 years ago

@pavolloffay there are several parts to the DSL/library:

1. a way to define a stream of traces

This may include:

loading traces from files
loading traces from a Kafka topic (e.g. start 10,000 messages back) that contains spans.

In case of a source providing just spans, there needs to be a pre-aggregation step that assembles them into traces. This creates interesting challenges when done on a live stream as opposed to historical data, since on historical data we can simply group-by, while with a live stream we need to use window aggregation

The output of the first step is RDD-like stream of traces.

2. Filtering step

This is where the first part of the DSL comes in - how to express a query on a trace, when trace is represented as a graph. Joe's proposal didn't really address the graph nature of the trace, only filtering conditions on individual spans (which could also be a valid use case).

3. Evaluation / feature extraction step

The second part of the DSL - expressing feature extraction computation on the graph, like the Facebook's Canopy example above. Note an interesting thing in that example - it operates on a trace almost like on a flat collection of spans. They probably have expressions that can walk the graph, like $node->parent, but they didn't show it in the public talks.

I think the minimum DSL we need is just an ability to walk the in-memory representation of the trace as graph (i.e. for n in node.childen ...) and extract data (e.g. span.operationName, span.tag['key']`). The actual evaluations can be normal programs, in case of the filtering step returning boolean.

In other words, what we need is just a data model, and maybe some simple helper functions for finding things, like browser_thread = trace.execution_units[attr.name == 'client'], which is really, in generic sense, is func (t *Trace) findSpans(predicate func(*Span) bool) []*Spans. Helpers can actually come later, as long as we have the data model people can write them themselves initially.

pavolloffay commented 4 years ago

I have started defining DSL with gremlin in https://github.com/pavolloffay/jaeger-tracedsl

Here is an example from app class https://github.com/pavolloffay/jaeger-tracedsl/blob/master/src/main/java/io/jaegertracing/dsl/gremlin/App.java

    TraceTraversalSource traceSource = graph.traversal(TraceTraversalSource.class);
    GraphTraversal<Vertex, Vertex> spans = traceSource
        .hasTag(Tags.SPAN_KIND.getKey(), Tags.SPAN_KIND_CLIENT)
        .duration(P.gt(100));

    for (Vertex v : spans.toList()) {
      System.out.println(v.label());
      System.out.println(v.property(Keys.OPERATION_NAME).value());
      System.out.println(v.keys());
    }

You can see how the filtering and extraction look like. The API allows to use trace DSL but also core gremlin API at the same time. This is a simple example but it should be possible to do things like:

determine if two spans are connected graph.connected(tagsSpan1, tagSpan2)
process distance between two spans
distribution of service/process depth/breadth

Any suggestions are welcome. My next step would be:

more complicated filtering methods outlined above
simplify extraction/iteration (children(), root()...)
graph creation API - from file (downloaded from UI?), jaeger-query, directly from storage

annanay25 commented 4 years ago

In case of a source providing just spans, there needs to be a pre-aggregation step that assembles them into traces. This creates interesting challenges when done on a live stream as opposed to historical data, since on historical data we can simply group-by, while with a live stream we need to use window aggregation

@yurishkuro - Is this aggregator component available in open source?

yurishkuro commented 4 years ago

@annanay25 there's not much to it: https://github.com/PacktPublishing/Mastering-Distributed-Tracing/blob/master/Chapter12/src/main/java/tracefeatures/SpanCountJob.java#L55

pavolloffay commented 4 years ago

I have made some progress in my repository. The repository so far contains:

Gremlin trace DSL - defined methods for easier filtering and iteration over graph (extraction) Examples with gremlin - e.g. find a span with given properties, are two spans connected? What is the distance between two spans? What is the maximum depth of trace (based on spans not services)? Spark streaming with Kafka connector. It reads kafka topic in intervals, groups by traceids, creates graph for each trace and extracts max deph of the trace and prints it to stdout.

The next steps are:

publish extracted features to another kafka topic and get them to Prometheus.
wrap the code to jupyterlab notebook
get a trace query REST API
write a blog post
move span proto to IDL repository and build to java and python, we should consider publishing model classes to maven/pip..
publish graph DSL as a library
make it easy to deploy jupyterlab on k8s and connect to kafka
create a distribution (like spark-dependencies) with models/metrics which prove to be useful.

pavolloffay commented 4 years ago

It would be great if somebody could help with moving protos to IDL and make configure build process to different languages https://github.com/jaegertracing/jaeger/issues/1213.

jaegertracing / jaeger