jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

Use Protobuf and gRPC for internal communications #773

Open yurishkuro opened 6 years ago

yurishkuro commented 6 years ago

Problem

At the moment we have the following data models handled by the backend, often with two-way transformers between them:

We also have limitations on routing traffic between agents and collectors because they use TChannel as the communication protocol.

Proposed Solution

Converge on a Protobuf 3 Model

By converging on a single Protobuf model we can consolidate some of these models, reduce the number of converters, and unblock / enable several other work streams:

Use gRPC with Protobuf model

User-Facing Deliverables

Phase 1 - MVP

Goal: deprecate TChannel.

Phase 2

Phase 3

Implementation notes

We've talked about this a lot, but there are many moving pieces that must come together, so this is the meta issue to list them all out, in random order.

Implementation plan

isaachier commented 6 years ago

I'm wondering if you need the JSON at all. You might be able to use JavaScript Protobuf on the browser too. Not saying JSON isn't simpler, just an idea if it isn't straightforward.

yurishkuro commented 6 years ago

For Jaeger UI we might be able to use https://github.com/grpc/grpc-web. But I still think we need native JSON support, if only for jaeger-client-javascript to be able to report traces from the browser. I don't think it would be a good idea to insist on having grpc-web in the browser applications as a dependency, given that JSON is a lot more natural and easily available (possibly for other languages as well).

jpkrohling commented 6 years ago

There's gRPC Gateway, which can take care of providing transparent REST+JSON support: https://github.com/grpc-ecosystem/grpc-gateway
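As an illustration of how that works, here is a hedged sketch (the service name, request/response types, and URL path are all made up for this example, not Jaeger's actual API): grpc-gateway reads a `google.api.http` option on each RPC and generates a reverse proxy that translates REST+JSON calls into gRPC.

```protobuf
// Illustrative sketch only. The http option below tells grpc-gateway
// to expose this RPC as GET /api/traces/{trace_id} with a JSON body.
import "google/api/annotations.proto";

service QueryService {
  rpc GetTrace(GetTraceRequest) returns (GetTraceResponse) {
    option (google.api.http) = {
      get: "/api/traces/{trace_id}"
    };
  }
}
```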

tiffon commented 6 years ago

Re jaeger-client-javascript, using protobufs instead of JSON is an option:

https://github.com/dcodeIO/protobuf.js

Has broad browser compatibility: http://dcode.io/protobuf.js/#compatibility

isaachier commented 6 years ago

So I briefly looked into this last night and the consensus was that gzipped JSON will almost always beat Javascript protobuf.

tiffon commented 6 years ago

consensus was that gzipped JSON will almost always beat Javascript protobuf.

@isaachier Can you refer to relevant material? And, beat in what sense, speed, size, etc?

Also, here are some perf benchmarks from dcodeIO/protobuf.js (in node, not the browser). There is a very significant difference between the Google implementation and dcodeIO's. Which implementation was used in your investigation?

benchmarking combined performance ...

protobuf.js (reflect) x 275,900 ops/sec ±0.78% (90 runs sampled)
protobuf.js (static) x 290,096 ops/sec ±0.96% (90 runs sampled)
JSON (string) x 129,381 ops/sec ±0.77% (90 runs sampled)
JSON (buffer) x 91,051 ops/sec ±0.94% (90 runs sampled)
google-protobuf x 42,050 ops/sec ±0.85% (91 runs sampled)

   protobuf.js (static) was fastest
  protobuf.js (reflect) was 4.7% ops/sec slower (factor 1.0)
          JSON (string) was 55.3% ops/sec slower (factor 2.2)
          JSON (buffer) was 68.6% ops/sec slower (factor 3.2)
        google-protobuf was 85.5% ops/sec slower (factor 6.9)
isaachier commented 6 years ago

OK consensus might be an overstatement. See this article: https://blog.wearewizards.io/using-protobuf-instead-of-json-to-communicate-with-a-frontend.

isaachier commented 6 years ago

For the record, I wouldn't use node as a good indicator of frontend performance. Node C++ extensions can easily add speedups that aren't available in the browser.

yurishkuro commented 6 years ago

I believe we need to support JSON format for the following reasons:

yurishkuro commented 6 years ago

Regarding Go rendering of duration in JSON (e.g. 1666h40m, 0.016000013s) - another alternative is to record end timestamp instead of duration. Pros:

Cons:

isaachier commented 6 years ago

I wouldn't worry too much about the size of the span on the wire. As long as it doesn't hurt performance somewhere, people generally don't care.

isaachier commented 6 years ago

Fun fact:

Proto3 supports a canonical encoding in JSON, making it easier to share data between systems. The encoding is described on a type-by-type basis in the table below.

https://developers.google.com/protocol-buffers/docs/proto3#json

tiffon commented 6 years ago

For duration, why not just use nanoseconds?

yurishkuro commented 6 years ago

Nanoseconds are fine in 64-bit, but JavaScript only understands 53-bit integers, and that's only about 104 days in nanos - an uncomfortably low ceiling. The trade-off is we either render the 64-bit duration to JSON as a string, e.g. 1666h40m0.016000013s, or we replace duration with end_timestamp.

yurishkuro commented 6 years ago

I suppose another option would be to render int64 as a string in JSON, to avoid precision loss in Javascript.

tiffon commented 6 years ago

If this is our canonical storage model, too, end timestamps would make it more difficult to do duration based queries.

What about storing the duration as milliseconds as a float?

tiffon commented 6 years ago

FYI, milliseconds as a float is the approach taken by the Performance API.

isaachier commented 6 years ago

I'm wondering if we'd care about the precision for that. Otherwise, struct timespec is probably the underlying implementation in all of these languages, so we could copy it to Protobuf.

yurishkuro commented 6 years ago

That struct is a regular nanos timestamp. It's already supported by protobuf.

yurishkuro commented 6 years ago

Fwiw, average span size in our deployment is between 400-500 bytes (in thrift), so we'd be adding 1% overhead on the data size by switching from duration to timestamp.

simonpasquier commented 6 years ago

My 2 cents from Prometheus experience.

We probably want to use gogoprotobuf instead of the default protobuf because gogo allows generation of the data model without too many pointers, e.g. []model.KeyValue instead of []*model.KeyValue (gist)
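A minimal sketch of the annotation (message and field names are illustrative, not Jaeger's actual model):

```protobuf
// Sketch only: with (gogoproto.nullable) = false, protoc-gen-gogo
// generates the field as []KeyValue instead of []*KeyValue, avoiding
// a pointer (and a heap allocation) per element.
syntax = "proto3";

import "github.com/gogo/protobuf/gogoproto/gogo.proto";

message Span {
  repeated KeyValue tags = 1 [(gogoproto.nullable) = false];
}

message KeyValue {
  string key = 1;
  string v_str = 2;
}
```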

Prometheus uses gogoprotobuf with no issue.

https://github.com/grpc-ecosystem/grpc-gateway seems to work, apparently can even generate swagger

Prometheus also uses grpc-gateway for its v2 API. The API is experimental and provided without stability guarantee but it is working fine and indeed there's even a Swagger documentation generated from the Protobuf definition. More details in this Google doc.

tiffon commented 6 years ago

Seems the main benefits of using floats to store the duration in milliseconds are simplicity, ease of supporting duration-based queries, closer semantic alignment with the property we're actually interested in storing, and that string parsing is not required in any context.

johanbrandhorst commented 6 years ago

KeyValue.VType=STRING and SpanRef.Type=CHILD_OF are omitted by default (as zero values)

This is configurable in the marshaler. I would recommend using OrigName as well for snake_case variable names.

Note that committing to a JSON representation limits your options for backwards compatible changes (no field renaming). This is a subtle difference from usual proto3 forward/backward compatibility benefits.

Falco20019 commented 6 years ago

Btw, gRPC allows client-side load balancing by default. Just use GRPCLB. The default discovery uses DNS naming and points at the load balancer.

yurishkuro commented 6 years ago

I have the proto model defined in the gogoproto branch with gogo extensions, and a sample grpc-gateway that exposes the gRPC service as a REST API. I also took a stab at using the proto model as the domain model (gogoproto-model branch). As expected, it breaks a bunch of unit tests that were using a different serialization format for the fixtures, but it also breaks the build for binaries due to the following issues:

johanbrandhorst commented 6 years ago

@yurishkuro note that if the JSON representation is very important to be in a specific format, you can use the google.protobuf.Struct from the struct.proto Well Known Type. These are meant to help with JSON schema design. It obviously compromises your proto types a little, but it can be done.

johanbrandhorst commented 6 years ago

If adding a new WellKnownType to your protofile, you'll need to adjust the protoc generation steps; you'll need another -Mgoogle/protobuf/struct.proto:github.com/gogo/protobuf/types to each x_out invocation.
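An illustrative invocation, assuming a single model.proto compiled with the gogo plugin (all file and package paths here are made up). Note that inside a *_out parameter the mapping is written with `=` rather than `:`:

```shell
# Illustrative only: paths are assumptions, not Jaeger's actual build.
# The M option maps the struct.proto WKT import to gogo's types
# package for every *_out invocation that emits Go code.
protoc \
  -I. \
  --gogo_out=plugins=grpc,Mgoogle/protobuf/struct.proto=github.com/gogo/protobuf/types:./model \
  model.proto
```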

johanbrandhorst commented 6 years ago

I would at the same time consider replacing the Makefile with prototool :smile:.

bufdev commented 6 years ago

I recommend against using grpc-gateway for this purpose. Contact me offline for alternatives.

johanbrandhorst commented 6 years ago

@peter-edge mind sharing your rationale? I think the grpc-gateway is an increasingly popular choice for this kind of thing and would love any insights you might have.

yurishkuro commented 6 years ago

@jaegertracing/jaeger-maintainers I've reached an indecision point with #856. I am fairly happy with gogo-generated classes from the proto definition and can use them as a replacement for the domain model, which is great as it removes one (almost always present) conversion step. But I am a bit stuck with JSON serialization. Currently I have custom marshalers that serialize trace-id and span-id as hex strings (even though internally trace-id is a struct of two uint64's and span-id is uint64). I had one remaining todo to figure out serialization of the duration field. But what I don't like is that while I can do all those custom serializations in Go, they are not as easy, if at all possible, to reproduce in other languages. It would be much simpler to just let protobuf do its default JSON serialization (proto3-json), which should be compatible across languages. It would also have better compatibility long term with tools like grpc-gateway, swagger, etc. because the JSON would match the proto definition (currently trace-id doesn't). Specifically, these would be the differences from the current jaeger ui JSON:

The base64 encoding of trace/span-id is actually the most problematic because everywhere else we're using hex strings.

bufdev commented 6 years ago

I highly recommend using the protobuf JSON mapping - not only is it repeatable cross-language with each protobuf implementation, there are actually some edge cases where correctness is an issue.

isaachier commented 6 years ago

Isn't the JSON only significant for UI? If Javascript is the only problematic language, then we would just need to find a solution to that. Any custom encoding/decoding sounds like something worth avoiding.

yurishkuro commented 6 years ago

@isaachier we are trying to define the official, long term API, which in theory could be consumed by other tools, not just the Jaeger UI. But at the moment it is just the UI, and in fact in #856 I haven't changed its custom representation that's different from the domain model.

BTW, agreeing to use bytes for IDs with base64 encoding makes the model similar to zipkin and census proto models. It's also more easily extensible for longer IDs if needed. I can still use a custom type like type TraceID [16]byte to avoid memory allocations in the domain model.

Falco20019 commented 6 years ago

The problem with default deserialization of durations and timestamps in JSON is that some languages use ms and some use ns as the unit... I often had problems between Java (serializing ns) and C# (serializing ms), and also made a PR for RestSharp a year ago to accept both. Guessing works somewhat fine for timestamps but not so well for durations. The string is less error-prone in that specific case.

This would also allow the JSON data to be smaller. For example, to encode a duration of one day, you would have 86400000 as a ms value or 1d as a string.

Also, in proto3, enums should either have the regular default case as the 0 value or use Undefined when there is no good default. That also matters for performance, since the proto wire format omits default values. So I can see that making sense for SpanRef.Type=CHILD_OF, but less so for KeyValue.VType=STRING. That's also why I would prefer to use oneof for KeyValue: the enum would be implicit and never has to be sent as a separate vType value, since the type is already implied by which field is present. So if KeyValue.VStr is set, you already know that KeyValue.VType has to be STRING. No need to transfer that over the wire.
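A sketch of the oneof shape being described (field names and numbers are illustrative):

```protobuf
// Sketch only: with oneof, exactly one value field can be set, so the
// value type is implied by which field is present and no separate
// v_type enum needs to go over the wire.
message KeyValue {
  string key = 1;
  oneof value {
    string v_str     = 2;
    bool   v_bool    = 3;
    int64  v_int64   = 4;
    double v_float64 = 5;
    bytes  v_binary  = 6;
  }
}
```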

We have that in the current tchannel-based implementation, but gRPC doesn't provide it out of the box (or does it?) Have a look at grpc/load-balancing.md and the gRPC load balancing blog entry. AFAIK client-side load balancing right now supports round-robin or GRPCLB. Name resolution is mostly done using DNS, but some languages expose interfaces for defining your own resolver. Sadly not C# yet...
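A hedged sketch of what the client-side setup looks like in Go, assuming google.golang.org/grpc (the target address is a made-up example, and this is not Jaeger's actual wiring):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// Dial collectors through the dns:/// resolver and spread RPCs
	// round-robin across all resolved addresses. The address below
	// is a hypothetical example.
	conn, err := grpc.Dial(
		"dns:///jaeger-collector.example.com:14250",
		grpc.WithInsecure(),
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```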

yurishkuro commented 6 years ago

That's also for performance, since in the proto-wire format, the default values are omitted to improve performance.

I actually liked that aspect that both CHILD_OF and STRING are not included in the JSON since they occur in 99% of cases and are very good defaults.

isaachier commented 6 years ago

Regarding snake case: https://developers.google.com/protocol-buffers/docs/style.

Falco20019 commented 6 years ago

I actually liked that aspect that both CHILD_OF and STRING are not included in the JSON since they occur in 99% of cases and are very good defaults.

Ok, the initial post sounded like you wanted to get rid of that behaviour (stating some could be addressed via custom renderers). I also like avoiding useless data transfers.

@isaachier Thanks for the link to the style guide. Totally forgot that it's also explicitly documented :)

bufdev commented 6 years ago

You'll probably want to follow the Uber Style Guide, enforced by Prototool :-) https://github.com/uber/prototool/blob/dev/etc/style/uber/uber.proto you can think of it as "strict mode" as compared to the above.

yurishkuro commented 6 years ago

For reference: Docker images for protoc compiler https://github.com/namely/docker-protoc

pavolloffay commented 6 years ago

I am working on this now. I want to implement a minimal working solution using gRPC communication between agent and collector and get it merged to master as experimental.

So far I have rebased @isaachier PR https://github.com/jaegertracing/jaeger/pull/1042 and added proto for sampling. I will be submitting PR shortly.

I will be posting my notes here on what should be done afterward:

yurishkuro commented 6 years ago

added proto for sampling

Would it help move this along if we started merging smaller pieces into master, e.g. my branch & Isaac's PR? I think the tests were the outstanding item.

pavolloffay commented 6 years ago

I think it would break me right now. I will submit a PR to master based on those two. Now fixing the coverage.

isaachier commented 6 years ago

Hey @pavolloffay. Glad to see this finally happening! Please let me know if you need any help here.

pavolloffay commented 5 years ago

I am wondering whether we should rename --collector.grpc-port to something else, like --collector.grpc.port, or even scope it by service, e.g. batches (though that seems odd). The question is whether we want to expose more services on the same port or use different gRPC ports for different services. The first comment also includes:

Combine gRPC and HTTP handlers on the same port

cc @yurishkuro @jpkrohling

yurishkuro commented 5 years ago

why do we need "host"?

pavolloffay commented 5 years ago

My bad, there should not be a host. The most important thing is scoping it under .grpc.

yurishkuro commented 5 years ago

For now I suggest keeping it simple without scoping, and serving multiple services on the same port.

annanay25 commented 5 years ago

For the query service, would it be better to multiplex the gRPC and HTTP handlers on the same port, as compared to hosting two separate endpoints?

yurishkuro commented 5 years ago

yes, it is mentioned in the implementation plan