Support using Confluent Schema Registry to serialize traces

jeqo commented 4 years ago

Include a module to take incoming spans and serialize them in a format supported by Confluent Schema Registry (e.g. potentially Avro), to enable the Kafka ecosystem to consume tracing data (e.g. ksqldb, kafka connectors, etc.). Although both ksqldb and kafka connectors support plain JSON by default, using a binary-format is preferred for I/O efficiency.

Users will need to opt-in, as this will require additional infra (Schema Registry) to be available.

This serialization is meant to be used in the processing pipeline, from collectors downstream. clients reporting spans won't need to connect to SR to send spans to Kafka.

codefromthecrypt commented 4 years ago

note: schema registry is confluent license I think, not apache, iiuc. That said we also make an optional mysqldb image which is also not ASL, though it is not constrained commercially like mongo and confluent license are. It is a good idea to make something like this opt-in, knowing this, even if the commerical license doesn't effect non-vendors at all.

Ddoes schema registry have any plan for another format besides avro? If avro are you thinking a port of our proto or json? One of the tricks on SR is also the bootstrapping order is more when this is introduced.

PS here's a docker file that could be ported as the default image is massive https://github.com/hypertrace/schema-registry/tree/main/docker

codefromthecrypt commented 4 years ago

Also I think if I understand it right.. some products who use SR have the option not to, as long as they have a copy of the schema. I think pinot allows this. For example, you can supply a copy of the schema which still allows avro for a format like ours which is stable.. just a thought as not sure which products allow what and the ordering thing is really a bit of a pain, as we already have to bootstrap topics. For example, docker-compose we can be ok bootstrapping things with some conditions games, but compose format 3 takes more of those away, sadly

codefromthecrypt commented 4 years ago

personally keen on at least the avro part as an option (even knowing the format is higher overhead), as like you said certain features would be available which otherwise aren't. We could make a port of maybe the proto format into avro here then promote it to zipkin-api if becomes something others also want to use. Maybe if you can add to the description what using this gains them as it might not be obvious what each of the products that depend on it could do in context of trace processing. Ex an example or few.

cc @cwensel @narayaruna not sure if you all use avro at all or not..

jeqo commented 4 years ago

@adriancole about licensing: IIUC SR clients are Apache licensed https://github.com/confluentinc/schema-registry/blob/master/LICENSE, only SR server is confluent licensed.

Ddoes schema registry have any plan for another format besides avro? If avro are you thinking a port of our proto or json? One of the tricks on SR is also the bootstrapping order is more when this is introduced.

Current version supports also protobuf and json schema https://www.confluent.io/blog/confluent-platform-now-supports-protobuf-json-schema-custom-formats/

Maybe can be easier to start with any of these 2 and integrate with schema registry. As most sites use Avro from previous versions, I thought about starting with it. cc @jcchavezs as also interested on this.

I'm not that worried about ordering as SR clients can auto create schemas when starting to produce—if that's what you mean.

codefromthecrypt commented 4 years ago

I see.. is there a site who's interested in this specifically for zipkin? If not, might be less work and less special to experiment with a format we already support until a user wants this.

Though regardless, it is ok for you to experiment with whatever you feel like and avro is indeed popular albeit less efficient and the whole idea of schema is not hugely important in zipkin as we almost never change fields.

One thing I noticed in other work was the act of using schema adds overhead as the things that do that part aren't particularly efficiently written often enough. Ex double-buffering json problem is an additional cost on top of whichever validations. Also, our schema has very little to validate as only traceID and spanId are required fields...

jcchavezs commented 4 years ago

A couple of comments:

Also I think if I understand it right.. some products who use SR have the option not to, as long as they have a copy of the schema.

Yes this can be done but it is always under the assumption that the schema WON'T change. Remember when using S-R what you are basically doing is instead of passing the schema along with the message, you pass the ID (provided by S-R) so you can just ignore the ID and use your local schema but then that is always relying the schema won't change otherwise consumer is not going to be able to process it.

For example, docker-compose we can be ok bootstrapping things with some conditions games, but compose format 3 takes more of those away, sadly

Agree with this, bootstraping this things specially in local can become a pain so making this optional would be good.

We could make a port of maybe the proto format into avro here then promote it to zipkin-api if becomes something others also want to use.

I can volunteer to do this as I am interested on the schema. Should I just open a PR to https://github.com/openzipkin/zipkin-api? I guess it would be also interesting if we provide the POJOs or is that too much?

codefromthecrypt commented 4 years ago

I can volunteer to do this as I am interested on the schema. Should I just open a PR to https://github.com/openzipkin/zipkin-api? I guess it would be also interesting if we provide the POJOs or is that too much?

I would start here with impl but there with issue so the latter can accumulate rule of tres

openzipkin-contrib / zipkin-storage-kafka

Support using Confluent Schema Registry to serialize traces #74