open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Add AVRO serialization to existing encoding extension #33006

Open TorsteinOtterlei opened 6 months ago

TorsteinOtterlei commented 6 months ago

Component(s)

extension/encoding/avrologencoding

Is your feature request related to a problem? Please describe.

This PR: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31923 added Avro deserialization so that, for example, the kafka receiver can consume and decode Avro log records.

However, for integrating the Collector with existing systems, it would be useful for this extension to support serialization as well, so that the Collector can emit logs in Avro format. All of the other encoding extensions for logs support both deserialization and serialization, so it would be natural for the Avro one to do the same.

It would also be nice to support embedding an Avro schema ID into the log records, which would be a fairly easy addition.

Describe the solution you'd like

extensions:
  avro_log_encoding:
    schema_id: 3
    schema: |
      {
        "type" : "record",
        "namespace" : "example",
        "name" : "Datapoint",
        "fields" : [
          { "name" : "Key" , "type" : "string" },
          { "name" : "Value" , "type" : "int" }
        ]
      }

exporters:
  kafka:
    encoding: avro_log_encoding
    # ... other configuration values
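For reference, the usual way a static schema ID is embedded (e.g. in the Confluent wire format, which fluent bit's Avro output follows) is a zero magic byte followed by the 4-byte big-endian schema ID, prefixed to the Avro-encoded payload. A minimal sketch in Go; the function name is illustrative, not part of the extension:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// prependSchemaID wraps an Avro-encoded payload in the Confluent
// wire format: magic byte 0x00, then the schema ID as a 4-byte
// big-endian integer, then the Avro payload itself.
func prependSchemaID(schemaID uint32, avroPayload []byte) []byte {
	out := make([]byte, 0, 5+len(avroPayload))
	out = append(out, 0x00)                            // magic byte
	out = binary.BigEndian.AppendUint32(out, schemaID) // schema ID
	return append(out, avroPayload...)
}

func main() {
	msg := prependSchemaID(3, []byte{0x06, 0x66, 0x6f, 0x6f})
	// first byte is 0x00, next four are the big-endian schema ID,
	// the rest is the untouched Avro payload
	fmt.Printf("%x\n", msg)
}
```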

Describe alternatives you've considered

For the schema ID functionality, it would of course be better to support schema registries. However, that would be a bigger change, and packages such as https://github.com/riferrei/srclient are not very actively maintained (and are essentially just wrappers around goavro, which we already use). I think it is sufficient to support a static schema ID like fluent bit does, and perhaps implement registry support later.

Additional context

The original issue for adding the avro extension: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/21067

I am aware that the extension/encoding modules are not ready to be used with all receivers or exporters, but it would be nice to add this functionality so that it is ready.

github-actions[bot] commented 6 months ago

Pinging code owners:

thmshmm commented 6 months ago

Hey,

I already started on this some time ago:

https://github.com/open-telemetry/opentelemetry-collector-contrib/compare/main...thmshmm:opentelemetry-collector-contrib:feature/avrologencodingextension-serializer

I wanted to do some more testing, but encoding extension support in the Kafka receiver/exporter still needs to be developed.

TorsteinOtterlei commented 6 months ago

Awesome! I tried adding it myself just to test: https://github.com/TorsteinOtterlei/opentelemetry-collector-contrib/pull/1

Would you consider copying my schemaID embedding into your implementation?

thmshmm commented 5 months ago

Sure, I'll have a look. I started working on the extension support in the receiver, and the exporter is next, so it can actually be used. One thing I wanted to test before creating a PR: instead of using Body().AsString(), which requires a properly formatted JSON string, would it be possible to use Body().AsRaw() with something like a map? That might be easier to construct in pipeline transformations. What do you think?
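To illustrate the difference between the two paths (a stdlib-only sketch; the real extension would hand the resulting map to goavro's BinaryFromNative, and the pdata calls are only mimicked here): with Body().AsString() the body must be a JSON string that is parsed into a native map first, while Body().AsRaw() on a map-valued body already yields a map[string]any with no parsing step.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fromJSONBody mimics the Body().AsString() path: the log body must
// be well-formed JSON, which is parsed into a native map.
func fromJSONBody(body string) (map[string]any, error) {
	var native map[string]any
	if err := json.Unmarshal([]byte(body), &native); err != nil {
		return nil, err
	}
	return native, nil
}

// fromRawBody mimics the Body().AsRaw() path: a map-valued body is
// already a map[string]any and needs no JSON round-trip.
func fromRawBody(body any) (map[string]any, error) {
	native, ok := body.(map[string]any)
	if !ok {
		return nil, fmt.Errorf("body is not a map")
	}
	return native, nil
}

func main() {
	a, _ := fromJSONBody(`{"Key":"cpu","Value":42}`)
	b, _ := fromRawBody(map[string]any{"Key": "cpu", "Value": 42})
	// both paths end with a native map ready for an Avro codec
	fmt.Println(a["Key"], b["Key"])
}
```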

github-actions[bot] commented 3 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.


TorsteinOtterlei commented 2 months ago

Hi again @thmshmm! Really appreciate the work you are doing to get good Kafka support into the collector! Sorry for not answering your previous question; I have forgotten the details since I did my experimentation, so I don't have a good answer.

I saw the discussion on https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/34047, with the related issue https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/34064. If I understand correctly, the issue here is that batched messages entering the exporter would have to be individually Avro-encoded and transmitted? And that would be less performant and contradict users' expectations that messages are batched?

Is this a hard requirement for getting your Avro encoding implementation merged?

Because, as you were saying, with both Avro and Kafka it is hard to replicate the batching you can have with gRPC. Not only because of how Avro encoding works, but also because with the Kafka producer, batching is controlled by the batch.size and linger.ms settings rather than an actual number of messages, right?

Do you have any thoughts? Would love to help out on this if there is a chance we could get avro encoding for the collector.

thmshmm commented 1 month ago

Hi @TorsteinOtterlei, the limitation with batching is that Avro has no concept of a top-level array/list. Given a generic Avro schema that represents a single element, there is no way to map a batch into it. Maybe the way forward is to keep it as is and print warnings if the received data contains more than one element, so the user is aware that it's not supported and only the first will be mapped.