vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

datadog_agent source incompatible with opentelemetry-collector datadogexporter client v2 (JSON vs Protobuf) #16121

Open sean-snyk opened 1 year ago

sean-snyk commented 1 year ago

Problem

We're collecting metrics with the opentelemetry-collector and sending them to Vector using Datadog's submission protocol via the datadogexporter.

The datadogexporter was recently (v0.69.0) upgraded to use the native Datadog client libraries (https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/16776). This results in Vector rejecting the export requests with 422 Unprocessable Entity.

Configuration

# VECTOR_LOG=debug  vector --config vector.yaml 

sources:
  otelcol:
    type: datadog_agent
    address: 0.0.0.0:8001
    store_api_key: false

sinks:
  null_route:
    type: blackhole
    print_interval_secs: 1
    inputs:
      - otelcol

Version

vector 0.27.0 (x86_64-unknown-linux-gnu 5623d1e 2023-01-18)

Debug Output

2023-01-25T11:00:24.284324Z DEBUG hyper::proto::h1::io: parsed 7 headers
2023-01-25T11:00:24.284359Z DEBUG hyper::proto::h1::conn: incoming body is content-length (473 bytes)
2023-01-25T11:00:24.284378Z  INFO source{component_kind="source" component_id=otelcol component_type=datadog_agent component_name=otelcol}: warp::filters::trace: processing request
2023-01-25T11:00:24.284417Z DEBUG hyper::proto::h1::conn: incoming body completed
2023-01-25T11:00:24.284497Z DEBUG source{component_kind="source" component_id=otelcol component_type=datadog_agent component_name=otelcol}: warp::filters::query: route was called without a query string, defaulting to empty
2023-01-25T11:00:24.284579Z ERROR source{component_kind="source" component_id=otelcol component_type=datadog_agent component_name=otelcol}: warp::filters::trace: unable to process request (internal error) status=500 error=Some(Rejection(ErrorMessage { code: 422, message: "Error decoding Datadog sketch: DecodeError { description: \"unexpected end group tag\", stack: [] }" }))
2023-01-25T11:00:24.284707Z DEBUG hyper::proto::h1::io: flushed 251 bytes

Example Data

```yaml
# otelcol-contrib --config collector.yaml
# otelcol-contrib --config collector.yaml --feature-gates=-exporter.datadogexporter.metricexportnativeclient

# send test metric with
# echo "test.metric:42|c|#myKey:myVal" | nc -w 1 -u localhost 8125

receivers:
  statsd:
    endpoint: 0.0.0.0:8125
    enable_metric_type: true
    aggregation_interval: 1s
    timer_histogram_mapping:
      - statsd_type: histogram
        observer_type: histogram
      - statsd_type: timing
        observer_type: histogram

exporters:
  datadog:
    api:
      site: localhost:8001
      key: foobar
    metrics:
      endpoint: http://localhost:8001
      resource_attributes_as_tags: true
      instrumentation_scope_metadata_as_tags: true
    host_metadata:
      enabled: false

service:
  pipelines:
    metrics:
      receivers:
        - statsd
      exporters:
        - datadog

  telemetry:
    logs:
      level: debug
```

Additional Context

Support for OTLP ingest of metrics would also be very much appreciated.

References

https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/17373#issuecomment-1403451116

davidhuie-dd commented 1 year ago

I reproduced this earlier today and confirmed.

mx-psi commented 1 year ago

Any updates on this @davidhuie-dd?

neuronull commented 1 year ago

> Any updates on this @davidhuie-dd?

πŸ‘‹ I am starting to look into this today

neuronull commented 1 year ago

I confirmed that vector is attempting to parse the incoming request against the metrics v2 endpoint.

The failure we are observing:

unable to process request (internal error) status=500 error=Some(Rejection(ErrorMessage { code: 422, message: "Error decoding Datadog sketch: DecodeError { description: \"unexpected end group tag\", stack: [] }" }))

is occurring while attempting to deserialize the incoming request, which we expect to be a gRPC frame.

The unexpected end group tag part appears to come from protobuf land.

I checked the proto file Vector is using (https://github.com/vectordotdev/vector/blob/master/proto/ddsketch_full.proto) against where it was sourced (https://github.com/DataDog/sketches-go/blob/0a92170/ddsketch/pb/ddsketch.proto), and that all looks good.

So, I'm left to wonder if the incoming data is correctly structured.
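
For illustration, a minimal sketch of why such an error points at protobuf decoding, assuming the prost crate; `Probe` is a hypothetical stand-in message, not one of Vector's generated types. Feeding a decoder JSON text instead of protobuf wire format almost always fails with a DecodeError like the one above:

```rust
// Minimal sketch, not Vector's code: `Probe` is a hypothetical stand-in message.
use prost::Message;

#[derive(Clone, PartialEq, prost::Message)]
struct Probe {
    #[prost(string, tag = "1")]
    metric: String,
}

fn main() {
    // A JSON body such as the one a Datadog API client would send.
    let json_body = br#"{"series":[{"metric":"test.metric","points":[{"value":42}]}]}"#;

    // Interpreting JSON text as protobuf wire format almost always fails,
    // with errors like "unexpected end group tag" or "invalid wire type value".
    match Probe::decode(&json_body[..]) {
        Ok(msg) => println!("unexpectedly decoded: {:?}", msg),
        Err(err) => println!("DecodeError: {}", err),
    }
}
```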

mx-psi commented 1 year ago

The data sent by the Datadog exporter (and the Datadog Agent) to the /api/beta/sketches endpoint uses this proto file, which predates and is different from the DDSketch proto format. To my knowledge, DDSketches can't currently be sent directly through our public metrics submission API.

I still don't understand from the original report at https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/17373#issuecomment-1403451116 why this would only happen with the new feature gate, given the code for sketches has been left unchanged, but it makes sense to me that this would fail since Vector is trying to decode a different kind of payload from the one we send.

neuronull commented 1 year ago

> The data sent by the Datadog exporter (and the Datadog Agent) to the /api/beta/sketches endpoint uses this proto file which predates and is different from the DDSketch proto format

Hmmm ... Vector is receiving the request on the endpoint /api/v2/series πŸ€”

We do support the /api/beta/sketches endpoint, but that is not the code path taken in this scenario.

Is it possible that the exporter is sending a payload for the beta/sketches endpoint to the v2/series endpoint?

mx-psi commented 1 year ago

> Hmmm ... Vector is receiving the request on the endpoint /api/v2/series 🤔

My bad, I assumed you were talking about the other endpoint based on the Error decoding Datadog sketch error message, since we don't send sketches to /api/v2/series. Our exporter sends gauges and counts to /api/v2/series but not sketches (I don't think they are even supported on that endpoint). The payload format is defined here (MetricsSeries is defined here), i.e. the format that we publicly document on our docs. Vector's equivalent seems to be this one (edit: nope, it's actually this one).

Re-reading OP's example, they send a statsd count that will result in a Datadog count, so it makes sense that a payload is sent to /api/v2/series.

Comparing the two schemas and re-reading our code, one difference is the resources field. This field is defined in the datadog-api-client-go library and we set it explicitly to specify the host, while I don't see that field on Vector's definition of the payload (edit: it's there on the right proto, so it's probably not this). If that was related I would expect the examples here to make this Vector source fail too; that's something we can try out.
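
For reference, a sketch of the documented /api/v2/series JSON shape being compared here, including the resources field; built with serde_json purely for illustration, with placeholder host and timestamp values:

```rust
// Field names follow the public /api/v2/series docs examples referenced above;
// values are placeholders.
use serde_json::json;

fn main() {
    let payload = json!({
        "series": [{
            "metric": "test.metric",
            "type": 1, // 1 = count in the v2 series schema
            "points": [{ "timestamp": 1_675_000_000, "value": 42.0 }],
            // The `resources` field discussed above, as set by the exporter.
            "resources": [{ "name": "some-host", "type": "host" }],
            "tags": ["myKey:myVal"]
        }]
    });
    println!("{}", serde_json::to_string_pretty(&payload).unwrap());
}
```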

I also noticed that the error here talks about sketches while decoding an /api/v2/series payload. I think that is also wrong and explains the confusing error message.

@neuronull does this make sense to you as a possible explanation of the error? Would you mind trying to reproduce with one of the examples in our docs to see if you get the same error?

neuronull commented 1 year ago

Ah... I definitely see where that error message is misleading, now that you point it out. I suspect it was copy-pasted and not updated. Regardless of the end result of this investigation, I'll make sure to PR an updated error message to dispel that confusion.

I took a look at those submission examples you pointed to. I do agree that the resources field looks correct.

What seems to be happening here, at least from what I can tell, is that Vector is only capable of receiving protobuf-encoded data on the v2/series endpoint.

Those Datadog submission examples send JSON. Indeed, when I tried ingesting those examples into Vector, similar results occurred. If I omitted the resources field, a clearer error was displayed from the attempt to deserialize the HTTP body as protobuf messages:

{"code":422,"message":"Error decoding Datadog sketch: DecodeError { description: \"invalid wire type value: 6\", stack: [] }"}

That's because the HTTP body is text and not protobuf-encoded, which tracks with code inspection.
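
A sketch of that experiment, assuming the reqwest crate with the blocking feature and the source address from the vector.yaml above; expect the request to be rejected with a DecodeError like the ones quoted in this thread:

```rust
// Illustrative reproduction: POST a docs-style JSON body to the datadog_agent
// source, which currently tries to decode it as protobuf.
fn main() -> Result<(), reqwest::Error> {
    let body = r#"{"series":[{"metric":"test.metric","type":1,"points":[{"timestamp":1675000000,"value":42.0}]}]}"#;

    let resp = reqwest::blocking::Client::new()
        .post("http://localhost:8001/api/v2/series")
        .header("DD-API-KEY", "foobar")
        .header("Content-Type", "application/json")
        .body(body)
        .send()?;

    // Expect a rejection with a DecodeError message in the response body.
    let status = resp.status();
    let text = resp.text()?;
    println!("{}: {}", status, text);
    Ok(())
}
```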

The origin of the support for the v2 metrics endpoint is this PR: https://github.com/vectordotdev/vector/pull/13028, and it is not clear from that PR whether there are any plans to support JSON-formatted inputs. I also searched open issues in the Vector repo and found nothing related to this.

Generally the use case has been [...] -> [Datadog Agent] -> [Vector] -> [Datadog backend]

It seems this use case is attempting to make Vector a "drop-in" replacement for the Datadog Agent in this pipeline:

https://datadog-docs.imgix.net/images/metrics/otel/otlp_ingestion_update.ee6fea7e7906ea4946d2945c6f9c5a52.png

What I'm suspecting is that the Datadog Agent outputs protobuf-encoded payloads to the metrics v2 endpoint, and that Vector's datadog_agent source was modeled after the consumption of that (like the Datadog backend), whereas this use case assumes that Vector can ingest incoming data in the same way the Agent can.

I don't yet understand if this is a requirement for Vector to have that functionality.

Also noting that we don't have robust testing on the v2 metrics ingestion (https://github.com/vectordotdev/vector/issues/15108)

/ cc @jszwedko

mx-psi commented 1 year ago

The protobuf explanation makes sense to me; I think that explains this. Based on this function, it looks like the Agent uses protobuf when sending to the v2 API endpoint.

Unfortunately, AFAICT the official Datadog API client only supports sending metrics as JSON (Content-Type is always set to application/json), so from our side there's not much we can do. I reached out to the team responsible for datadog-api-client-go to confirm that sending metrics as protobuf is not supported and asked whether it will be in the future.
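
To make the mismatch concrete, a sketch contrasting the two encodings of the same series, assuming the prost crate; `SeriesProbe` is a hypothetical stand-in, not the real v2 series schema:

```rust
// Illustration only: the same series as protobuf wire bytes (what the Agent
// sends to /api/v2/series and what Vector's source expects) versus JSON text
// (what datadog-api-client-go sends).
use prost::Message;

#[derive(Clone, PartialEq, prost::Message)]
struct SeriesProbe {
    #[prost(string, tag = "1")]
    metric: String,
    #[prost(double, tag = "2")]
    value: f64,
}

fn main() {
    let probe = SeriesProbe { metric: "test.metric".to_string(), value: 42.0 };

    // Protobuf: compact binary, only decodable with the matching schema.
    let mut wire = Vec::new();
    probe.encode(&mut wire).expect("encoding into a Vec cannot fail");
    println!("protobuf bytes: {:02x?}", wire);

    // JSON: self-describing text; feeding this to a protobuf decoder fails,
    // as shown earlier in the thread.
    println!("json body: {}", r#"{"metric":"test.metric","value":42.0}"#);
}
```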

neuronull commented 1 year ago

Thanks for confirming that about the Agent using protobuf.

I did confirm with others on the Vector team that, as I suspected, the design of the datadog_agent source is meant to mimic that of the Datadog backend.

There are a few things we could do in Vector to address this. We will raise them and discuss internally.

sean-snyk commented 1 year ago

> Generally the use case has been [...] -> [Datadog Agent] -> [Vector] -> [Datadog backend]

To clarify: our use case is [...] -> [OTel Collector] -> [Vector] -> [Datadog backend], since the OTel Collector has the capability to read Prometheus ServiceMonitors. We standardized on the OTel Collector for ingest of other protocols too, as Vector doesn't support enhancing metrics with additional k8s metadata.

jszwedko commented 1 year ago

Small update: we discussed this and believe the best approach to support this use-case is to have the opentelemetry source accept metrics so that the native OpenTelemetry exporter can be used. The options considered were:

  1. Have the datadog_agent source accept more types of payloads than the Datadog Agent sends, including those sent by the OpenTelemetry Datadog exporter. We decided this wasn't in the spirit of the datadog_agent source, which is only meant to receive data from the Datadog Agent itself, and it would overcomplicate its implementation.
  2. Have a new datadog_metrics_api source that is intended to match the Datadog metrics ingest API. This should be compatible with the OpenTelemetry Datadog exporter.
  3. Have the opentelemetry source accept metrics so that the native OTLP exporter can be used.

Option 3 seemed the best to us since we wanted to add this support anyway. Curious to hear any additional feedback though.