sean-snyk opened 1 year ago
I reproduced this earlier today and confirmed.
Any updates on this @davidhuie-dd?
I am starting to look into this today.
I confirmed that vector is attempting to parse the incoming request against the metrics v2 endpoint.
The failure we are observing:
unable to process request (internal error) status=500 error=Some(Rejection(ErrorMessage { code: 422, message: "Error decoding Datadog sketch: DecodeError { description: \"unexpected end group tag\", stack: [] }" }))
is occurring while attempting to deserialize the incoming request, which we expect to be a gRPC frame.
The part `unexpected end group tag` appears to come from protobuf land.
I checked the proto file Vector is using (https://github.com/vectordotdev/vector/blob/master/proto/ddsketch_full.proto) against where it was sourced (https://github.com/DataDog/sketches-go/blob/0a92170/ddsketch/pb/ddsketch.proto), and that all looks good.
So, I'm left to wonder if the incoming data is correctly structured.
The data sent by the Datadog exporter (and the Datadog Agent) to the /api/beta/sketches endpoint uses this proto file, which predates and is different from the DDSketch proto format. To my knowledge, DDSketches currently can't be sent directly through our public metrics submission API.
I still don't understand from the original report at https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/17373#issuecomment-1403451116 why this would only happen with the new feature gate, given the code for sketches has been left unchanged, but it makes sense to me that this would fail since Vector is trying to decode a different kind of payload from the one we send.
> The data sent by the Datadog exporter (and the Datadog Agent) to the /api/beta/sketches endpoint uses this proto file which predates and is different from the DDSketch proto format
Hmmm ... Vector is receiving the request on the endpoint /api/v2/series 🤔
We do support the /api/beta/sketches endpoint, but that is not the code path taken in this scenario. Is it possible that the exporter is sending a payload for the beta/sketches endpoint to the v2/series endpoint?
> Hmmm ... Vector is receiving the request on the endpoint /api/v2/series 🤔
My bad, I assumed you were talking about the other endpoint based on the `Error decoding Datadog sketch` error message, since we don't send sketches to /api/v2/series. Our exporter sends gauges and counts to /api/v2/series but not sketches (I don't think they are even supported on that endpoint). The payload format is defined here (MetricsSeries is defined here), i.e. the format that we publicly document in our docs. Vector's equivalent seems to be this one (edit: nope, it's actually this one).
Re-reading OP's example: they send a statsd count that will result in a Datadog count, so it makes sense that a payload is sent to /api/v2/series.
Comparing the two schemas and re-reading our code, one difference is the `resources` field. This field is defined in the datadog-api-client-go library and we set it explicitly to specify the host, while I don't see that field in Vector's definition of the payload (edit: it's there in the right proto, so it's probably not this). If that were related, I would expect the examples here to make this Vector source fail too; that's something we can try out.
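For concreteness, here is a minimal sketch (not taken from the exporter's code) of the JSON body shape that the public v2 series API documents, including the `resources` field discussed above; the metric name, timestamp, and host are placeholders:

```python
import json

# Sketch of a /api/v2/series JSON body per Datadog's public metrics API.
# Metric name, timestamp, and host below are placeholders.
payload = {
    "series": [
        {
            "metric": "example.requests.count",
            "type": 1,  # count, in the v2 metric-type enum
            "points": [{"timestamp": 1674000000, "value": 5.0}],
            # The `resources` field carries the host, as set explicitly
            # by datadog-api-client-go.
            "resources": [{"name": "my-host", "type": "host"}],
        }
    ]
}

body = json.dumps(payload)
print(body[0])  # '{' -- the first byte a protobuf decoder would see
```

A body like this is valid JSON for the documented API, but a receiver that tries to decode it as protobuf will fail in confusing ways, which is where this investigation is heading.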
I also noticed that the error here talks about sketches while decoding an /api/v2/series payload. I think that is also wrong and explains the confusing error message.
@neuronull does this make sense to you as a possible explanation of the error? Would you mind trying to reproduce with one of the examples in our docs to see if you get the same error?
Ah... I definitely see where that error message is misleading, now that you point it out. I suspect it was copy-pasted and not updated. Regardless of the end result of this investigation, I'll make sure to PR an updated error message to dispel that confusion.
I took a look at those submission examples you pointed to. I agree that the `resources` field looks correct.

What seems to be happening here, at least from what I can tell, is that Vector is only capable of receiving protobuf-encoded data on the v2/series endpoint.

Those Datadog submission examples send JSON. Indeed, when I tried to ingest those examples into Vector, similar results occurred. If I omitted the resources, a clearer error was displayed from the attempt to deserialize the HTTP body as protobuf messages:
{"code":422,"message":"Error decoding Datadog sketch: DecodeError { description: \"invalid wire type value: 6\", stack: [] }"}
That's because the HTTP body is text and not protobuf encoded, which tracks with code inspection.
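To see why a JSON body produces exactly these protobuf errors, here is a small from-scratch sketch (no Vector code involved) that decodes a single byte as a protobuf wire-format tag:

```python
WIRE_TYPES = {
    0: "varint", 1: "64-bit", 2: "length-delimited",
    3: "start group", 4: "end group", 5: "32-bit",
}

def read_tag(byte: int):
    """Decode a single-byte varint tag into (field_number, wire_type)."""
    return byte >> 3, byte & 0x07

# A JSON body starts with '{' (0x7B), which decodes as field 15 with
# wire type 3 ("start group"); the decoder then scans for a matching
# end-group tag it never finds, which plausibly surfaces as the
# "unexpected end group tag" error seen earlier in this thread.
print(read_tag(ord("{")))  # (15, 3)

# Many ASCII bytes decode to a wire type that doesn't exist at all,
# e.g. 'n' (0x6E) gives 6 -- matching "invalid wire type value: 6".
print(read_tag(ord("n")))  # (13, 6)
```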
The origin of the support for the v2 metrics endpoint is this PR: https://github.com/vectordotdev/vector/pull/13028, and it is not clear from that PR whether there are any plans to support JSON-formatted inputs. I also searched open issues in the Vector repo and found nothing related to this.
Generally the use case has been [...] -> [Datadog Agent] -> [Vector] -> [Datadog backend]
It seems this use case is attempting to make Vector a "drop in" replacement for the Datadog Agent in this pipeline.
What I'm suspecting is that the Datadog Agent outputs protobuf-encoded payloads to the metrics v2 endpoint, and that Vector's datadog_agent source was modeled after consuming those (like the Datadog backend does), whereas this use case assumes that Vector can ingest incoming data the same way the Agent does.
I don't yet understand if this is a requirement for Vector to have that functionality.
Also noting that we don't have robust testing on the v2 metrics ingestion (https://github.com/vectordotdev/vector/issues/15108)
/cc @jszwedko
The protobuf explanation makes sense to me, I think that explains this. Based on this function it looks like the Agent uses protobuf when sending to the v2 API endpoint.
Unfortunately, AFAICT the official Datadog API client only supports sending metrics as JSON (Content-Type is always set to application/json), so from our side there's not much we can do. I reached out to the team responsible for datadog-api-client-go to confirm that sending metrics as protobuf is not supported and asked whether it will be in the future.
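One low-cost mitigation (purely hypothetical, not Vector's actual code) would be to check Content-Type up front and fail with a clear 415 instead of a misleading protobuf DecodeError. A sketch, assuming protobuf bodies arrive with an application/x-protobuf content type:

```python
def check_series_request(content_type: str):
    """Return an (HTTP status, message) pair for a v2/series request.

    Hypothetical guard: assumes the Datadog Agent labels its protobuf
    payloads as application/x-protobuf.
    """
    if content_type.startswith("application/x-protobuf"):
        return 200, "ok"
    if content_type.startswith("application/json"):
        # The official datadog-api-client-go always sends JSON, which
        # this endpoint cannot currently decode.
        return 415, "v2/series only accepts protobuf-encoded payloads"
    return 415, "unsupported content type: " + content_type

print(check_series_request("application/json")[0])        # 415
print(check_series_request("application/x-protobuf")[0])  # 200
```

A guard like this would at least turn the 422 with a sketch-flavored message into an actionable error for users of the JSON client.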
Thanks for confirming that about the Agent using protobuf.
I did confirm with others on the Vector team that my suspicion is correct: the design of the datadog_agent source is meant to mimic that of the Datadog backend.
There are a few possible changes we could make in Vector to address this. We will raise that and discuss it internally.
> Generally the use case has been [...] -> [Datadog Agent] -> [Vector] -> [Datadog backend]
To clarify:
Our use case is [...] -> [OTel Collector] -> [Vector] -> [Datadog backend]
since the OTel Collector has the capability to read Prometheus ServiceMonitors. We standardized on the OTel Collector for ingest of other protocols too, as Vector doesn't support enhancing metrics with additional k8s metadata.
Small update: we discussed this and believe the best approach to support this use case is to have the opentelemetry source accept metrics, so that the native OpenTelemetry exporter can be used. The options considered were:

1. Have the datadog_agent source accept more types of payloads than the Datadog Agent sends, including those sent by the OpenTelemetry Datadog exporter. We decided this wasn't in the spirit of the datadog_agent source, which is only meant to receive data from the Datadog Agent itself, and would overcomplicate its implementation.
2. Add a datadog_metrics_api source that is intended to match the Datadog metrics ingest API. This should be compatible with the OpenTelemetry Datadog exporter.
3. Have the opentelemetry source accept metrics so that the native OTLP exporter can be used.

Option 3 seemed the best to us since we wanted to add this support anyway. Curious to hear any additional feedback though.
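If option 3 lands, the Collector side would need nothing Datadog-specific. A sketch of what that pipeline might look like, where the Vector endpoint, port, and the very existence of OTLP metrics support in the opentelemetry source are all assumptions rather than current behavior:

```yaml
# Hypothetical: assumes a Vector `opentelemetry` source that accepts
# metrics over OTLP gRPC. Endpoint and port are placeholders.
exporters:
  otlp:
    endpoint: vector.example.internal:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]   # e.g. scraping via ServiceMonitors
      exporters: [otlp]
```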
Problem
We're collecting metrics with the opentelemetry-collector and sending them to Vector using Datadog's submission protocol via the datadogexporter. The datadogexporter was recently (v0.69.0) upgraded to use native Datadog client libraries (https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/16776). This results in Vector rejecting the export requests with 422 Unprocessable Entity.
Configuration
Version
vector 0.27.0 (x86_64-unknown-linux-gnu 5623d1e 2023-01-18)
Debug Output
Additional Context
Support for OTLP ingest of metrics would also be very much appreciated.
References
https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/17373#issuecomment-1403451116