vectordotdev / vector

`datadog_agent` source is not processing V2 API payload from Datadog agent accurately. #18690

Open rpriyanshu9 opened 9 months ago

rpriyanshu9 commented 9 months ago

Problem

Hey there,

After upgrading the Datadog Agent from 7.39.2 to 7.45.0, we observed that some metrics that use the device tag stopped arriving. We investigated and found that the device tag had been renamed to resource.device after the upgrade. As a result, many dashboards went empty and monitors fired in Datadog, and we had to revert the upgrade to fix this. We then looked into the source code of the Datadog Agent and Vector to find the root cause.

Here's what we think is causing this:

Starting with the 7.43.2 release of the Datadog Agent, the device tag is sent as part of the resources array: https://github.com/DataDog/datadog-agent/pull/16264.

The datadog_agent source accepts the V2 API payload with the resources field, but it does not correctly handle tags that are sent in resources rather than in the tags array. Ref: https://github.com/vectordotdev/vector/blob/53cad38db12ceb11e0394b4d5906f7de541ec7dc/src/sources/datadog_agent/metrics.rs#L270-L281

Because of the block of code referenced above, the device tag that arrives as an element of resources is remapped to resource.device by the datadog_agent source. As a result, the metrics sent by the datadog_metrics sink carry the resource.device tag, which is incorrect; the tag should be just device.

Seeking assistance in resolving this issue.

Discord thread: https://discord.com/channels/742820443487993987/1155850005391880214

cc @datsabk @jszwedko

Configuration

    api:
      address: 0.0.0.0:8686
      enabled: true
      playground: false
    data_dir: /data/vector
    sinks:
      datadog_metrics:
        batch:
          max_bytes: 512000
        buffer:
          max_events: 10000
          type: memory
          when_full: block
        default_api_key: ${DD_API_KEY}
        inputs:
        - modify_tags_for_datadog
        type: datadog_metrics
    sources:
      datadog_agent:
        address: 0.0.0.0:8282
        disable_logs: true
        disable_traces: true
        multiple_outputs: false
        store_api_key: true
        type: datadog_agent
      vector_source:
        address: 0.0.0.0:9000
        type: vector
    transforms:
      filter_for_datadog:
        condition:
          source: "true"
          type: vrl
        inputs:
        - datadog_agent
        - vector_source
        type: filter
      modify_tags_for_datadog:
        inputs:
        - filter_for_datadog
        source: |-
          del(.tags."k2.version")
          del(.tags."k2.skip_checks")
          del(.tags.container_id)
          del(.tags.display_container_name)
          del(.tags."git.commit.sha")
          del(.tags.kube_ownerref_name)
          del(.tags.kube_replica_set)
          del(.tags."io.kubernetes.pod.uid")
          del(.tags.image_id)
        type: remap

Version

vector 0.30.0

Example Data

{
    "metric": {
        "name": "disk.in_use",
        "namespace": "system",
        "tags": {
            "device_name": "loop0",
            "host": "i-03fe32ac191d77928",
            "resource.device": "/dev/loop0",
            "source_type_name": "System"
        },
        "timestamp": "2023-09-26T15:25:14Z",
        "kind": "absolute",
        "gauge": {
            "value": 0.192
        }
    }
}

neuronull commented 9 months ago

👋 Thanks for the thorough report and analysis here. After reviewing everything, I agree with the consensus.

Basically, the datadog_metrics sink encoder encodes without knowing that the v2 parser in the agent source has namespaced the device tag as resource.device. Since we want to handle both the v1 and v2 endpoints, the sink encoder should check for the presence of both. Alternatively, the agent source could be consistent about whether or not it namespaces the tag.

Relatedly, this is the type of behavior we will want to test in the end-to-end test cases for the Datadog components, which are in progress. I will link this issue there.

In the meantime, I believe this could be worked around by configuring the Agent to send on the v1 series endpoint instead of using the default of v2. This will mean Vector uses the parsing for v1, which doesn't namespace the device. Another workaround could be to have a transform that intercepts and removes the namespace for that tag.
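A minimal, untested sketch of the transform-based workaround, written against the configuration posted above. The transform name rename_resource_device is hypothetical, and the sketch assumes the v2 parser emits the tag exactly as resource.device, as in the example data. If a metric already carries a device tag, this would overwrite it.

```yaml
transforms:
  # Hypothetical transform; place it between the datadog_agent source
  # and the datadog_metrics sink (or fold the VRL into an existing remap).
  rename_resource_device:
    type: remap
    inputs:
      - datadog_agent
    source: |-
      # Assumption: the namespaced tag is named "resource.device".
      # Move its value back to the plain "device" tag.
      if exists(.tags."resource.device") {
        .tags.device = del(.tags."resource.device")
      }
```

In the reporter's pipeline, the same VRL lines could instead be added to the existing modify_tags_for_datadog transform, which avoids rewiring inputs.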

jszwedko commented 9 months ago

For the workaround, to configure the Agent to use the v1 endpoint you can set use_v2_api.series: false in the Agent configuration file (or set DD_USE_V2_API_SERIES=false).
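As a sketch, the setting might look like this in the Agent configuration file. This assumes the dotted key use_v2_api.series maps to the nested YAML form shown here; check it against the Agent's configuration template before relying on it.

```yaml
# datadog.yaml (Datadog Agent configuration file)
# Assumption: "use_v2_api.series" is written as a nested key.
use_v2_api:
  series: false
```

The environment variable form, DD_USE_V2_API_SERIES=false, should have the same effect when the Agent is configured through its environment, for example in a container.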

neuronull commented 9 months ago

Another thought: there are in-progress changes to migrate the datadog_metrics sink to send to the v2 series endpoint. In those changes, I'm handling this discrepancy in the source's decoding. Essentially, once those changes are merged, this issue should also be resolved.

rpriyanshu9 commented 9 months ago

> For the workaround, to configure the Agent to use the v1 endpoint you can set use_v2_api.series: false in the Agent configuration file (or set DD_USE_V2_API_SERIES=false).

Yeah, for now we're using this variable to get past the issue. BTW, it's the datadog_agent source that is at fault, right?

neuronull commented 8 months ago

👋 This issue was addressed in https://github.com/vectordotdev/vector/pull/18761, which is included in the recent v0.34.0 release.

neuronull commented 8 months ago

Reopening since v0.34.1 will contain #19138, which reverts to the v1 behavior.

rpriyanshu9 commented 3 months ago

Hi @neuronull @jszwedko, are there any updates on this issue?

jszwedko commented 3 months ago

> Hi @neuronull @jszwedko, are there any updates on this issue?

No updates unfortunately; I believe this issue still exists. The fix we'd like to do is to switch the datadog_metrics sink to using the /v2 metrics API.