
Moleculer Datadog integration is not working #725

Closed: sambaran69 closed this issue 1 year ago

sambaran69 commented 4 years ago

Expected Behavior

I should be able to get traces from my Moleculer services on Datadog APM. I am using the latest version of Moleculer.

Current Behavior

I do not get any trace information. My Moleculer service log reports a 400 Bad Request error, and correspondingly the Datadog agent log contains error messages stating that the trace payload could not be decoded.

Failure Information

The agent version is v6.18.0 and the tracer (dd-trace) version is 0.19.1.

I got the following in the moleculer service log:

Request to the agent: {"path":"/v0.4/traces","method":"PUT","headers":{"Content-Type":"application/msgpack","Datadog-Meta-Tracer-Version":"0.19.1","X-Datadog-Trace-Count":"2","Datadog-Meta-Lang":"nodejs","Datadog-Meta-Lang-Version":"v12.16.1","Datadog-Meta-Lang-Interpreter":"v8"},"protocol":"http:","hostname":"10.101.67.78","port":"8126"}
Error: Error from the agent: 400 Bad Request
    at IncomingMessage.<anonymous> (/app/node_modules/dd-trace/packages/dd-trace/src/platform/node/request.js:35:23)
    at IncomingMessage.emit (events.js:323:22)
    at IncomingMessage.EventEmitter.emit (domain.js:482:12)
    at endReadableNT (_stream_readable.js:1204:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21) {
  status: 400
}

I am getting the following in the Datadog agent log:

2020-04-23 03:31:25 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "array" with method for "str"
2020-04-23 03:31:33 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-04-23 03:31:35 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-04-23 03:31:43 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-04-23 03:31:45 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-04-23 03:31:53 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-04-23 03:31:55 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-04-23 03:32:03 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-04-23 03:32:05 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"

Steps to Reproduce

  1. I have created two env variables for the container running my Moleculer service: DD_AGENT_HOST and DD_AGENT_URL. The service is able to connect to the Datadog agent and send tracing data. (See the configuration sketch below these steps.)

  2. I am initialising the service as:

    "tracing": {
      "enabled": true,
      "events": true,
      "stackTrace": true,
      "exporter": {
        "type": "Datadog",
        "options": {
          "tracerOptions": {
            "analytics": true,
            "debug": true,
            "enabled": true,
            "service": "my-service"
          }
        }
      }
    }
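
For reference, the same settings expressed as a moleculer.config.js sketch. The explicit agentUrl line is my assumption about how DD_AGENT_URL gets picked up; the exporter is expected to fall back to http://localhost:8126 if it is omitted.

// moleculer.config.js (sketch; the agentUrl/env-var fallback is an assumption, not verified)
module.exports = {
    tracing: {
        enabled: true,
        events: true,
        stackTrace: true,
        exporter: {
            type: "Datadog",
            options: {
                // Assumed: the exporter reads DD_AGENT_URL when agentUrl is not set explicitly
                agentUrl: process.env.DD_AGENT_URL || "http://localhost:8126",
                tracerOptions: {
                    analytics: true,
                    debug: true,
                    enabled: true,
                    service: "my-service"
                }
            }
        }
    }
};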

Reproduce code snippet

    "tracing": {
      "enabled": true,
      "events": true,
      "stackTrace": true,
      "exporter": {
        "type": "Datadog",
        "options": {
          "tracerOptions": {
            "analytics": true,
            "debug": true,
            "enabled": true,
            "service": "my-service"
          }
        }
      }
    }

Failure Logs

See above

lucduong commented 4 years ago

Here is my configuration for Datadog tracing, but it couldn't send traces to Datadog APM:

{
  "enabled": true,
  "stackTrace": true,
  "exporter": {
    "type": "Datadog",
    "options": {
      "agentUrl": "http://10.104.0.7:8126",
      "env": null,
      "samplingPriority": "AUTO_KEEP",
      "defaultTags": [
        "recruit"
      ],
      "tracerOptions": {
        "analytics": true
      }
    }
  }
}

I installed Datadog using the Helm chart with the values below:

datadog:
  apiKey: KEY
  logLevel: INFO
  kubeStateMetricsEnabled: true
  clusterChecks:
    enabled: false
  dogstatsd:
    port: 8125
    useHostPort: true
    nonLocalTraffic: true
  logs:
    enabled: true
    containerCollectAll: true
  apmEnabled: true
  apm:
    enabled: true
    port: 8126
...

ngraef commented 4 years ago

I'm seeing this issue too. I filed https://github.com/DataDog/dd-trace-js/issues/970, but after more investigation I believe the core problem is moleculer improperly setting IDs on the dd-trace spans. dd-trace's MessagePack encoder expects the traceId and spanId to be 8 bytes, because that's what the id generator defaults to. However, moleculer overrides the dd-trace span's generated IDs with the 16-byte GUIDs used by moleculer's internal tracer. The extra bytes aren't handled in the MessagePack encoding and it ends up corrupting the trace payload, resulting in decoding errors on the agent.

In addition to corrupting the trace payload, it seems to also corrupt some internal state in dd-trace because future trace payloads will also be corrupt, even if their ids are the expected 8 bytes. I haven't yet gone far enough to figure out why this happens.
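
A quick standalone illustration of the size mismatch (a sketch only; the IDs below are made up, and this is not the actual exporter or encoder code):

// Node.js sketch: dd-trace's default IDs are 64-bit and are serialized as fixed
// 8-byte unsigned integers, while Moleculer's internal tracer generates UUIDs.
const ddStyleId = "1c1f719b2c0a3898";                       // 16 hex chars
console.log(Buffer.from(ddStyleId, "hex").length);          // 8 -> fits the expected field

const moleculerId = "69dbcbfd-dc98-4f86-9f39-dc3e1ebe5a6e"; // hypothetical Moleculer span ID
const hex = moleculerId.replace(/-/g, "");                  // 32 hex chars
console.log(Buffer.from(hex, "hex").length);                // 16 -> twice the expected size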

(Related: #690. Perhaps these issues can be solved at the same time.)

icebob commented 4 years ago

Yeah, it can be a problem; btw, these 16-byte IDs worked with earlier versions. The main problem is that dd-trace has no exported method to create a span_id & trace_id from an existing UUID, and the private ID generation in dd-trace keeps changing between versions, so I can't keep rewriting this part of the exporter. I mean, I can fix it for the current latest version of dd-trace, but that doesn't mean it will keep working in the future. The final solution would be if dd-trace had a method for this.

icebob commented 4 years ago

@ngraef could you try it with the master branch, please? It shows tracing for me (screenshot attached).

rochdev commented 4 years ago

The final solution would be if dd-trace had a method for this.

I don't think this covers everything. Even if we were to expose everything that is needed by Moleculer, how will spans created by Moleculer interact with spans created by Datadog? For example, we usually automatically set the parent of a span based on the span that is currently active in the current execution context. This can only be done if spans are created while they are active, and not after the fact from external tracing metadata.
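
A minimal sketch of what that scope-based parenting looks like with dd-trace's public API (the span names here are illustrative only, not Moleculer's integration):

// dd-trace public scope API sketch (span names are illustrative)
const tracer = require("dd-trace").init();

const parent = tracer.startSpan("moleculer.request");
tracer.scope().activate(parent, () => {
    // While "parent" is active, a new span can be linked to it explicitly;
    // dd-trace's own integrations (http, redis, ...) parent to the active span automatically.
    const child = tracer.startSpan("http.request", {
        childOf: tracer.scope().active()
    });
    child.finish();
});
parent.finish();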

As a more concrete example, say you have service A and service B, and service A calls service B using HTTP. What I would expect to see is a trace with a moleculer.request span as its root, which would have an http.request client span as its child, which itself would have a moleculer.request child from service B. Right now, the two moleculer.request spans would be created by Moleculer, and the http.request client span would be created by Datadog. How can these be linked together properly during execution?

In general we solve this by adding our own integration which replaces the built-in tracing functionality of the framework. I wonder how we could potentially instead achieve interoperability.

(please correct any of the above if incorrect since I'm not very familiar with Moleculer)

icebob commented 4 years ago

@rochdev Yeah, this is currently a problem: the Moleculer spans & Datadog spans are not connected. The issue is that Moleculer is a full, complex framework with a built-in tracing solution which monitors all services (actions, events, custom spans, etc.), so you can't create a dd-trace instrumentation for it the way you can for Express or Redis, where wrapping a few methods is sufficient to make tracing work.

With a previous version (maybe dd-trace 0.16) I tried to make the connection between Moleculer & Datadog spans, and it worked. I went through the full dd-trace source code and found a way to get the active span and to overwrite it from Moleculer (it was a dirty way because it used private variables in dd-trace). But these internals changed in later dd-trace versions and the code stopped working. Here is a full trace from that time: you can see that the first span comes from dd-trace (the http instrumentation) and there are Redis dd-trace spans between the Moleculer spans.

In brief, I could solve it if dd-trace had two public methods: one to get the current Datadog span (trace_id & span_id & parent_span_id) based on the current executionAsyncId, and another to set the current span (span_id & parent_span_id). Is there any chance you could add them to dd-trace?

rideddy84 commented 4 years ago

I have the same problem. If a perfect fix for this isn't easy, it would at least be better not to produce an unhandled exception.

Encoding trace: [{"trace_id":"1c1f719b2c0a3898","span_id":"1c1f719b2c0a3898","parent_id":"0000000000000000","name":"redis.command","resource":"brpoplpush","error":0,"meta":{"service":"moleculer","version":"0.14.7","span.kind":"client","db.type":"redis","out.host":"localhost","redis.raw_command":"BRPOPLPUSH bull:users.retirement:wait bull:users.retirement:active 5"},"metrics":{"_sample_rate":1,"db.name":1,"out.port":6379,"_dd.agent_psr":1,"_sampling_priority_v1":1},"start":1591221956761919000,"duration":5559285889,"service":"moleculer-redis","type":"redis"}]
Added encoded trace to buffer: 91 8c a8 74 72 61 63 65 5f 69 64 cf 1c 1f 71 9b 2c 0a 38 98 88 ea 60 27 59 30 29 9a cf 1c 1f 71 9b 2c 0a 38 98 88 ea 60 27 59 30 29 16 15 27 f0 63 98 9e 00 a8 64 75 72 61 74 69 6f 6e d3 00 00 00 01 4b 5b f8 81 a5 65 72 72 6f 72 00 a9 70 61 72 65 6e 74 5f 69 64 cf 00 00 00 00 00 00 00 00 a4 6e 61 6d 65 ad 72 65 64 69 73 2e 63 6f 6d 6d 61 6e 64 a8 72 65 73 6f 75 72 63 65 aa 62 72 70 6f 70 6c 70 75 73 68 a7 73 65 72 76 69 63 65 af 6d 6f 6c 65 63 75 6c 65 72 2d 72 65 64 69 73 a4 74 79 70 65 a5 72 65 64 69 73 a4 6d 65 74 61 86 a7 73 65 72 76 69 63 65 a9 6d 6f 6c 65 63 75 6c 65 72 a7 76 65 72 73 69 6f 6e a6 30 2e 31 34 2e 37 a9 73 70 61 6e 2e 6b 69 6e 64 a6 63 6c 69 65 6e 74 a7 64 62 2e 74 79 70 65 a5 72 65 64 69 73 a8 6f 75 74 2e 68 6f 73 74 a9 6c 6f 63 61 6c 68 6f 73 74 b1 72 65 64 69 73 2e 72 61 77 5f 63 6f 6d 6d 61 6e 64 d9 44 42 52 50 4f 50 4c 50 55 53 48 20 62 75 6c 6c 3a 75 73 65 72 73 2e 72 65 74 69 72 65 6d 65 6e 74 3a 77 61 69 74 20 62 75 6c 6c 3a 75 73 65 72 73 2e 72 65 74 69 72 65 6d 65 6e 74 3a 61 63 74 69 76 65 20 35 a7 6d 65 74 72 69 63 73 85 ac 5f 73 61 6d 70 6c 65 5f 72 61 74 65 cb 3f f0 00 00 00 00 00 00 a7 64 62 2e 6e 61 6d 65 cb 3f f0 00 00 00 00 00 00 a8 6f 75 74 2e 70 6f 72 74 cb 40 b8 eb 00 00 00 00 00 ad 5f 64 64 2e 61 67 65 6e 74 5f 70 73 72 cb 3f f0 00 00 00 00 00 00 b5 5f 73 61 6d 70 6c 69 6e 67 5f 70 72 69 6f 72 69 74 79 5f 76 31 cb 3f f0 00 00 00 00 00 00
Request to the agent: {"path":"/v0.4/traces","method":"PUT","headers":{"Content-Type":"application/msgpack","Datadog-Meta-Tracer-Version":"0.21.0","X-Datadog-Trace-Count":"2","Datadog-Meta-Lang":"nodejs","Datadog-Meta-Lang-Version":"v12.16.1","Datadog-Meta-Lang-Interpreter":"v8"},"protocol":"http:","hostname":"localhost","port":"8126"}
Inject into carrier: {"x-datadog-trace-id":"5204814146905357208","x-datadog-parent-id":"5204814146905357208","x-datadog-sampling-priority":"-1"}.
Error: Error from the agent: 400 Bad Request
    at IncomingMessage.<anonymous> (/www/poomgo_backend/node_modules/dd-trace/packages/dd-trace/src/platform/node/request.js:35:23)
    at /www/poomgo_backend/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:54:19
    at Scope._activate (/www/poomgo_backend/node_modules/dd-trace/packages/dd-trace/src/scope/async_hooks.js:51:14)
    at Scope.activate (/www/poomgo_backend/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:12:19)
    at IncomingMessage.bound (/www/poomgo_backend/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:53:20)
    at IncomingMessage.emit (events.js:323:22)
    at endReadableNT (_stream_readable.js:1204:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21) {
  status: 400
}
$ tail -f /var/log/datadog/trace-agent.log
2020-06-04 07:25:23 KST | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"

rochdev commented 4 years ago

@icebob Any way you can simply create spans using the existing API? Unless I'm misunderstanding something I think our public API already offers what you need, as long as it's possible to let dd-trace pick the IDs.

icebob commented 4 years ago

@rochdev could you show example how can I do that with dd-trace?

rochdev commented 4 years ago

@icebob You mentioned several operations that are traced above. If you can link me to the code of one of them and also to how it's currently traced I can try to do it using only our public interface if at all possible. If not, it will at least give me a better idea of how tracing works in Moleculer and what could be done to improve the interoperability.

icebob commented 4 years ago

@rochdev Great! In Moleculer there are actions and events; both of them are traced. Both have a Context instance which also contains tracing data (id, parentID, requestID).

The built-in tracing feature is implemented as a middleware which wraps the internal methods and can add extra functionality. Here is the source of the tracing middleware.

It wraps the service action & event handlers and generates tracing spans based on the Context. It calls the ctx.startSpan and ctx.finishSpan methods, which are shorthands for the tracer's startSpan and span.finish.

The Tracer can have multiple exporters. One of them is the Datadog exporter, which uses dd-trace to convert the Moleculer traces & spans into Datadog traces & spans.

Developers can create custom spans as well by calling the ctx.startSpan()/ctx.finishSpan() methods, or, if a Context is not available, broker.tracer.startSpan()/span.finish().
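
For example, a custom span in an action handler could look roughly like this (a sketch; the action, method, and tag names are made up):

// Rough sketch of a custom span in an action handler (names are made up)
module.exports = {
    name: "posts",
    actions: {
        async find(ctx) {
            const span = ctx.startSpan("find posts in DB", { tags: { limit: ctx.params.limit } });
            try {
                return await this.findPosts(ctx.params.limit); // hypothetical service method
            } finally {
                ctx.finishSpan(span);
            }
        }
    },
    methods: {
        findPosts(limit) {
            return Promise.resolve([]); // placeholder
        }
    }
};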

When a request goes from NodeA to NodeB via the transporter, the tracing info is carried with the Context to the remote node. Each node collects its spans locally and sends them to the exporters.

Here is a simple generated demo project where you can play with Moleculer: https://github.com/moleculerjs/project-demo

If you need any more info, just ask me.

ngraef commented 4 years ago

Thanks for that great overview, @icebob!

@rochdev Datadog is a crucial part of my team's operations. Any insight you can provide to help us improve the moleculer integration is greatly appreciated.

mailaneel commented 4 years ago

dd-trace now has two methods for this: tracer.scope().activate(span) to set the current span, and tracer.scope().active() to get the current parent span.

https://github.com/DataDog/dd-trace-js/blob/353ec0905b6dcad53aa1a29ead6e6d880f8a4065/packages/dd-trace/src/scope/async_hooks.js

https://github.com/DataDog/dd-trace-js/blob/353ec0905b6dcad53aa1a29ead6e6d880f8a4065/packages/dd-trace/src/scope/base.js
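
Roughly, these map to the two operations asked for above (a small sketch using only the public API, not tied to Moleculer):

// Small sketch of the public scope API (not Moleculer-specific)
const tracer = require("dd-trace").init();
const ddSpan = tracer.startSpan("moleculer.action");

console.log(tracer.scope().active());                 // get the current span (null outside a scope)

tracer.scope().activate(ddSpan, () => {               // set the current span for this callback
    console.log(tracer.scope().active() === ddSpan);  // true
});

ddSpan.finish();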

@icebob

mailaneel commented 4 years ago

Update:

I have tried this by calling activate in the tracing middleware just before invoking the actual handler, and the scope propagates fine. I am not sure yet how to do this generically for all of the exporters.

datadog.js (exporter)

// Stash the dd-trace span and tracer on the Moleculer span's metadata so the
// tracing middleware can activate it around the handler call.
span.meta.datadog = {
    span: ddSpan,
    ddTracer: this.ddTracer,
    asyncId,
    oldSpan
};

tracer.js (middleware)

// Activate the dd-trace span stored by the exporter so that spans created
// inside the handler are parented to it.
span.meta.datadog.ddTracer.scope().activate(span.meta.datadog.span, () => {
    // Call the handler
    return handler(ctx).then(res => {
        const tags = {
            fromCache: ctx.cachedResult
        };

        if (_.isFunction(opts.tags)) {
            const r = opts.tags.call(ctx.service, ctx, res);
            if (r)
                Object.assign(tags, r);

        } else if (_.isPlainObject(opts.tags)) {
            if (opts.tags.response === true)
                tags.response = res != null ? Object.assign({}, res) : res;
            else if (Array.isArray(opts.tags.response))
                tags.response = _.pick(res, opts.tags.response);
        }

        span.addTags(tags);
        ctx.finishSpan(span);

        //ctx.duration = span.duration;

        return res;
    }).catch(err => {
        span.setError(err);
        ctx.finishSpan(span);

        throw err;
    });
})
(screenshot attached)

CC: @icebob

mailaneel commented 4 years ago

There is also an open standard in the works: https://github.com/open-telemetry/opentelemetry-js