jgaskins / datadog

Datadog client for APM tracing and metrics in Crystal
MIT License

Leaking/Masking Datadog resource names #1

Open mjeffrey18 opened 2 years ago

mjeffrey18 commented 2 years ago

hey @jgaskins, thanks for this awesome shard!

I'm having a few issues putting the data together in DD. Namely, the primary span.resource keeps showing up in the Redis/PG integrations, and the main web resource is also showing up on their spans.

Here is my service layout

bonsai-ws has the following 2 main traces

Everything works fine when I don't have the integrations set up (well, except for the fact that I have no data from Redis/Postgres).

But generally, I'm seeing traces from both of the trace names, all split into controllers or channels.

Like this example;

Screenshot 2022-05-14 at 20 53 40

Here is how I have the setup at the moment;

# MonkeyPatch to report errors

require "../lib/datadog/src/integrations"
require "../lib/datadog/src/integrations/db"
require "../lib/datadog/src/integrations/redis"

module Datadog
  class Span
    Log = Lucky::Log.dexter
  end

  class DefaultTracer
    def report(ex)
      ExceptionService.notify(ex)
    end
  end
end

Datadog.configure do |c|
  c.service ENV["DD_SERVICE"], type: "web"

  # Define tags you want on every span
  c.tags = Datadog::Span::Metadata{
    "env"      => ENV["DD_ENV"]? || ENV["APP_NAME"],
    "version"  => ENV["DD_VERSION"],
    "pod_name" => ENV["HOSTNAME"], # example "bonsai-terminal-8486d85df7-bd6dx"
  }

  c.agent_host = ENV["DD_AGENT_HOST"]
  # The shard seems to have all the defaults we need
  # c.trace_agent_port = ENV["DD_TRACE_AGENT_PORT"]
  # c.metrics_agent_port = ENV["DD_METRICS_AGENT_PORT"]
end

bonsai_primary = Datadog::Integrations::DB.new(
  service_name: "bonsai-primary",
  host: AppDatabase.settings.credentials.hostname,
  db: AppDatabase.settings.credentials.database
)

bonsai_sync = Datadog::Integrations::DB.new(
  service_name: "bonsai-sync",
  host: SyncDatabase.settings.credentials.hostname,
  db: SyncDatabase.settings.credentials.database
)

redis_primary_with_pool = Datadog::Integrations::Redis.new(
  service_name: "redis-primary",
  uri: URI.parse(Config.settings.redis_url + Config.settings.redis_pooling_query)
)

redis_primary = Datadog::Integrations::Redis.new(
  service_name: "redis-primary",
  uri: URI.parse(Config.settings.redis_url)
)

Datadog.configure do |c|
  c.use bonsai_primary
  c.use bonsai_sync
  c.use redis_primary
  c.use redis_primary_with_pool
end

Note: I put those integrations in a separate block as it was even worse before when I had them in the same block for some reason.

With the integrations live

Notice how some of the traces now don't have a resource name; instead they are using the trace name. This seems to happen only when a trace has sub-queries, i.e. when Redis or Postgres spans are being traced as children.

Screenshot 2022-05-14 at 20 19 35 Screenshot 2022-05-14 at 20 32 21

Websocket example

Now that the integrations are live, the resource previously called ChangesChannel has been replaced by the trace name, since it has some sub-spans.

Screenshot 2022-05-14 at 20 34 48 Screenshot 2022-05-14 at 20 34 59 Screenshot 2022-05-14 at 20 35 15

Redis service is fine

Screenshot 2022-05-14 at 20 33 43

This leads me to believe there is something going on with the parent/active spans; perhaps the redis/postgres spans are overriding the resource name of the main span.

However, I couldn't find anything obvious in the code.

Have you run into this issue, or do I have the wrong setup?

jgaskins commented 2 years ago

Interesting. I've seen something like this before, but could never figure out why it was happening and it only happens every so often, not as consistently as you seem to be seeing it. For example, here's what I'm seeing:

Screenshot of Datadog trace list for my web service

I'm not using Lucky, which might be part of it. What is calling Datadog.tracer.trace "http.request"?

jgaskins commented 2 years ago

I'm not using Lucky, which might be part of it

By this, I mean I may not be running into the same things you are because I'm wiring it up differently. Not that Lucky is the problem. 😄

mjeffrey18 commented 2 years ago

There is one trick I'm having to do to report the HTTP status code, but it doesn't seem to change anything for the endpoints/websocket channels without Redis or Postgres. Only the endpoints with Redis and Postgres spans nested inside seem to have the issue.

I implemented the http.request trace like this, using a custom http handler middleware

image

This is the trace block

image

For the websocket trace, I'm just wrapping the subscribe method in a similar way

image
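Roughly, the shape of that wrapping (a sketch only; the channel class, trace name, and tag here are illustrative, not the exact code from the screenshot):

class ChangesChannel < ApplicationCable::Channel
  def subscribed
    # Sketch: "websocket.request" and the "ws.channel" tag are illustrative names.
    Datadog.tracer.trace "websocket.request", type: "web", resource: self.class.name do |span|
      span.tags["ws.channel"] = self.class.name
      # ... original subscription logic ...
    end
  end
end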

jgaskins commented 2 years ago

Hmm, that seems pretty similar to what I'm doing:

# Included into individual routes to inspect the path taken during routing,
# including into nested routes.
module Traceable
  def trace(*args, **kwargs)
    Datadog.tracer.trace(*args, **kwargs) { |span| yield span }
  end

  # Called in route blocks to override the default "#{request.method} #{request.resource}"
  # value set in DatadogHandler
  def set_resource(resource : String)
    if span = Datadog.tracer.active_span
      span.resource = resource
    end
  end

  def set_tag(key : String, value : String)
    if span = Datadog.tracer.active_span
      span.tags[key] = value
    end
  end
end

# Added to HTTP::Server middleware
class DatadogHandler
  include HTTP::Handler
  include Traceable

  def call(context)
    request = context.request

    resource = "#{request.method} #{request.resource}"
    tags = Datadog::Span::Metadata {
      "http.method" => request.method,
      "http.resource" => request.resource,
      "http.host" => request.hostname || "",
      "http.user_agent" => request.headers["User-Agent"]? || "",
    }

    trace "http.server.request", type: "http", resource: resource, tags: tags do |span|
      call_next context
      span.tags["http.status_code"] = context.response.status_code.to_s
      span.tags["http.response.content_type"] = context.response.headers["Content-Type"]? || ""
    end
  end
end

And then in my routes:

struct Inbox
  # ...
  include Traceable

  def initialize(@current_user : User)
  end

  def call(context)
    route context do |r, response, session|
      set_resource "Inbox"
      # ...
    end
  end
end

The fact that I'm calling set_resource in my routes may be part of why I'm not seeing this, but I'm not sure.

mjeffrey18 commented 2 years ago

thanks for sharing @jgaskins

So if I disable all the integrations and also remove the require statements

# require "../lib/datadog/src/integrations"
# require "../lib/datadog/src/integrations/db"
# require "../lib/datadog/src/integrations/redis"

Everything goes back to normal;

Screenshot 2022-05-15 at 14 59 22

The Hydra::Index is back

And it's also shown in the trace list

Screenshot 2022-05-15 at 14 59 03

That Hydra guy calls Redis a few times in its lifecycle, but Ping::Index is just application code (a single span). This is why I'm trying to figure out what nested-span issue is going on.

I'll check over your work and see whether implementing your changes in mine makes a difference. I'll ping back my results shortly.

Also, is the way I'm loading the Redis/DB integrations correct?

mjeffrey18 commented 2 years ago

Hey @jgaskins, it still didn't work, unfortunately.

This was my new implementation.

I split the logging and tracing into two handlers, since I noticed another opportunity to inject the trace data into the logs; the Ruby version of this allows it, so I'm taking pointers from that.

However, it's also not showing the trace data in the logs yet (still investigating).

Screenshot 2022-05-16 at 11 47 44
# Middleware

class AppServer < Lucky::BaseAppServer
  # Learn about middleware with HTTP::Handlers:
  # https://luckyframework.org/guides/http-and-routing/http-handlers
  def middleware : Array(HTTP::Handler)
    [
      Lucky::ForceSSLHandler.new,
      Lucky::RequestIdHandler.new,
      Lucky::HttpMethodOverrideHandler.new,
      TraceHandler.new,
      LogHandler.new, # this is after the TraceHandler so we can inject tracing into logs
      Lucky::RemoteIpHandler.new,
      Cable::Handler(ApplicationCable::Connection).new,
      Honeybadger::Handler.new,
      Lucky::ErrorHandler.new(action: Errors::Show),
      CORSHandler.new,
      Lucky::RouteHandler.new,
      Lucky::RouteNotFoundHandler.new,
    ] of HTTP::Handler
  end

  def protocol
    "http"
  end

  def listen
    # Learn about bind_tcp: https://tinyurl.com/bind-tcp-docs
    server.bind_tcp(host, port, reuse_port: false)
    server.listen
  end
end

# This only takes care of the tracing side, then calls on the logger next

class TraceHandler
  include HTTP::Handler
  include HandlerHelper
  include Traceable

  TRACE_KEYS = {
    trace_name:          "http.server.request",
    http_method:         "http.method",
    http_status_code:    "http.status_code",
    http_resource:       "http.resource",
    http_host:           "http.host",
    http_user_agent:     "http.user_agent",
    http_request_id:     "http.request_id",
    http_version:        "http.version",
    default_status_code: "200",
  }

  def call(context)
    request = context.request
    user_agent = user_agent(context)
    client_version = client_version(user_agent)
    tags = Datadog::Span::Metadata{
      TRACE_KEYS[:http_method]      => request.method,
      TRACE_KEYS[:http_status_code] => TRACE_KEYS[:default_status_code],
      TRACE_KEYS[:http_resource]    => request.resource,
      TRACE_KEYS[:http_host]        => request.hostname || "",
      TRACE_KEYS[:http_user_agent]  => user_agent || "",
      TRACE_KEYS[:http_request_id]  => context.request_id.to_s,
      TRACE_KEYS[:http_version]     => request.version || "",
    }

    # We use the lucky helper to find the action name
    # instead of /path we get Ping::Index
    handler = Lucky.router.find_action(request)
    resource = if handler
                 handler.payload.to_s
               else
                 request.path
               end

    # we only know the status code after we have a response
    # so we yield
    # then update the span tags
    trace TRACE_KEYS[:trace_name], type: "web", resource: resource, tags: tags do |span|
      call_next(context)
      span.tags[TRACE_KEYS[:http_status_code]] = context.response.status_code.to_s
    end
  end
end
# Log formatter default data
# with the current trace injected into the dd key (same as the ruby implementation)

# However, I didn't manage to get this data flowing...

private def default_data
  data = {
    "level"     => entry.severity.to_s.try(&.upcase),
    "logger"    => (entry.source.presence || "Lucky").try(&.titleize),
    "timestamp" => entry.timestamp,
    "ddsource"  => ["crystal"],
    "dd"        => {
      "env"      => ENV["DD_ENV"]?,
      "service"  => ENV["DD_SERVICE"]?,
      "version"  => ENV["DD_VERSION"]?,
      "span_id"  => Datadog.tracer.active_span.try(&.id).to_s,
      "trace_id" => Datadog.tracer.active_span.try(&.trace_id).to_s,
    },
  }
  data["message"] = entry.message unless entry.message.empty?
  data
end
# In Lucky, each controller action is a class and they allow before hooks, so I created a global one with the Traceable module code you shared.

module RequestTracer
  include Traceable

  macro included
    before enrich_tracing
  end

  def enrich_tracing
    set_trace_resource self.class.name.to_s
    continue
  end
end

Again, if I remove the loaded integrations, all is good, but with Redis/PG I'm back where I started. Still, if I can get this logging part and the resource_name right, this would be awesome!

I did some digging on the Datadog dashboard and found a way to expose the trace's top-level JSON;

Screenshot 2022-05-16 at 11 25 03

This confirms my suspicion that I'm sending DD the operation and resource name as the same value, which eliminates the doubt I had that Datadog was getting mixed up with the nesting.

The data looks terrific otherwise :-)

Screenshot 2022-05-16 at 11 25 34

Key things I'm trying to solve;

  1. resource.name is being set as operation_name when there is a nested trace
  2. Tracing is not showing in logging (I compact the key/values)
mjeffrey18 commented 2 years ago

I added some logging within the shard (DefaultTracer#trace) to monitor the requests; a rough sketch of that dump is below, followed by the result without Postgres.
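A minimal sketch of the kind of span dump added there (the accessors used, e.g. span.service, span.parent_id, span.id, are assumptions rather than the shard's confirmed API):

# Sketch only: the accessor names are assumed and may not match the shard's Span exactly.
private def dump_span(span)
  puts "TRACE"
  puts "service_name #{span.service}"
  puts "service_type #{span.type}"
  puts "name #{span.name}"
  puts "resource #{span.resource}"
  puts "tags #{span.tags}"
  puts "parent_id #{span.parent_id}"
  puts "trace_id #{span.trace_id}"
  puts "span_id #{span.id}"
  puts "--------------------------------"
end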

web          | TRACE
web          | service_name bonsai-ws
web          | service_type web
web          | name http.server.request
web          | resource Canary::Show
web          | tags {"http.method" => "GET", "http.status_code" => "200", "http.resource" => "/canary/2e46708d-38bd-44a8-9bf1-4572ac7792e8", "http.host" => "localhost", "http.user_agent" => "PostmanRuntime/7.29.0", "http.request_id" => "3b02a6b0-bebf-4064-a508-248a1154c77e", "http.version" => "HTTP/1.1", "restaurant_id" => "63c6a71c-9e6e-420e-b290-d8b1b64fb1e5", "client" => "", "client_version" => ""}
web          | parent_id 0
web          | trace_id 5349861896869271593
web          | span_id 5030324430105770589
web          | --------------------------------
web          | TRACE
web          | service_name redis-primary
web          | service_type db
web          | name query
web          | resource get
web          | tags {}
web          | parent_id 5030324430105770589
web          | trace_id 7715555225025501574
web          | span_id 8062442734162754503
web          | --------------------------------
web          | TRACE
web          | service_name redis-primary
web          | service_type db
web          | name query
web          | resource hgetall
web          | tags {}
web          | parent_id 8062442734162754503
web          | trace_id 7715555225025501574
web          | span_id 1630686651409377103
web          | --------------------------------
web          | TRACE
web          | service_name redis-primary
web          | service_type db
web          | name query
web          | resource publish
web          | tags {}
web          | parent_id 1630686651409377103
web          | trace_id 7715555225025501574
web          | span_id 1410864593744009893
web          | --------------------------------
web          | {"level":"INFO","logger":"Lucky","timestamp":"2022-05-16T14:00:08+04:00","ddsource":["crystal"],"dd":{"env":null,"service":"bonsai-ws","version":"1.2.3","span_id":"","trace_id":""},"duration":0.002599173,"method":"GET","path":"/canary/2e46708d-38bd-44a8-9bf1-4572ac7792e8","request_id":"3b02a6b0-bebf-4064-a508-248a1154c77e","restaurant_id":"63c6a71c-9e6e-420e-b290-d8b1b64fb1e5","status":200,"user_agent":"PostmanRuntime/7.29.0"}

Example with PostgreSQL

Note: the parent_id of the first trace here comes from the previous HTTP request's trace.

web          | TRACE
web          | service_name bonsai-ws
web          | service_type web
web          | name http.server.request
web          | resource Hydra::Index
web          | tags {"http.method" => "GET", "http.status_code" => "200", "http.resource" => "/hydra", "http.host" => "localhost", "http.user_agent" => "PostmanRuntime/7.29.0", "http.request_id" => "ae589e9f-6660-475c-92b0-f56655977fd1", "http.version" => "HTTP/1.1", "restaurant_id" => "63c6a71c-9e6e-420e-b290-d8b1b64fb1e5", "client" => "", "client_version" => ""}
web          | parent_id 7074771027912256803
web          | trace_id 4960561266873337330
web          | span_id 3416511735297102346
web          | --------------------------------
web          | TRACE
web          | service_name redis-primary
web          | service_type db
web          | name query
web          | resource get
web          | tags {}
web          | parent_id 3416511735297102346
web          | trace_id 4960561266873337330
web          | span_id 7633164790659448547
web          | --------------------------------
web          | TRACE
web          | service_name bonsai-primary
web          | service_type db
web          | name db.query
web          | resource SELECT 1 AS one FROM guests WHERE guests.id = $1 LIMIT 1
web          | tags {}
web          | parent_id 7633164790659448547
web          | trace_id 4960561266873337330
web          | span_id 9057344760719510558
web          | --------------------------------
web          | TRACE
web          | service_name redis-primary
web          | service_type db
web          | name query
web          | resource hgetall
web          | tags {}
web          | parent_id 9057344760719510558
web          | trace_id 4960561266873337330
web          | span_id 2829677955661041065
web          | --------------------------------
web          | {"level":"INFO","logger":"Lucky","timestamp":"2022-05-16T14:06:34+04:00","ddsource":["crystal"],"dd":{"env":null,"service":"bonsai-ws","version":"1.2.3","span_id":"","trace_id":""},"duration":0.056767268,"method":"GET","path":"/hydra","request_id":"ae589e9f-6660-475c-92b0-f56655977fd1","restaurant_id":"63c6a71c-9e6e-420e-b290-d8b1b64fb1e5","status":200,"user_agent":"PostmanRuntime/7.29.0"}

I then hit the ping endpoint

web          | TRACE
web          | service_name bonsai-ws
web          | service_type web
web          | name http.server.request
web          | resource Ping::Index
web          | tags {"http.method" => "GET", "http.status_code" => "200", "http.resource" => "/ping", "http.host" => "localhost", "http.user_agent" => "PostmanRuntime/7.29.0", "http.request_id" => "228b5a8b-f4f8-454d-9e12-0869d88c4926", "http.version" => "HTTP/1.1", "restaurant_id" => "", "client" => "", "client_version" => ""}
web          | parent_id 3798277196643703492
web          | trace_id 4960561266873337330
web          | span_id 6260461103217596780
web          | --------------------------------
web          | {"level":"INFO","logger":"Lucky","timestamp":"2022-05-16T14:11:39+04:00","ddsource":["crystal"],"dd":{"env":null,"service":"bonsai-ws","version":"1.2.3","span_id":"","trace_id":""},"duration":3.5227e-5,"method":"GET","path":"/ping","request_id":"228b5a8b-f4f8-454d-9e12-0869d88c4926","status":200,"user_agent":"PostmanRuntime/7.29.0"}

The leaked parent_id also shows up on this endpoint.

I then updated the helper to wipe the fiber data after each request

module Traceable
  def trace(*args, **kwargs)
    Datadog.tracer.trace(*args, **kwargs) { |span| yield span }
  end

  # value set in TrackingHandler
  def set_trace_resource(resource : String)
    if span = Datadog.tracer.active_span
      span.resource = resource
    end
  end

  def set_trace_tag(key : String, value : String)
    if span = Datadog.tracer.active_span
      span.tags[key] = value
    end
  end

  def reset_trace!
    Fiber.current.current_datadog_span = nil
    Fiber.current.current_datadog_trace = nil
  end
end

Then I added an after hook for each request to clear the data

module RequestTracer
  include Traceable

  macro included
    before enrich_tracing
    after reset_tracing # this clears out the Fiber.current trace data
  end

  def enrich_tracing
    set_trace_resource self.class.name.to_s
    continue
  end

  def reset_tracing
    reset_trace!
    continue
  end
end

This worked!! Well, it worked in the sense that the parent_id of the subsequent request is back to 0. But... the Datadog resource name is still showing the operation name, haha. The worst part of this weird situation is that on Datadog the logs are showing the correct resource, Hydra::Index. But I guess that means we have the correct resource somewhere.

Screenshot 2022-05-16 at 15 04 55

Regarding the logs, I just realised logging is done on a different Fiber... so I will have to figure out a cleaner way of injecting the trace data into the logs.

Still debugging... But I know there is something in the code causing this now.

mjeffrey18 commented 2 years ago

Got it working!!

This guy was the culprit

Inside the Redis integration, the pipeline command was monkey patched to gather all the commands.

  class Pipeline
    @commands = [] of String

    def run(command)
      @commands << command.first

      previous_def
    end

    def commit
      result = previous_def

      # Set the resource to the list of commands run inside the pipeline
      if span = Datadog.tracer.active_span
        span.resource = @commands.join(", ")
      end

      result
    end
  end

I had to change the logic to ONLY override the resource if the current span matched the Redis service

def commit
  result = previous_def

  # Set the resource to the list of commands run inside the pipeline
  if span = Datadog.tracer.active_span
    span.resource = @commands.join(", ") if span.service == "redis-primary"
  end

  result
end

Note: after this change, I was able to remove all the set_resource code also

I also found out that the finish_span logic takes care of the parent_id issue I thought was there, so all good.

I'm now getting these lovely results for both the HTTP server and the Websocket subscriptions.

Screenshot 2022-05-16 at 16 47 03 Screenshot 2022-05-16 at 16 45 21

I'll submit a PR for the following changes if you agree;

  1. Add documentation on how to load the integrations
  2. Add a way to dynamically solve this issue: span.resource = @commands.join(", ") if span.service == "redis-primary" (see the sketch after this list)
  3. Add a way to override the integrations' resource names db.query and query (if you have a Ruby microservice on the same Datadog cluster, those spans are named differently)
  4. Documentation on the Log <> Tracing (pending finding a solution)
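For item 2, one way "dynamically" could look is to compare against the integration's own service instead of a hard-coded name. This is a sketch only; the @service_name ivar is hypothetical and assumes the integration can hand its service name to the Pipeline patch.

def commit
  result = previous_def

  # Sketch only: @service_name is hypothetical; the point is to avoid the
  # hard-coded "redis-primary" and only touch the resource when the active
  # span actually belongs to this Redis integration.
  if (span = Datadog.tracer.active_span) && span.service == @service_name
    span.resource = @commands.join(", ")
  end

  result
end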

I suppose we may need to create a different Redis adapter for the other Redis shard as well; I can maybe have a look at that later. Having a solid APM package is a big reason developers are drawn to a language (IMO), so I would love to get this shard mainstream to drive more adoption of Crystal.

Regarding the Crystal logs being on different Fibers, I was trying to find something in the API docs, but I couldn't figure it out. Do you have any experience with logging that could point me in the right direction?

There is also one final consideration for another day: RUM -> Trace. This can be easily solved in application code, as it's a matter of reading the headers and feeding those trace_ids etc. into the trace block. I suppose I wanted your thoughts on having a generic HTTP::Handler as part of the shard? If you like that idea, I can also look into putting this into the PR, or into another PR down the road.
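For context, a rough sketch of what such a handler could look like, assuming Datadog's usual propagation headers (x-datadog-trace-id / x-datadog-parent-id). How the ids get handed to the span depends on what the shard exposes, so the trace_id:/parent_id: arguments in the commented-out call are hypothetical.

require "http/server/handler"

# Sketch only: header names follow Datadog's propagation convention; the
# commented-out trace_id:/parent_id: arguments are hypothetical until the
# shard supports accepting explicit ids.
class RumTraceHandler
  include HTTP::Handler

  def call(context)
    headers = context.request.headers
    rum_trace_id  = headers["x-datadog-trace-id"]?.try(&.to_u64?)
    rum_parent_id = headers["x-datadog-parent-id"]?.try(&.to_u64?)

    # Hypothetical wiring, if the shard's trace API accepted explicit ids:
    # Datadog.tracer.trace "http.server.request", trace_id: rum_trace_id, parent_id: rum_parent_id do |span|
    #   call_next context
    # end

    call_next context
  end
end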

mjeffrey18 commented 2 years ago

OK, I found a possible solution for the logs: there is a context API on the logger, so if you set it inside the trace block, the logs generated within that block will have the trace and span ids.

module Traceable
  def trace(*args, **kwargs)
    Datadog.tracer.trace(*args, **kwargs) do |span|
      Lucky::Log.dexter.log.context = Log::Metadata.build({trace_id: span.trace_id.to_s, span_id: span.id.to_s})
      yield span
      Lucky::Log.dexter.log.context = Log::Metadata.build({trace_id: "", span_id: ""})
    end
  end
end

I'm using this Lucky JSON formatter, but it can be adapted for any formatter

# Monkey patch to flatten JSON key/values
# lib/dexter/src/dexter/json_log_formatter.cr
module Dexter
  struct JSONLogFormatter < BaseFormatter
    def call
      context_data = entry.context.to_h
      trace_id = context_data.delete(:trace_id)
      span_id = context_data.delete(:span_id)
      local_data = context_data.delete(:local).try(&.as_h) || context_data.delete("local").try(&.as_h)

      data = default_data
      data = data.merge(local_data) if local_data
      data = data.merge(context_data)
      data = data.merge(datadog_data(span_id, trace_id))

      exception_data.try do |exception_data_|
        data = data.merge(exception_data_)
      end

      data
        .compact
        .to_json(io)
    end

    private def default_data
      data = {
        "level"     => entry.severity.to_s.try(&.upcase),
        "logger"    => (entry.source.presence || "Lucky").try(&.titleize),
        "timestamp" => entry.timestamp,
      }
      data["message"] = entry.message unless entry.message.empty?
      data
    end

    private def datadog_data(span_id, trace_id)
      {
        "ddsource" => ["crystal"],
        "dd"       => {
          "env"      => ENV["DD_ENV"]?,
          "service"  => ENV["DD_SERVICE"]?,
          "version"  => ENV["DD_VERSION"]?,
          "span_id"  => span_id,
          "trace_id" => trace_id,
        },
      }
    end
  end
end

Results

{
  "dd": {
    "env": "staging",
    "service": "bonsai-ws",
    "span_id": "2866245041478857817",
    "trace_id": "2815534784969904792",
    "version": "1.2.3"
  },
  "ddsource": [
    "crystal"
  ],
  "duration": 0.00169134,
  "level": "INFO",
  "logger": "Lucky",
  "method": "GET",
  "path": "/hydra",
  "request_id": "51925971-4944-4a81-b259-beeac5626f70",
  "status": 200,
  "timestamp": "2022-05-16T18:55:11+04:00",
  "user_agent": "PostmanRuntime/7.29.0"
}