mjeffrey18 opened this issue 2 years ago
Interesting. I've seen something like this before, but could never figure out why it was happening; it only happens every so often, not as consistently as you seem to be seeing it. For example, here's what I'm seeing:
I'm not using Lucky, which might be part of it. What is calling Datadog.tracer.trace "http.request"?
I'm not using Lucky, which might be part of it
By this, I mean I may not be running into the same things you are because I'm wiring it up differently. Not that Lucky is the problem. 😄
There is one trick I'm having to do to report the HTTP status code, but it doesn't seem to change anything for the endpoints/websocket channels without Redis or Postgres. Only the endpoints with nested Redis and Postgres spans seem to have the issue.
I implemented the http.request trace like this, using a custom HTTP handler middleware. This is the trace block:
For the websocket trace, I'm just wrapping the subscribe method in a similar way, roughly like the sketch below.
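Something like this (a hedged sketch, not my exact code; the channel class, span name, and stream key are illustrative):

class ChangesChannel < ApplicationCable::Channel
  def subscribed
    # Wrap the subscription work in a span, mirroring the HTTP middleware trace
    Datadog.tracer.trace "websocket.request", type: "web", resource: self.class.name do |span|
      stream_from "changes" # illustrative; the real subscription logic goes here
      span.tags["channel"] = self.class.name
    end
  end
end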
Hmm, that seems pretty similar to what I'm doing:
# Included into individual routes to inspect the path taken during routing,
# including into nested routes.
module Traceable
def trace(*args, **kwargs)
Datadog.tracer.trace(*args, **kwargs) { |span| yield span }
end
# Called in route blocks to override the default "#{request.method} #{request.resource}"
# value set in DatadogHandler
def set_resource(resource : String)
if span = Datadog.tracer.active_span
span.resource = resource
end
end
def set_tag(key : String, value : String)
if span = Datadog.tracer.active_span
span.tags[key] = value
end
end
end
# Added to HTTP::Server middleware
class DatadogHandler
include HTTP::Handler
include Traceable
def call(context)
request = context.request
resource = "#{request.method} #{request.resource}"
tags = Datadog::Span::Metadata{
"http.method" => request.method,
"http.resource" => request.resource,
"http.host" => request.hostname || "",
"http.user_agent" => request.headers["User-Agent"]? || "",
}
trace "http.server.request", type: "http", resource: resource, tags: tags do |span|
call_next context
span.tags["http.status_code"] = context.response.status_code.to_s
span.tags["http.response.content_type"] = context.response.headers["Content-Type"]? || ""
end
end
end
And then in my routes:
struct Inbox
# ...
include Traceable
def initialize(@current_user : User)
end
def call(context)
route context do |r, response, session|
set_resource "Inbox"
# ...
end
end
end
The fact that I'm calling set_resource in my routes may be part of why I'm not seeing this, but I'm not sure.
Thanks for sharing, @jgaskins.
So if I disable all the integrations and also remove the require statements:
# require "../lib/datadog/src/integrations"
# require "../lib/datadog/src/integrations/db"
# require "../lib/datadog/src/integrations/redis"
Everything goes back to normal: the Hydra::Index resource is back, and it also shows up in the tracing lists.
That Hydra guy calls Redis a few times in its lifecycle, but Ping::Index is just application code (a single span). This is why I'm trying to figure out what's going on with the nested spans.
I'll look over your code and see if implementing your changes in mine makes a difference. I'll ping back with my results shortly.
Also, is the way I'm loading the Redis/DB integrations correct?
Hey @jgaskins, unfortunately it still didn't work.
This was my new implementation.
I split the logging and tracing into two handlers, since I noticed another opportunity to inject the trace data into the logs; the Ruby version of this library allows that, so I'm taking pointers from it.
However, it's also not showing the trace data in the logs yet (still investigating).
# Middleware
class AppServer < Lucky::BaseAppServer
# Learn about middleware with HTTP::Handlers:
# https://luckyframework.org/guides/http-and-routing/http-handlers
def middleware : Array(HTTP::Handler)
[
Lucky::ForceSSLHandler.new,
Lucky::RequestIdHandler.new,
Lucky::HttpMethodOverrideHandler.new,
TraceHandler.new,
LogHandler.new, # this is after the TraceHandler so we can inject tracing into logs
Lucky::RemoteIpHandler.new,
Cable::Handler(ApplicationCable::Connection).new,
Honeybadger::Handler.new,
Lucky::ErrorHandler.new(action: Errors::Show),
CORSHandler.new,
Lucky::RouteHandler.new,
Lucky::RouteNotFoundHandler.new,
] of HTTP::Handler
end
def protocol
"http"
end
def listen
# Learn about bind_tcp: https://tinyurl.com/bind-tcp-docs
server.bind_tcp(host, port, reuse_port: false)
server.listen
end
end
# This only takes care of the tracing side, then calls on the logger next
class TraceHandler
include HTTP::Handler
include HandlerHelper
include Traceable
TRACE_KEYS = {
trace_name: "http.server.request",
http_method: "http.method",
http_status_code: "http.status_code",
http_resource: "http.resource",
http_host: "http.host",
http_user_agent: "http.user_agent",
http_request_id: "http.request_id",
http_version: "http.version",
default_status_code: "200",
}
def call(context)
request = context.request
user_agent = user_agent(context)
client_version = client_version(user_agent)
tags = Datadog::Span::Metadata{
TRACE_KEYS[:http_method] => request.method,
TRACE_KEYS[:http_status_code] => TRACE_KEYS[:default_status_code],
TRACE_KEYS[:http_resource] => request.resource,
TRACE_KEYS[:http_host] => request.hostname || "",
TRACE_KEYS[:http_user_agent] => user_agent || "",
TRACE_KEYS[:http_request_id] => context.request_id.to_s,
TRACE_KEYS[:http_version] => request.version || "",
}
# We use the Lucky helper to find the action name,
# so instead of /path we get Ping::Index
handler = Lucky.router.find_action(request)
resource = if handler
handler.payload.to_s
else
request.path
end
# We only know the status code after we have a response,
# so we yield, then update the span tags
trace TRACE_KEYS[:trace_name], type: "web", resource: resource, tags: tags do |span|
call_next(context)
span.tags[TRACE_KEYS[:http_status_code]] = context.response.status_code.to_s
end
end
end
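One thing I may still tweak here: if call_next raises, the status tag never gets set, so wrapping it in begin/ensure would record the status for error responses too. A sketch:

trace TRACE_KEYS[:trace_name], type: "web", resource: resource, tags: tags do |span|
  begin
    call_next(context)
  ensure
    # Record the status even when an exception bubbles up through call_next
    span.tags[TRACE_KEYS[:http_status_code]] = context.response.status_code.to_s
  end
end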
# Log formatter default data
# with the current trace injected into the dd key (same as the ruby implementation)
# However, I didn't manage to get this data flowing...
private def default_data
data = {
"level" => entry.severity.to_s.try(&.upcase),
"logger" => (entry.source.presence || "Lucky").try(&.titleize),
"timestamp" => entry.timestamp,
"ddsource" => ["crystal"],
"dd" => {
"env" => ENV["DD_ENV"]?,
"service" => ENV["DD_SERVICE"]?,
"version" => ENV["DD_VERSION"]?,
"span_id" => Datadog.tracer.active_span.try(&.id).to_s,
"trace_id" => Datadog.tracer.active_span.try(&.trace_id).to_s,
},
}
data["message"] = entry.message unless entry.message.empty?
data
end
In Lucky, each controller action is a class, and actions allow before hooks, so I created a global one using the Traceable module code you shared.
module RequestTracer
include Traceable
macro included
before enrich_tracing
end
def enrich_tracing
set_trace_resource self.class.name.to_s
continue
end
end
Again, if I remove the integrations, all is good, but with Redis/PG I'm back where I started. If I can get the logging part and the resource name right, this would be awesome!
I did some digging on the Datadog dashboard and found a way to expose the top-level trace JSON.
This confirms my suspicion that I'm sending DD the same value for both the operation name and the resource name (e.g. resource showing http.server.request instead of Hydra::Index), which eliminates my doubt that Datadog was getting mixed up by the nesting.
The data looks terrific otherwise :-)
Key things I'm trying to solve:
I added some logging within the shard (DefaultTracer#trace) to monitor each span as it's reported, roughly like the sketch below (the accessor names are my assumptions about the Span type).
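# Rough sketch of the debug output added inside DefaultTracer#trace;
# the accessor names mirror the printed fields and are assumptions about Span
private def debug_span(span)
  puts "TRACE"
  puts "service_name #{span.service}"
  puts "service_type #{span.type}"
  puts "name #{span.name}"
  puts "resource #{span.resource}"
  puts "tags #{span.tags}"
  puts "parent_id #{span.parent_id}"
  puts "trace_id #{span.trace_id}"
  puts "span_id #{span.id}"
  puts "--------------------------------"
end

This is the result without Postgres: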
web | TRACE
web | service_name bonsai-ws
web | service_type web
web | name http.server.request
web | resource Canary::Show
web | tags {"http.method" => "GET", "http.status_code" => "200", "http.resource" => "/canary/2e46708d-38bd-44a8-9bf1-4572ac7792e8", "http.host" => "localhost", "http.user_agent" => "PostmanRuntime/7.29.0", "http.request_id" => "3b02a6b0-bebf-4064-a508-248a1154c77e", "http.version" => "HTTP/1.1", "restaurant_id" => "63c6a71c-9e6e-420e-b290-d8b1b64fb1e5", "client" => "", "client_version" => ""}
web | parent_id 0
web | trace_id 5349861896869271593
web | span_id 5030324430105770589
web | --------------------------------
web | TRACE
web | service_name redis-primary
web | service_type db
web | name query
web | resource get
web | tags {}
web | parent_id 5030324430105770589
web | trace_id 7715555225025501574
web | span_id 8062442734162754503
web | --------------------------------
web | TRACE
web | service_name redis-primary
web | service_type db
web | name query
web | resource hgetall
web | tags {}
web | parent_id 8062442734162754503
web | trace_id 7715555225025501574
web | span_id 1630686651409377103
web | --------------------------------
web | TRACE
web | service_name redis-primary
web | service_type db
web | name query
web | resource publish
web | tags {}
web | parent_id 1630686651409377103
web | trace_id 7715555225025501574
web | span_id 1410864593744009893
web | --------------------------------
web | {"level":"INFO","logger":"Lucky","timestamp":"2022-05-16T14:00:08+04:00","ddsource":["crystal"],"dd":{"env":null,"service":"bonsai-ws","version":"1.2.3","span_id":"","trace_id":""},"duration":0.002599173,"method":"GET","path":"/canary/2e46708d-38bd-44a8-9bf1-4572ac7792e8","request_id":"3b02a6b0-bebf-4064-a508-248a1154c77e","restaurant_id":"63c6a71c-9e6e-420e-b290-d8b1b64fb1e5","status":200,"user_agent":"PostmanRuntime/7.29.0"}
Example with PostgreSQL
Note: the parent_id of the first trace comes from the previous HTTP request's trace.
web | TRACE
web | service_name bonsai-ws
web | service_type web
web | name http.server.request
web | resource Hydra::Index
web | tags {"http.method" => "GET", "http.status_code" => "200", "http.resource" => "/hydra", "http.host" => "localhost", "http.user_agent" => "PostmanRuntime/7.29.0", "http.request_id" => "ae589e9f-6660-475c-92b0-f56655977fd1", "http.version" => "HTTP/1.1", "restaurant_id" => "63c6a71c-9e6e-420e-b290-d8b1b64fb1e5", "client" => "", "client_version" => ""}
web | parent_id 7074771027912256803
web | trace_id 4960561266873337330
web | span_id 3416511735297102346
web | --------------------------------
web | TRACE
web | service_name redis-primary
web | service_type db
web | name query
web | resource get
web | tags {}
web | parent_id 3416511735297102346
web | trace_id 4960561266873337330
web | span_id 7633164790659448547
web | --------------------------------
web | TRACE
web | service_name bonsai-primary
web | service_type db
web | name db.query
web | resource SELECT 1 AS one FROM guests WHERE guests.id = $1 LIMIT 1
web | tags {}
web | parent_id 7633164790659448547
web | trace_id 4960561266873337330
web | span_id 9057344760719510558
web | --------------------------------
web | TRACE
web | service_name redis-primary
web | service_type db
web | name query
web | resource hgetall
web | tags {}
web | parent_id 9057344760719510558
web | trace_id 4960561266873337330
web | span_id 2829677955661041065
web | --------------------------------
web | {"level":"INFO","logger":"Lucky","timestamp":"2022-05-16T14:06:34+04:00","ddsource":["crystal"],"dd":{"env":null,"service":"bonsai-ws","version":"1.2.3","span_id":"","trace_id":""},"duration":0.056767268,"method":"GET","path":"/hydra","request_id":"ae589e9f-6660-475c-92b0-f56655977fd1","restaurant_id":"63c6a71c-9e6e-420e-b290-d8b1b64fb1e5","status":200,"user_agent":"PostmanRuntime/7.29.0"}
I then hit the ping endpoint
web | TRACE
web | service_name bonsai-ws
web | service_type web
web | name http.server.request
web | resource Ping::Index
web | tags {"http.method" => "GET", "http.status_code" => "200", "http.resource" => "/ping", "http.host" => "localhost", "http.user_agent" => "PostmanRuntime/7.29.0", "http.request_id" => "228b5a8b-f4f8-454d-9e12-0869d88c4926", "http.version" => "HTTP/1.1", "restaurant_id" => "", "client" => "", "client_version" => ""}
web | parent_id 3798277196643703492
web | trace_id 4960561266873337330
web | span_id 6260461103217596780
web | --------------------------------
web | {"level":"INFO","logger":"Lucky","timestamp":"2022-05-16T14:11:39+04:00","ddsource":["crystal"],"dd":{"env":null,"service":"bonsai-ws","version":"1.2.3","span_id":"","trace_id":""},"duration":3.5227e-5,"method":"GET","path":"/ping","request_id":"228b5a8b-f4f8-454d-9e12-0869d88c4926","status":200,"user_agent":"PostmanRuntime/7.29.0"}
The leaked parent_id shows up on this endpoint too.
I then updated the helper to wipe the fiber data after each request:
module Traceable
def trace(*args, **kwargs)
Datadog.tracer.trace(*args, **kwargs) { |span| yield span }
end
# Called in route blocks to override the default resource value set in TraceHandler
def set_trace_resource(resource : String)
if span = Datadog.tracer.active_span
span.resource = resource
end
end
def set_trace_tag(key : String, value : String)
if span = Datadog.tracer.active_span
span.tags[key] = value
end
end
def reset_trace!
Fiber.current.current_datadog_span = nil
Fiber.current.current_datadog_trace = nil
end
end
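For context, the shard stores the active trace state on the current fiber; presumably it reopens Fiber along these lines (a sketch only; the types are my assumptions):

# Sketch: fiber-local storage the shard presumably defines
# (the property names come from reset_trace! above; types are assumptions)
class Fiber
  property current_datadog_span : Datadog::Span?
  property current_datadog_trace : Datadog::Trace?
end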
Then I added an after hook for each request to clear the data
module RequestTracer
include Traceable
macro included
before enrich_tracing
after reset_tracing # this clears out the Fiber.current trace data
end
def enrich_tracing
set_trace_resource self.class.name.to_s
continue
end
def reset_tracing
reset_trace!
continue
end
end
This worked!! Well, it worked in the sense that the parent_id of the subsequent request is back to 0.
But... The Datadog resource name is still showing the operation name haha
The worst part of this weird situation is that on Datadog the logs are showing the correct resource, Hydra::Index.
But I guess that means we have the correct resource somewhere along the way.
Regarding the logs, I just realised logging is done on a different Fiber... so I'll have to figure out a cleaner way of injecting the trace data into the logs.
Still debugging... But I know there is something in the code causing this now.
Got it working!!
This guy was the culprit
Inside the Redis integration, the pipeline class is monkey-patched to gather all the commands:
class Pipeline
@commands = [] of String
def run(command)
@commands << command.first
previous_def
end
def commit
result = previous_def
# Set the resource to the list of commands run inside the pipeline
if span = Datadog.tracer.active_span
span.resource = @commands.join(", ")
end
result
end
end
I had to change the logic to ONLY override the resource if the current span matches the Redis service:
def commit
result = previous_def
# Set the resource to the list of commands run inside the pipeline
if span = Datadog.tracer.active_span
span.resource = @commands.join(", ") if span.service == "redis-primary"
end
result
end
Note: after this change, I was also able to remove all the set_resource code.
I also found out that the finish_span logic takes care of the parent_id issue I thought was there, so all good.
I'm now getting these lovely results for both the HTTP server and the Websocket subscriptions.
I'll submit a PR for the following changes if you agree:
- Only overriding the pipeline resource when the active span belongs to the Redis service: span.resource = @commands.join(", ") if span.service == "redis-primary"
- Revisiting the db.query and query span names (if you have a Ruby microservice on the same Datadog cluster, those spans end up named differently between the Crystal and Ruby tracers).
I suppose we may need to create a different Redis adapter for the other Redis shard too; I can have a look at that later. Having a solid APM package is a big reason developers are drawn to a language (IMO), so I would love to get this shard mainstream and drive more adoption of Crystal.
Regarding the Crystal logs being on different Fibers, I was trying to find something in the API docs, but I couldn't figure it out. Do you have any experience with logging that could point me in the right direction?
There is also one final consideration for another day: RUM -> Trace. This can be solved in application code, since it's a matter of reading the headers and feeding those trace IDs etc. into the trace block; the extraction side would look roughly like the sketch below.
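The RUM SDK propagates its context via the standard Datadog headers; how to hand the IDs to the shard's trace block depends on its API, so this hypothetical helper only covers the extraction step:

# Hypothetical helper: pull the RUM trace context from the standard
# Datadog propagation headers on the incoming request
def rum_trace_context(request : HTTP::Request)
  {
    trace_id:  request.headers["x-datadog-trace-id"]?.try(&.to_u64?),
    parent_id: request.headers["x-datadog-parent-id"]?.try(&.to_u64?),
  }
end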
I also wanted your thoughts on shipping a generic HTTP::Handler as part of the shard? If you like that idea, I can look into putting it into this PR or another PR down the road.
OK, I found a possible solution for the logs: Crystal's Log has a context API, so if you set it inside the trace block, the logs generated within that block will include the trace and span IDs.
module Traceable
def trace(*args, **kwargs)
Datadog.tracer.trace(*args, **kwargs) do |span|
Lucky::Log.dexter.log.context = Log::Metadata.build({trace_id: span.trace_id.to_s, span_id: span.id.to_s})
yield span
Lucky::Log.dexter.log.context = Log::Metadata.build({trace_id: "", span_id: ""})
end
end
end
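One caveat with the version above: assigning empty strings afterwards clobbers any other context, and it never runs if the block raises. Crystal's stdlib Log.with_context snapshots and restores the context automatically, so an alternative sketch (untested) would be:

module Traceable
  def trace(*args, **kwargs)
    Datadog.tracer.trace(*args, **kwargs) do |span|
      # with_context restores the previous logging context after the block,
      # even if an exception is raised inside it
      Log.with_context do
        Log.context.set(trace_id: span.trace_id.to_s, span_id: span.id.to_s)
        yield span
      end
    end
  end
end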
I'm using this Lucky JSON formatter, but it can be adapted for any formatter.
# Monkey patch to flatten JSON key/values
# lib/dexter/src/dexter/json_log_formatter.cr
module Dexter
struct JSONLogFormatter < BaseFormatter
def call
context_data = entry.context.to_h
trace_id = context_data.delete(:trace_id)
span_id = context_data.delete(:span_id)
local_data = context_data.delete(:local).try(&.as_h) || context_data.delete("local").try(&.as_h)
data = default_data
data = data.merge(local_data) if local_data
data = data.merge(context_data)
data = data.merge(datadog_data(span_id, trace_id))
exception_data.try do |exception_data_|
data = data.merge(exception_data_)
end
data
.compact
.to_json(io)
end
private def default_data
data = {
"level" => entry.severity.to_s.try(&.upcase),
"logger" => (entry.source.presence || "Lucky").try(&.titleize),
"timestamp" => entry.timestamp,
}
data["message"] = entry.message unless entry.message.empty?
data
end
private def datadog_data(span_id, trace_id)
{
"ddsource" => ["crystal"],
"dd" => {
"env" => ENV["DD_ENV"]?,
"service" => ENV["DD_SERVICE"]?,
"version" => ENV["DD_VERSION"]?,
"span_id" => span_id,
"trace_id" => trace_id,
},
}
end
end
end
Results
{
"dd": {
"env": "staging",
"service": "bonsai-ws",
"span_id": "2866245041478857817",
"trace_id": "2815534784969904792",
"version": "1.2.3"
},
"ddsource": [
"crystal"
],
"duration": 0.00169134,
"level": "INFO",
"logger": "Lucky",
"method": "GET",
"path": "/hydra",
"request_id": "51925971-4944-4a81-b259-beeac5626f70",
"status": 200,
"timestamp": "2022-05-16T18:55:11+04:00",
"user_agent": "PostmanRuntime/7.29.0"
}
Hey @jgaskins, thanks for this awesome shard!
I'm having a few issues putting the data together on DD. Namely, the primary span.resource keeps showing up in the Redis/PG integrations, and the main web resource is showing their spans. Here is my service layout:
bonsai-ws has the following two main traces. Everything works fine when I don't have the integrations set up (well, except that I have no data from Redis/Postgres).
But generally, I'm seeing traces from both of the trace names, all split into controllers or channels.
Like this example:
Here is how I have the setup at the moment:
With the integrations live
Notice how some of the traces no longer have a resource name; instead they use the trace name. This seems to happen only when a trace has sub-queries, i.e. Redis or Postgres spans as children.
Websocket example
Now that the integrations are live, the resource that previously showed as ChangesChannel has been replaced by the trace name, since it has some sub-spans. The Redis service is fine.
This leads me to believe there is something going on with the parent/active spans; perhaps the Redis/Postgres spans are overriding the resource name of the main span.
However, I couldn't find anything obvious in the code.
Have you run into this issue, or do I have the wrong setup?