spandex-project / spandex_phoenix

Phoenix Instrumentation tracer
MIT License
82 stars 29 forks source link

Slightly different traces cause Datadog to ignore errors Phoenix 1.5 + Telemetry #46

Closed msramos closed 3 years ago

msramos commented 3 years ago

So I've recently upgraded a project from Phoenix 1.4 to 1.5 - which means that now instead of using SpandexPhoenix.Instrumenter I'm using SpandexPhoenix.Telemetry.

Since the project was deployed, I noticed that the 400 and 500 errors stopped being reported.

Investigating a little further, I added a divison by zero on one of my controllers and extracted the output from spandex before it sends to datadog. This is what I got:

Using the instrumenter

Trace generated by using the instrumenter ```elixir %Spandex.Trace{ baggage: [], id: 3604773592068834834, priority: 1, spans: [ %Spandex.Span{ completion_time: 1613154595602799782, env: nil, error: nil, # <-------- this trace has no error http: nil, id: 3096177269177183673, name: "Phoenix.Controller", parent_id: 896062065962279644, private: [], resource: "Elixir.ServerWeb.HomeController.show", service: :website, services: [], sql_query: nil, start: 1613154595602639932, tags: [], trace_id: 3604773592068834834, type: :web }, %Spandex.Span{ # external request }, %Spandex.Span{ # external request }, %Spandex.Span{ completion_time: 1613154595611185089, env: nil, error: [ # <------ this trace has the error and it's also the topmost span: DataDog will consider this as an error exception: %ArithmeticError{ message: "bad argument in arithmetic expression" }, stacktrace: [ # a very big stacktrace ], error?: true ], http: [ method: "GET", query_string: "", status_code: 500, url: "/", user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36" ], id: 896062065962279644, name: "request", parent_id: nil, private: [], resource: "GET /", service: :website, services: [], sql_query: nil, start: 1613154594473197138, tags: [], trace_id: 3604773592068834834, type: :web } ], stack: [] } ```

Which is reported on DataDog as:

good_datadog

This is correct and DataDog reports this trace as an error

Using telemetry

Trace generated when using telemetry ```elixir %Spandex.Trace{ baggage: [], id: 625532881523353555, priority: 1, spans: [ %Spandex.Span{ completion_time: 1613154203997459845, env: nil, error: [ # <--- this span has the error! exception: %Plug.Conn.WrapperError{ conn: %Plug.Conn{ #conn data }, stacktrace: [], error?: true ], http: nil, id: 5488488812988202553, name: "phx.router_dispatch", # <- Different span name parent_id: 7832421358083905014, private: [], resource: "Elixir.ServerWeb.HomeController.show", service: :website, services: [], sql_query: nil, start: 1613154202618683661, tags: [], trace_id: 625532881523353555, type: :web }, %Spandex.Span{ # external request }, %Spandex.Span{ # external request }, %Spandex.Span{ completion_time: 1613154204008839353, env: nil, error: nil, # <--- this is the top span and doesn't have any errors: datadog will not consider the trace as havin an error! http: [ method: "GET", query_string: "", status_code: 500, url: "/", user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36" ], id: 7832421358083905014, name: "request", parent_id: nil, private: [], resource: "GET /", service: :website, services: [], sql_query: nil, start: 1613154202616308490, tags: [], trace_id: 625532881523353555, type: :web } ], stack: [] } ```

Which is reported on DataDog as:

wrong_datadog

This is wrong and DataDog does not report this trace as an error

Questions

GregMefford commented 3 years ago

Thank you for this very detailed issue report! ❤️ To summarize my understanding of what's happening: after upgrading to use the new Phoenix telemetry instead of the old instrumenter, the same error is still being reported, but now the error shows up on the span that represents the Phoenix Controller action instead of the one that represents the Phoenix Endpoint, so Datadog does not mark the top-level trace as an error - only the Controller span.

I believe that this is unfortunately due to a change in the way Phoenix itself reports its telemetry: if there is an exception raised in a Controller action, there will be a [:phoenix, :router_dispatch, :exception] Telemetry event fired instead of the normal [:phoenix, :router_dispatch, :stop] event. This is how we can tell that there was an error and mark the Controller span as having an error. The GET / span in your screenshot is actually generated by the Plug.Telemetry, event_prefix: [:phoenix, :endpoint] line that you probably have in your app's Endpoint somewhere. Unfortunately, this only generates [:phoenix, :endpoint, :start] and [:phoenix, :endpoint, :stop] events, and if there is an exception, it does not fire an :exception event or a :stop event.

In your case, the exception is being caught somewhere between the Controller action and the Endpoint, so that there is no longer an Exception in the Elixir runtime, because Phoenix has caught it so that it can turn it into a 500 error response to the client. Spandex by default does not mark those "successful 500" responses as an error, although there's a case to be made that it should. I don't think it would be good to switch this default for all users of Spandex because it has been this way for a long time now and it would be a breaking change to mark all 5xx traces as error on an upgrade where they weren't before.

You can change this behavior yourself if you want to, though, using the customize_metadata callback when you install the integration. This callback receives a Plug.Conn struct, which you can use however you want to return the metadata you want to apply to the top-level trace. The default behavior is to call SpandexPhoenix.default_metadata/1, so if you mostly want what you have today, simply call that first to get your base metadata, then add or remove things before returning it, like this (I don't know if this really compiles or would work, it's more of a sketch of the idea):

def my_customize_metadata(%{status: status} = conn) when is_integer(status) and status >= 500 do
  conn
  |> SpandexPhoenix.default_metadata()
  |> Keyword.put(:error, [error?: true])
end

def my_customize_metadata(conn), do: SpandexPhoenix.default_metadata(conn)

This will set the error.error? metadata to true, which tells Datadog to mark the span as an error. You unfortunately wouldn't have access to the Exception or stacktrace at this point in the code, due to the reasons described above. You can also conditionally do anything else you want with the metadata, based on whatever conditions can be determined using a Plug.Conn struct.

I hope that helps, and feel free to let me know if that doesn't work or if you get stuck trying to implement it.

msramos commented 3 years ago

Thanks for the response @GregMefford !

As a follow-up: the solution you provided worked and the errors are correctly being reported on datadog!

mvpgomes commented 2 years ago

@GregMefford I was about to open an issue to report exactly this problem. In my case here is the scenario:

While reading Telemetry documentation, I noticed we have a [:phoenix, :error_rendered] event that is emitted by Phoenix after an Error view is rendered. What you think in adding an extra handler in telemetry to process such events and mark the spans as errors?