spandex-project / spandex_datadog

A datadog adapter for the `spandex` library.
MIT License
59 stars 42 forks source link

UnicodeConversionError causes ApiServer crash #41

Closed tapickell closed 3 years ago

tapickell commented 3 years ago

We are seeing this issue in testing locally and when deployed to production on some of our services that just adopted using Spandex Datadog.

Forgive me if this is something simple and not a bug but I am not the one that implemented it and am having a heck of a time trying to debug what exactly is happening when we see this error.

[error] [18:32:05.271] Task #PID<0.9958.0> started from SpandexDatadog.ApiServer terminating
** (UnicodeConversionError) invalid encoding starting at <<207, 9, 19, 82, 112, 251, 94, 245, 40>>
    (elixir 1.10.4) lib/list.ex:944: List.to_string/1
    lib/exvcr/adapter/hackney/converter.ex:6: ExVCR.Adapter.Hackney.Converter."parse_request_body (overridable 1)"/1
    lib/exvcr/adapter/hackney.ex:42: ExVCR.Adapter.Hackney.generate_keys_for_request/1
    lib/exvcr/handler.ex:27: ExVCR.Handler.get_response_from_cache/2
    lib/exvcr/handler.ex:17: ExVCR.Handler.get_response/2
    (httpoison 1.8.0) lib/httpoison/base.ex:846: HTTPoison.Base.request/6
    (spandex_datadog 1.1.0) lib/spandex_datadog/api_server.ex:160: SpandexDatadog.ApiServer.send_and_log/2
    (spandex_datadog 1.1.0) lib/spandex_datadog/api_server.ex:231: anonymous fn/2 in SpandexDatadog.ApiServer.maybe_flush_traces/1
    (elixir 1.10.4) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
    (stdlib 3.14.2) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
    (hackney 1.17.4) :hackney.request(:put, "http://localhost:8126/v0.3/traces", [{"Content-Type", "application/msgpack"}, {"X-Datadog-Trace-Count", 1}], [145, [151, [140, [164 | "type"], [162 | "db"], [168 | "trace_id"], <<207, 9, 19, 82, 112, 251, 94, 245, 40>>, [165 | "start"], <<207, 22, 130, 36, 169, 71, 26, 253, 246>>, [167 | "span_id"], <<207, 81, 215, 254, 176, 217, 250, 251, 255>>, [167 | "service"], [164 | "ecto"], [168 | "resource"], [[217, 83] | "SELECT ....... from ......... where ......."], [166 | "sql.db"], [173 | "repo_database"], [171 | "param_count"], [161 | "2"]], [165 | "error"], [0], [168 | "duration"], <<206, 0, 1, 56, 121>>], [140, [164 | "type"], [163 | "web"], [168 | "trace_id"], <<207, 9, 19, 82, 112, 251, 94, 245, 40>>, [165 | "start"], <<207, 22, 130, 36, 169, 71, 23, 154, 50>>, [167 | "span_id"], <<207, 25, 16, 150, 89, 227, 57, 154, 221>>, [167 | "service"], [163 | "oms"], [168 | "resource"], [[217, 67] | "MyAppWeb.ParentContext.V1.ChildController.phoenix_controller_pipeline/2"], [169 | "parent_id"], <<207, 69, 184, 136, 169, 18, 144, 97, 53>>, [164 | "name"], [[217, 67] | "MyAppWeb.ParentContext.V1.ChildController.phoenix_controller_pipeline/2"], [167 | "metrics"], [129, [181 | "_sampling_priority_v1"], [1]], [164 | "meta"], [128], [165 | "error"], [0], [168 | "duration"], <<206, 0, 19, 230, 135>>], [140, [164 | "type"], [163 | "web"], [168 | "trace_id"], <<207, 9, 19, 82, 112, 251, 94, 245, 40>>, [165 | "start"], <<207, 22, 130, 36, 169, 71, 22, 102, 254>>, [167 | "span_id"], <<207, 69, 184, 136, 169, 18, 144, 97, 53>>, [167 | "service"], [163 | "oms"], [168 | "resource"], [[217, 51] | "Elixir.MyAppWeb.ParentContext.V1.ChildController.update"], [169 | "parent_id"], <<207, 57, 43, 142, 229, 191, 162, 23, 10>>, [164 | "name"], [179 | "phx.router_dispatch"], [167 | "metrics"], [129, [181 | "_sampling_priority_v1"], [1]], [164 | "meta"], [128], [165 | "error"], [0], [168 | "duration"], <<206, 0, 21, 66, 161>>], [139, [164 | "type"], [163 | "web"], [168 | "trace_id"], <<207, 9, 19, 82, 112, 251, 94, 245, 40>>, [165 | "start"], <<207, 22, 130, 36, 169, 71, 20, 106, 53>>, [167 | "span_id"], <<207, 57, 43, 142, 229, 191, 162, 23, 10>>, [167 | "service"], [163 | "oms"], [168 | "resource"], [[217, 32] | "PATCH /parent_route/v1/child_route/:id"], [164 | "name"], [167 | "request"], [167 | "metrics"], [129, [181 | "_sampling_priority_v1"], [1]], [164 | "meta"], [131, [168 | "http.url"], [190 | "/parent_route/v1/child_route/1234567"], [176 | "http.status_code"], [163 | "400"], [171 | "http.method"], [165 | "PATCH"]], [165 | "error"], [0], [168 | "duration"], <<206, 0, 23, 93, 159>>]]], [])

Seems to be the first bitstring in the hackney request. I am not sure where that comes from, we do not make use of bitstrings in this service so I am not sure where it originates. Could be a utf-8 encoding issue with something that is expected to be parsable to a string, I have seen that in Nerves projects where we make heavy use of bitstrings and have used the wrong number to represent the character.

   (hackney 1.17.4) :hackney.request(:put, "http://localhost:8126/v0.3/traces", [{"Content-Type", "application/msgpack"}, {"X-Datadog-Trace-Count", 1}], [145, [151, [140, [164 | "type"], [162 | "db"], [168 | "trace_id"], <<207, 9, 19, 82, 112, 251, 94, 245, 40>>, [165 | "start"]..., 

I do not really understand the format of the trace either, this may be something simple I am unaware of b/c I have yet to do much with telemetry stuff.

Any help would be greatly appreciated in figuring out what is causing this to error and blow up.

GregMefford commented 3 years ago

Hey there, thanks for opening an issue! I think the problem might be that ExVCR doesn’t work with the Msgpack protocol that Datadog uses. I’ve seen that before, and the solution for me was to have ExVCR ignore those since there’s not much sense in mocking those calls anyway unless you wanted to assert that certain telemetry was sent (but in that case, I’d recommend using your own replacement for the Adapter or ApiServer module in test, or something like that). I believe ExVCR has an option to ignore certain hosts IIRC, but I haven’t used it in a while.

Please let me know if that solves it for you! 🚀

tapickell commented 3 years ago

Thank you @GregMefford for all your help. It is a testing issue with ExVcr. This solves the issue thank you