mtth / tracing

Distributed tracing
https://hackage.haskell.org/package/tracing
BSD 3-Clause "New" or "Revised" License
24 stars 11 forks source link

Discarding of HTTP result might hide failures. #28

Open alaendle opened 6 months ago

alaendle commented 6 months ago

https://github.com/mtth/tracing/blob/aa0cc7001795b5ccf9fbc87a9fa33282bdda5779/src/Monitor/Tracing/Zipkin.hs#L123

Do we need to check the HTTP status? I'm unsure about the impact - but in my case a unreachable address resulted in a 502 Bad Gateway status.

Maybe we should check for 2XX results (statusIsSuccessful)? But what should we do in other cases? Just output a warning?

mtth commented 6 months ago

I think the best is to expose these errors to users so they can be handled appropriately for the use-case - there isn't a one-size-fits-all handling strategy. Would https://github.com/mtth/tracing/pull/9 with an additional status check work? It exposes a separate publishSpans function and typed exception to provide this flexibility without bundling too much logic in the constructor.

alaendle commented 6 months ago

@mtth Sorry for the late response, took some time until I found a chance to think it through - from my perspective #9 alone wouldn't solve everything perfectly. Maybe I've misunderstood something, but I think the following shortcoming is there:

So with that said, by adding a additional check, you could solve the issue! But I fear the price we pay is a really hard to understand API usage. From my perspective we should strive for a solution that works with the ZPK.with zpkSettings . ZPK.run function. So I wonder if it makes sense to establish some kind of optional exception handler in the settings record (rasing the exception could be the default handler)? Here the user could really decide what to do and the operation could be resumed (by internally re-queueing the spans).

And don't get me wrong, this is by no way a critic. I'm thankful for any improvement and value your development efforts ❤️ ! So please take this just as a side note/comment - I could be wrong.

mtth commented 6 months ago

Thanks for the thoughtful reply, @alaendle. Both approaches are complementary, we can both:

  1. expose primitives for clients to do custom error handling (that's #9);
  2. enhance the current default convenience behavior, as long as it's generic and simple enough.

For 2, we could for example add the HTTP status check and buffer spans on failure until the next retry. We'd add some state to Zipkin (maybe a TVar Seq, with a configurable limit to avoid OOMs). What do you think?