typelevel / cats-effect

The pure asynchronous runtime for Scala
https://typelevel.org/cats-effect/
Apache License 2.0

Wrong stack trace with timeout #2742

Open geirolz opened 2 years ago

geirolz commented 2 years ago

Given the following code, I get a TimeoutException for 1500 millis, which is correct, BUT the stack trace takes me to the line of t2 instead of t3. I was expecting the stack trace to take me to the timeout that was exceeded:

val t1 = IO.sleep(1.seconds)
val t2 = IO.sleep(2.seconds).timeout(5.seconds)
val t3 = (t1 >> t2).timeout(1500.millis)
t3.unsafeRunSync() // TimeoutException at timeout Line 2

While if I add a timeout to t1, I get the same exception, but the stack trace brings me to t1:

val t1 = IO.sleep(1.seconds).timeout(5.seconds)
val t2 = IO.sleep(2.seconds).timeout(5.seconds)
val t3 = (t1 >> t2).timeout(1500.millis)
t3.unsafeRunSync() // TimeoutException at timeout Line 1

If there are no nested timeouts, I get the right stack trace:

val t1 = IO.sleep(1.seconds)
val t2 = IO.sleep(2.seconds)
val t3 = (t1 >> t2).timeout(1500.millis)
t3.unsafeRunSync() // TimeoutException at timeout Line 3

The issue persists with every cats.effect.tracing.mode setting.

djspiewak commented 2 years ago

This is very interesting! Can definitely confirm this is a bug.

djspiewak commented 2 years ago

Okay, I figured this out. It has to do with the way that we're keying traces. full tracing mode should have avoided the issue, but that appears to have some other issues that need ironing out.

The problem is that the first one wins on determining the call site for the flatMap inside of timeout. Anything starting with the cats. package is filtered out, which is why the trace lists your source code as being the caller. When you only have the one timeout call, the call site is correct (relative to expectations), and so you see the right thing in the trace (albeit with some repeated frames due to the aforementioned filtering). With multiple timeouts, you'll end up getting the very first as the trace value for the flatMap inside of timeout (since obviously that flatMap is fully shared), meaning that all timeouts will be traced back to that very first evaluation call site, which is precisely why the stack trace is so weird.

This is a bit annoying. Under normal circumstances, we wouldn't be trying to trace timeout at all, it's just the fact that it happens to be derived from flatMap and race which makes it in any way traceable, but of course it's the same flatMap every time, resulting in the unintuitive attribution you reproduced. Going to give some thought to this.
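
For intuition, here is a simplified sketch of how a timeout derived from `race` and `flatMap` ends up with exactly one internal `flatMap` call site, which is the site that cached tracing keys on. This is not the actual cats-effect source, and `timeoutSketch` is a made-up name:

```scala
import cats.effect.IO
import scala.concurrent.duration.FiniteDuration
import java.util.concurrent.TimeoutException

// Simplified sketch, NOT the real implementation: a timeout derived from
// race + flatMap. The single flatMap below is shared by every caller, so
// cached tracing associates it with whichever user call site evaluated it
// first, and all later timeouts inherit that (misleading) frame.
def timeoutSketch[A](ioa: IO[A], duration: FiniteDuration): IO[A] =
  IO.race(ioa, IO.sleep(duration)).flatMap {
    case Left(a)  => IO.pure(a)
    case Right(_) => IO.raiseError(new TimeoutException(duration.toString))
  }
```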

geirolz commented 2 years ago

Sorry for the late reply @djspiewak, and thank you for your detailed explanation. The real problem is that when you have several timeouts with the same duration, it can be difficult to understand which one expired (this is the problem we had in prod and the reason behind this ticket).

djspiewak commented 2 years ago

I expected as much, but one thing kind of puzzles me: even if the last trace frame is deceptive, wouldn't the previous trace frames be a little more illustrative? That doesn't hold true in the OP example, but a production codebase is unlikely to just consist of a whole bunch of timeout calls.

geirolz commented 2 years ago

In our case we have an engine that executes a series of steps (like the saga orchestrator pattern). Each step has a max timeout in the engine, while each step action is defined as an IO, so in some cases we have a timeout inside the step action.

A step action is sometimes a composition of multiple IOs, each of which could have an inner timeout. So, in case of a timeout exception, we get a stack trace that points us to the first timeout, which isn't the one that threw the exception.

djspiewak commented 1 year ago

So as a quick note, one workaround (not ideal, but functional) for this problem would be to replace .timeout(5.seconds) with .timeoutTo(5.seconds, IO.defer(IO.raiseError(new TimeoutException))). (credit to @armanbilge for realizing this would solve it)
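
For concreteness, here is a minimal sketch of that workaround applied to the reproducer from the issue description. The exception messages, the `workaround` entry point, and the scala-cli setup are my additions, not part of the original report:

```scala
//> using lib "org.typelevel::cats-effect::3.5.0-RC5"
import cats.effect.IO
import cats.effect.unsafe.implicits.global
import scala.concurrent.duration._
import java.util.concurrent.TimeoutException

// Sketch of the workaround: each timeout gets its own fallback that raises
// a TimeoutException from a distinct call site, so the resulting trace can
// distinguish which timeout actually fired.
@main def workaround(): Unit =
  val t1 = IO.sleep(1.seconds)
  val t2 = IO.sleep(2.seconds)
    .timeoutTo(5.seconds, IO.defer(IO.raiseError(new TimeoutException("t2"))))
  val t3 = (t1 >> t2)
    .timeoutTo(1500.millis, IO.defer(IO.raiseError(new TimeoutException("t3"))))
  t3.unsafeRunSync()
```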

geirolz commented 1 year ago

Thanks, @djspiewak! Not ideal just because it's a bit verbose?

djspiewak commented 1 year ago

Just because it's a bit verbose :-) It's annoying because it's forcing you to effectively inline a combinator in order to manipulate the stack trace. This type of thing is somewhat to be expected, but that doesn't make it ideal.

armanbilge commented 1 year ago

Tested this again with v3.5.0-RC5. Setting cats.effect.tracing.mode=full seems to give accurate stack traces, so if there was a bug there it was since quashed.

For cached tracing it is technically working as it should AFAICT ... and in a production codebase, it seems like there should typically be other context in the trace anyway ... right? 😀
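
For reference, one way to enable that mode from an sbt build. This is only a sketch assuming a forked run; the system property name is the one referenced above:

```scala
// build.sbt sketch: pass the tracing mode to the forked JVM as a system
// property. Full tracing captures complete stack traces but is considerably
// slower, so it is usually reserved for debugging sessions.
fork := true
javaOptions += "-Dcats.effect.tracing.mode=full"
```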

Btw, here's a slightly minimized minimizer.

```scala
//> using lib "org.typelevel::cats-effect::3.5.0-RC5"

import cats.effect._
import cats.effect.unsafe.implicits._
import scala.concurrent.duration._

@main def main =
  val t1 = IO.sleep(1.millis).timeout(60.seconds)
  val t2 = IO.sleep(2.seconds).timeout(1.seconds) // line 9
  val t3 = t1 *> t2
  t3.unsafeRunSync()
```

```
Exception in thread "main" java.util.concurrent.TimeoutException: 1 second
  at timeout @ test$package$.main(test.scala:8)
  at timeout @ test$package$.main(test.scala:8)
  at unsafeRunSync @ test$package$.main(test.scala:11)
  at *> @ test$package$.main(test.scala:10)
  at timeout @ test$package$.main(test.scala:8)
  at timeout @ test$package$.main(test.scala:8)
```

The only way I can think of to fix this for cached tracing is if we collected some other trace event at the timeout call site and inserted it into the stack, e.g. if the duration argument were a by-name => Duration. But that's a rather bespoke change that would have to happen in the typeclasses.

So I propose we either:

  1. close this as a won't-fix / can't-fix / by-design / annoying-workaround-available
  2. transform into a more focused issue: add trace events for timeout

Btw, another related idea I had: we currently have IO.trace: IO[Trace] which returns the current trace. But I wonder if we could have something like:

object IO {
  def traced[A](ioa: => IO[A]): IO[A] = ???
}

It would basically be like defer, but it would collect and insert (or replace?) a tracing event via the by-name lambda, as an easier workaround for these kinds of issues. Actually, I don't yet fully understand why this doesn't work with defer already.
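
For illustration, usage of such a combinator (purely hypothetical, not an existing API) on the earlier reproducer might look like this:

```scala
// Hypothetical usage of the proposed IO.traced: wrapping the composition
// would, if implemented, record a trace frame at this call site, so a
// timeout raised inside could be attributed here instead of to the shared
// flatMap inside timeout.
val t3 = IO.traced((t1 >> t2).timeout(1500.millis))
t3.unsafeRunSync()
```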