LEventStore timeout is caught instead of propagated

jimlyndon commented 7 years ago

As requested, recreating the issue defined at https://github.com/actionml/template-scala-parallel-universal-recommendation/issues/28 to this repo to resolve. Read up on the original issue for overview.

jimlyndon commented 7 years ago

So I was looking into starting on this issue, but had some thoughts around it that needed discussing first. The user experience for an end user should be to receive a HTTP 503 service unavailable error with possibly an additional body/message that explains that the event server is unavailable.

Having said that, the easiest solution is to simply un-catch the timeout exception allowing it to propagate up to the prediction server where it can be caught, and used to build an appropriate 503 response, something like:

case e: scala.concurrent.TimeoutException =>
  val msg = s"Query:\n$queryString\n\nStack Trace:\n" +
   ...
  complete(StatusCodes.ServiceUnavailable, msg)

However this seems a little lazy. Why should the prediction server know that this particular scala.concurrent.timeoutexception is an event server problem?

We can't use the event server's API handler because the UR template uses the event store layer directly.

So it seems a cleaner design to have LEventStore.findByEntity/find, PEventStore.find actually catch all/some/any exceptions that might be raised by their upstream dependencies, (like scala.concurrent.timeoutexception thrown by scala.concurrent.Await) and pass their own custom exceptions downstream to the UR template and onward to the prediction server to handle.

So some new custom exception, like io.prediction.data.store.Common.EventServerTimeoutException that wraps around scala.concurrent.timeoutexception and is cleanly handled by CreateServer.scala within the Prediction Server.

Thoughts?

jimlyndon commented 7 years ago

Just an FYI, my fork simply removes the catch and allows the timeout exception to propagate to P.IO HTTP service, which forwards on a Internal Server error, which our client simply catches and retries N number of times before failing.

I realize the real solution, as I initially suggested, is a bigger conversation with P.IO committers regarding custom exceptions from each part of the product (event server, prediction server, etc) and should probably be considered among other questions surrounding the future road map, before coding up even a few new exception classes, etc.

So at this time I wouldn't consider this issue critical, at least for my personal usage.

pferrel commented 7 years ago

Not sure how this will affect the batch event input. This is when up to 50 events are sent en masse to the EventServer. It is designed to return an array of responses to the array of input events. The responses being HTTP responses like 200.

In order to accept this into Apache PIO you'd have to explain or test this situation to make sure it does something sane. Is that possible?

jimlyndon commented 7 years ago

In order to accept this into Apache PIO you'd have to explain or test this situation to make sure it does something sane. Is that possible?

I haven't made any changes/pull-request againt P.IO yet. The only change I made was to fork template-scala-parallel-universal-recommendation and allow the timeout exception to go unchecked: https://github.com/Massdrop/template-scala-parallel-universal-recommendation/commit/57c10da6c3c395578ebafd910007ce42e858ea32

My hesitation to modify P.IO is that I think the real fix is to provide custom exceptions to each "service": event service, prediction server service, etc. - I'm following the eventual micro service decoupling referred to in the P.IO roadmap doc.

BUT, what I think I'll do is create a PR against your pferrel/fork of P.IO, with a custom event for the event server timeout. etc., and send it your way to get feedback on. Probably sometime in the new year.

Not sure how this will affect the batch event input

I'll also make sure I cover that in the aforementioned PR.

pferrel commented 7 years ago

I encourage you to discuss on the Apache dev list, Donald Szeto is the authority on things like this and is very open to improvements. I'm sure he'll have better ideas than I will.

When it comes to the UR it's my baby so PRs to it are welcome here but from now on the repo has changed to one that supports Apache PIO here: https://github.com/pferrel/universal-recommender

pferrel / template-scala-parallel-universal-recommendation

LEventStore timeout is caught instead of propagated #23