rebus-org / Rebus.SqlServer

:bus: Microsoft SQL Server transport and persistence for Rebus
https://mookid.dk/category/rebus
Other
43 stars 44 forks source link

UpdateLease exceptions produce fail-fast exceptions #64

Closed clegendre closed 4 years ago

clegendre commented 4 years ago

Hello,

At this line of code, the UpdateLease can failed, especially when the Bus is overloaded (databases timeout for example: we run a lot of worker - 500 - that polls the same queue). But when it failed, it produces a fail fast exception... Have you considered using Polly for example to manage retries over SQL calls ?

https://github.com/rebus-org/Rebus.SqlServer/blob/b41d2d8688d66f80395924567a023610cb725f13/Rebus.SqlServer/SqlServer/Transport/SqlServerLeaseTransport.cs#L412

mookid8000 commented 4 years ago

(...) when it failed, it produces a fail fast exception...

I can definitely see that that is not optimal.

Have you considered using Polly for example to manage retries over SQL calls ?

I can see that this particular operation could benefit from retrying a couple of times before giving up. It's not given that it would help you in your situation though, as it sounds like your database is too overloaded... when there's contention for a resource, introducing retries will often contribute to worsening the situation.

mookid8000 commented 4 years ago

If the idea of retrying is something you would like to pursue, would you be interested in submitting a PR?

clegendre commented 4 years ago

Maybe just a Try/Catch over this call in the timer to avoid big app crash ? I could submit a PR with this.

mookid8000 commented 4 years ago

Maybe just a Try/Catch over this call in the timer to avoid big app crash ? I could submit a PR with this.

Yes. Since there's not a lot of intelligent things we can do about an overloaded SQL Server, simply catching the error and logging it is probably the best thing to do in this situation. PR is most welcome 😊

MrMDavidson commented 4 years ago

Maybe just a Try/Catch over this call in the timer to avoid big app crash ? I could submit a PR with this.

I'm not sure logging and continuing is the right option here. If the message handling is long running, and the lease hasn't updated, there's a possibility that the lease will expire and run past it's tolerance threshold allowing another worker to pick up the message even though it's still being handled by the original worker.

It might be worthwhile introducing some configurable / user defined handling of this error.

mookid8000 commented 4 years ago

What do you think the renewal task running in the background could do about it if the renewal failed?

MrMDavidson commented 4 years ago

What do you think the renewal task running in the background could do about it if the renewal failed?

Hrm, excellent question... I don't think there's a general one size fits all solution here. Perhaps we can introduce something like;

public interface ILeaseRenewalExceptionHandler {
  void HandleException(Exception ex);
}

And then have the default registration as an implementation such as;

public interface LoggingLeaseRenewalExceptionHandler {
 private ILog _log;

  public LoggingLeaseRenewalExceptionHandler(ILog log) {
    _log = log;
  }

  public void HandleException(Exception ex) {
    /// Can't remember the log API off the top of my head
    _log.Log(ex);
  }
}

But other implementors could do whatever... orchestrate the bus to shutdown, throw the exception, etc.

mookid8000 commented 4 years ago

That would definitely be the most extensible solution, but I think it's too much effort and too much API without a purpose – at least for now.

I would prefer a couple of retries (could be Polly or could be manually implemented to avoid introducing an extra dependency) followed by a simple log statement explaining the situation.

clegendre commented 4 years ago

I'm not sure logging and continuing is the right option here. If the message handling is long running, and the lease hasn't updated, there's a possibility that the lease will expire and run past it's tolerance threshold allowing another worker to pick up the message even though it's still being handled by the original worker.

Since messages processing should be idempotent, it does not matter if the message is processed twice by another worker. I prefer this situation rather than an application crash.

It might be worthwhile introducing some configurable / user defined handling of this error.

Maybe we can simply make UpdateLease method virtual and put a try/catch over the call in the timer tick.

mookid8000 commented 4 years ago

fixed by #65 – out as Rebus.SqlServer 6.1.2 on NuGet.org