Open Meyce opened 3 months ago
Hi @Meyce , thanks for reporting your findings!
I've just gone over the code in Rebus.ServiceProvider responsible for setting up "the modular monolith" type of independent Rebus instance, and I could not find any obvious ways that e.g. an intermittent error would be able to crash that particular instance or cause it to stop processing messages in any other way.
If you find any indications of what could be wrong during your investigation, please let me know. I'll go and do a couple of clean-up cycles on the code of Rebus.ServiceProvider to see if I come across anything while cleaning/refactoring.
@mookid8000,
Thanks for checking. The problem came up in production again last night. One of the queues stopped being processed.
I think I'm am starting to see a pattern emerge between the queue that stops getting processed, and some weirdness in the logging during shutdown. It seems that for whichever queue/bus has an issue, during shutdown I start to see some messages related to the shutdown and disposal of that bus, however only a few of the expected messages are written to the log, and then no other logs show related to the bus or the service as a whole. So I'll see the following messages at the tail of the log
Once that last message is written, no other logs are written related to the bus or the service as a whole. There should be more logs just related to the shutdown of the bus, let alone the shutdown of the service.
I'm wonder if that last message is an indication of something Async that is blocking, and is perhaps the reason the bus stopped receiving/processing messages in the first place. I can say that it is not always the same bus that causes the issue. I'm going to look back through the the handlers registered on the bus, to make sure I don't have something that could be blocking. I think it is probably unlikely that it a poorly written handler as the service works fine for a period of time, and it seems like the error would occur in a specific handler consistently which doesn't seem to be the case.
I'll report back after a deeper look into the handlers.
Hi @Meyce , have you looked further into this? Did it happen to you again?
Yes. It’s been occurring about once a week. I set up a hosted service in the test env to send a message to each queue every 15 minutes. Each queue handles the “heartbeat” message and writes to the logs. Set up this way the service has ran without issue for over two weeks.
I’ve got logging turned up to Debug but nothing useful shows up.
Sent from Gmail Mobile
On Mon, Sep 2, 2024 at 4:03 AM Mogens Heller Grabe @.***> wrote:
Hi @Meyce https://github.com/Meyce , have you looked further into this? Did it happen to you again?
— Reply to this email directly, view it on GitHub https://github.com/rebus-org/Rebus.ServiceProvider/issues/93#issuecomment-2324075114, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACILQEVNXSHFVRYHKB3OFFTZUQLTJAVCNFSM6AAAAABMOSWSCCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRUGA3TKMJRGQ . You are receiving this because you were mentioned.Message ID: @.***>
I've been encountering an issue where when using the "modular monolith", the individual Rebus services stop processing messages. The app currently uses the following
.AddRebusService
5 times from the Host Builder, to process 5 different queues. All the queues are related to the same domain.The problem I've run into is after the service is up and running successfully for a period of time, some/all queues stop getting processed and messages sit in ASB until the service is restarted. I don't know that this necessarily a Rebus problem, but I thought I would post here to see if anyone has any ideas. Currently it is hard for me to say how long it is before this occurs. This isn't a high volume service and I only notice it after it stops processing messages, which means it could enter a state where it isn't processing, but I don't know about it until a new message arrives. This could potentially be hours since it doesn't do much/any processing outside of regular business hours.
I'm using multiple calls to other
.AddRebusService
in other apps, but those apps are on .Net 6, don't use workload identities, and I don't believe they use OTEL either. I have the logging turned up to debug, and nothing useful is showing in the logs. Nothing that indicates when the services stop processing messages. There's no indication that anything is wrong until messages start backing up in the queues.When it enters this state, there are differences in the logging when the service shuts down. I've noticed the following logging when the service starts and is shutdown successfully. The logging starts the buses in the order they are registered to the HostBuilder and shuts them down in the reverse order. So if the buses were named A,B,C,D,E. When the console app starts you see them started in that order, and when it shuts down they are shutdown in the opposite order (E,D,C,B,A). That seems the be consistent via my observations.
When I've noticed the problem occurs and I shutdown the service, only the last two buses registered via
.AddRebusService
log that they've shutdown. Nothing is logged for the other 3. This the only thing I've been able to find from the logs thus far.Please let me know if you have any ideas. I'm still trying to gather info, but with the logs not supplying anything useful, I'm guessing at what the problem is. My next steps are the following:
.AddRebusService
calls. I don't expect this to do anything but I want to see if the logging output changes when the service stops processing messages.Thanks