Open mikedhanson opened 1 week ago
When the service is running, can you check the state of Hangfire? If you go to localhost:5000/hangfire, I'm curious about the number of queued jobs. I've seen similar errors, albeit without crashing, when there were thousands or millions of jobs queued in Hangfire and it couldn't process them fast enough.
In the enqueued jobs, is there a specific type that is queued (heartbeat, groom, etc.)? Is it on the queue of an online machine?
One quick way to work around the situation is to truncate the Hangfire.Job table: https://support.ironmansoftware.com/portal/en/kb/articles/kb0077-startup-failure-of-powershell-universal-server-in-multi-node-sql-environment
Skimming the jobs, they all look to be related to ExecutionService.Execute.
There are two queues.
Top one is a machine that is technically online but not on this sql db anymore. We had to revert when we ran into another unrelated issue.
The bottom one is another computer I must have been doing some testing on at some point.
I am not seeing the queue for localhost. However, in PSU, I see the correct computer as the only computer.
Could jobs be queueing up behind the scenes on a "computer/queue" that doesn't exist anymore in the SQL instance?
Running the Hangfire cleanup on the DB now. @adamdriscoll
Shouldn't the queues be tied to a computer/node in PSU? If you remove a computer, shouldn't that queue go away?
It should be. I'll leave this issue open to see if we can figure out why that isn't happening.
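For anyone hitting the same thing, here is a rough sketch of the check described above: count enqueued jobs per queue and flag queues that don't match any computer PSU currently lists. It assumes queue names correspond to computer names (compared case-insensitively), which is only inferred from this thread. `find_orphaned_queues`, the row shape, and the sample names are hypothetical; the rows would have to be exported from the Hangfire dashboard or database yourself.

```python
from collections import Counter


def find_orphaned_queues(enqueued_jobs, active_nodes):
    """Return {queue_name: job_count} for queues with no matching active node.

    enqueued_jobs: iterable of (job_id, queue_name) pairs (hypothetical export format).
    active_nodes: iterable of computer names currently shown in PSU.
    Assumption: a queue belongs to a node when the names match, ignoring case.
    """
    active = {name.lower() for name in active_nodes}
    counts = Counter(queue.lower() for _, queue in enqueued_jobs)
    return {queue: n for queue, n in counts.items() if queue not in active}


# Made-up sample data: two stale queues and one queue for the live node.
rows = [(1, "oldnode01"), (2, "oldnode01"), (3, "testbox"), (4, "psu-prod")]
print(find_orphaned_queues(rows, ["PSU-PROD"]))
```

Note that a shared queue such as Hangfire's `default` would also be flagged by this check, so add it to `active_nodes` if it is in use.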
Version
4.4.0
Severity
Critical
Environment
msi
Steps to Reproduce
I have noticed that an instance of PSU I'm hosting will randomly crash with the following error in the event logs
and this in the system logs
Expected behavior
Actual behavior
Additional Environment data
Using MSI install with sql hosted in azure
Screenshots/Animations
No response