ironmansoftware / powershell-universal

Issue tracker for PowerShell Universal
https://powershelluniversal.com

Service randomly crashing - System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. #3844

Open mikedhanson opened 1 week ago

mikedhanson commented 1 week ago

Version

4.4.0

Severity

Critical

Environment

msi

Steps to Reproduce

I have noticed that an instance of PSU I'm hosting randomly crashes with the following error in the event logs:

Application: Universal.Server.exe
CoreCLR Version: 7.0.222.60605
.NET Version: 7.0.2
Description: The process was terminated due to an unhandled exception.
Exception Info: System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)

and this in the system logs:

2024-10-01 03:23:15.829 -05:00 [INF] Groom date is: 9/1/2024 8:23:15 AM
2024-10-01 03:23:17.240 -05:00 [INF] Finished groom job.
2024-10-01 03:24:12.844 -05:00 [INF] Starting heartbeat job.
2024-10-01 03:24:26.640 -05:00 [ERR] Execution Worker is in the Failed state now due to an exception, execution will be retried no more than in 00:00:04
System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry, SqlConnectionOverrides overrides)
   at Microsoft.Data.SqlClient.SqlConnection.Open(SqlConnectionOverrides overrides)
   at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
   at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](DbConnection dedicatedConnection, Func`2 func)
   at Hangfire.SqlServer.SqlServerJobQueue.DequeueUsingSlidingInvisibilityTimeout(String[] queues, CancellationToken cancellationToken)
   at Hangfire.SqlServer.SqlServerJobQueue.Dequeue(String[] queues, CancellationToken cancellationToken)
   at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
   at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state)
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)

Expected behavior

no crash

Actual behavior

Service is crashing

Additional Environment data

Using the MSI install with SQL hosted in Azure.

Screenshots/Animations

No response

adamdriscoll commented 1 week ago

When the service is running, can you check the state of Hangfire? If you go to localhost:5000/hangfire, I'm curious about the number of queued jobs. I've seen similar errors, albeit without a crash, when there were thousands or millions of jobs queued in Hangfire and it couldn't process them fast enough.
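
For anyone following along, here is a minimal sketch of how this kind of timeout can arise. It is not PSU, Hangfire, or SqlClient code. It is a toy bounded pool in Python, and every name in it (`BoundedPool`, `PoolTimeout`, `run_workers`) is hypothetical. It only illustrates the general mechanism named in the exception message: when every pooled connection is held and the cap is reached, further callers wait and then time out.

```python
import queue
import threading
import time


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the timeout."""


class BoundedPool:
    """Toy connection pool with a hard cap (illustration only)."""

    def __init__(self, max_size):
        self._free = queue.Queue(maxsize=max_size)
        for i in range(max_size):
            self._free.put(f"conn-{i}")  # placeholder "connections"

    def acquire(self, timeout):
        # Block until a connection is free or the timeout elapses.
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise PoolTimeout(
                "timeout expired before obtaining a connection from the pool"
            ) from None

    def release(self, conn):
        self._free.put(conn)


def run_workers(pool, worker_count, hold_seconds, timeout):
    """Start workers that each hold one connection; return how many timed out."""
    timed_out = []
    lock = threading.Lock()

    def worker():
        try:
            conn = pool.acquire(timeout)
        except PoolTimeout:
            with lock:
                timed_out.append(1)
            return
        try:
            time.sleep(hold_seconds)  # simulate slow work on the connection
        finally:
            pool.release(conn)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(timed_out)
```

With a pool of 2 and 5 workers that each hold a connection longer than the acquire timeout, the 3 workers that arrive late time out. If the work is short relative to the timeout, none of them do.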

mikedhanson commented 1 week ago

image

adamdriscoll commented 1 week ago

In the enqueued jobs, is there a specific type that is queued (heartbeat, groom, etc.)? Are they on the queue of an online machine?

One quick way to work around the situation is to truncate the Hangfire.Job table: https://support.ironmansoftware.com/portal/en/kb/articles/kb0077-startup-failure-of-powershell-universal-server-in-multi-node-sql-environment

mikedhanson commented 1 week ago

Skimming the jobs, they all look to be related to ExecutionService.Execute.

There are two queues.

The top one is a machine that is technically online but no longer on this SQL DB. We had to revert when we ran into another, unrelated issue.

The bottom one is another computer I must have been doing some testing on at some point.

I am not seeing the queue for localhost. However, in PSU, I see the correct computer as the only computer.

image

mikedhanson commented 1 week ago

Could jobs be queueing up behind the scenes on a "computer/queue" that doesn't exist anymore in the SQL instance?
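
To make the question concrete, here is a rough sketch of the check I have in mind. It assumes each queue is named after a computer/node, which is only what the dashboard in this thread suggests, and that names can be compared case-insensitively. The function name and the sample node names are hypothetical, and the inputs would have to be pulled from the Hangfire dashboard and the PSU computer list by hand.

```python
from collections import Counter


def orphaned_queue_counts(enqueued_queues, active_nodes):
    """Count enqueued jobs per queue for queues that match no active node.

    enqueued_queues: one queue name per enqueued job.
    active_nodes: names of the computers currently registered.
    Assumption: a queue belongs to the node with the same name,
    compared case-insensitively.
    """
    active = {name.lower() for name in active_nodes}
    counts = Counter(q.lower() for q in enqueued_queues)
    return {q: n for q, n in counts.items() if q not in active}
```

For example, `orphaned_queue_counts(["old-node", "old-node", "test-pc", "psu01"], ["PSU01"])` returns `{"old-node": 2, "test-pc": 1}`: jobs are waiting on two queues that no registered computer would pick up.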

mikedhanson commented 1 week ago

Running the Hangfire cleanup on the DB now. @adamdriscoll

Shouldn't the queues be tied to a computer/node in PSU? If you remove a computer, shouldn't that queue go away?

image

adamdriscoll commented 1 week ago

It should be. I'll leave this issue open to see if we can figure out why that isn't happening.