oracle / dotnet-db-samples

.NET code samples for Oracle database developers #OracleDotNet
http://otn.oracle.com/dotnet

Any suggestions on how to best tune connection pooling parameters for heavy load? #332

Closed ronnyek closed 7 months ago

ronnyek commented 7 months ago

I'm back trying to investigate the performance problems we've constantly had with Oracle + Oracle.ManagedDataAccess.Core. Our load suite simulates 50 users hitting our APIs, which execute queries via Oracle.ManagedDataAccess.Core. I've played with the connection pooling parameters all over the place: vastly increasing the minimum and maximum connection counts, tweaking the shrink and grow increment sizes. But with 50 simultaneous users running reasonably sized workloads (with waits between requests), we can EASILY get the app to start failing badly.
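For context, here's a simplified sketch of the kind of per-request pattern in play (the handler name and query are placeholders, not our actual code). The idea is to open the connection as late as possible and dispose it as soon as possible, so it is returned to the pool instead of being held for the rest of the request:

```csharp
using System;
using System.Threading.Tasks;
using Oracle.ManagedDataAccess.Client;

public static class OrderQueries
{
    public static async Task<int> CountOrdersAsync(string connectionString)
    {
        using var conn = new OracleConnection(connectionString);
        await conn.OpenAsync();                 // draws a connection from the pool
        using var cmd = conn.CreateCommand();
        cmd.CommandText = "SELECT COUNT(*) FROM orders";
        return Convert.ToInt32(await cmd.ExecuteScalarAsync());
        // Dispose() returns the connection to the pool; it does not
        // tear down the underlying session.
    }
}
```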

Ultimately we end up getting a load of these exceptions:

```
ORA-50000: Connection request timed out
*** STACK TRACE ***
   at OracleInternal.ConnectionPool.PoolManager`3.Get(ConnectionString csWithDiffOrNewPassw, Boolean bGetForApp, OracleConnection connRefForCriteria, String affinityInstanceName, Boolean bForceMatch, Boolean bAsync)
   at OracleInternal.ConnectionPool.OraclePoolManager.Get(ConnectionString csWithNewPassw, Boolean bGetForApp, OracleConnection connRefForCriteria, String affinityInstanceName, Boolean bForceMatch, Boolean bAsync)
   at OracleInternal.ConnectionPool.OracleConnectionDispenser`3.Get(ConnectionString cs, PM conPM, ConnectionString pmCS, OracleOpaqueString securedPassw, OracleOpaqueString securedProxyPassw, OracleConnection connRefForCriteria, Boolean bAsync)
   at Oracle.ManagedDataAccess.Client.OracleConnectionInternal.OpenInternal(Boolean bAsync)
   at Oracle.ManagedDataAccess.Client.OracleConnectionInternal.Open()
   at Oracle.ManagedDataAccess.Client.OracleConnection.Open()
```

Even after lowering the rate of queries, profiling the app showed that out of roughly 123 seconds of execution time, 73 seconds were spent just acquiring a connection.
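For anyone who wants to reproduce that measurement, here's a rough sketch of how to separate connection-acquisition time from query time (the connection string and query are placeholders):

```csharp
using System;
using System.Diagnostics;
using Oracle.ManagedDataAccess.Client;

var connectionString = "<your connection string>";

var sw = Stopwatch.StartNew();
using var conn = new OracleConnection(connectionString);
conn.Open();                                   // time spent waiting on the pool
var acquireMs = sw.ElapsedMilliseconds;

sw.Restart();
using var cmd = conn.CreateCommand();
cmd.CommandText = "SELECT 1 FROM dual";
cmd.ExecuteScalar();                           // time spent executing the query
var queryMs = sw.ElapsedMilliseconds;

Console.WriteLine($"acquire={acquireMs}ms query={queryMs}ms");
```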

Here's the connection string we're using:

```
Data Source=(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<host>)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=<service>)));User Id=<user>;Password=<pass>;Pooling=true;Max Pool Size=1000;Connection Timeout=30;Min Pool Size=30;Incr Pool Size=10
```
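The same settings expressed via OracleConnectionStringBuilder, in case that's clearer (property names assumed per the Oracle.ManagedDataAccess.Client documentation):

```csharp
using Oracle.ManagedDataAccess.Client;

var builder = new OracleConnectionStringBuilder
{
    DataSource = "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<host>)(PORT=1521))" +
                 "(CONNECT_DATA=(SERVICE_NAME=<service>)))",
    UserID = "<user>",
    Password = "<pass>",
    Pooling = true,
    MinPoolSize = 30,
    MaxPoolSize = 1000,
    IncrPoolSize = 10,
    ConnectionTimeout = 30   // seconds a request waits for a pooled connection
};
string connectionString = builder.ConnectionString;
```

If I understand the docs correctly, Connection Timeout is the number of seconds a request waits for a free pooled connection, so it's the timeout the ORA-50000 above is reporting.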

I've tried lots of things and am somewhat at my wits' end. .NET + the Npgsql driver doesn't even bat an eye at similar workloads, but here I can't for the life of me seem to make any progress whatsoever.

The application is .NET 5.0. We've tested many versions of the library; most recently we're targeting 3.21.120.

We just set Pooling=false and reran the tests. We still get the same errors, but initial results suggest performance was massively better.

I've watched the Windows performance counters and I can see the pools doing what they're supposed to: connection pools growing by the increment size as they near capacity. The flat periods in the chart are where the app simply stopped responding.

[chart: connection pool performance counters during the load test]

NOTE: the number of connection pools growing here is the result of a weird problem in our login process, but it only happens when apps hit our site for the first time; subsequent requests are handled as expected.

I'm at a loss and will take any advice anyone can give me. Are there any guidelines for connection pooling parameters and connection handling for a large-scale API? Or suggestions for things to investigate?

For reference, here are earlier reports of these problems with more profiling output, though probably against an older version of the library: https://github.com/oracle/dotnet-db-samples/issues/243#issuecomment-1131643612

alexkeh commented 7 months ago

You could be hitting a bug. An ORA-50000 is atypical for connection timeouts. There's a known internal bug 35955560 with similar symptoms.

I would recommend opening up an Oracle Support service request and sharing a level 7 trace of your app hitting this issue. Oracle can take a look and see what is triggering the ORA-50000.
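If it helps, the trace can be enabled programmatically in managed ODP.NET before the first connection is opened (a sketch assuming the OracleConfiguration API; the trace directory is a placeholder):

```csharp
using Oracle.ManagedDataAccess.Client;

// Set both before any OracleConnection is opened.
OracleConfiguration.TraceFileLocation = @"C:\odp-traces";
OracleConfiguration.TraceLevel = 7;   // level 7 = most detailed tracing
```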

alexkeh commented 7 months ago

Closing issue for now. I will reopen if we get any new info to continue this investigation.

The problem seems like a bug, but we need more info to diagnose the root cause.