Tagl opened this issue 1 year ago (status: Open)
Do you have your sample code that I can try to help track down the problem?
Do you have a way I can reproduce the problem? Then I will see if I can find a fix for it.
Yes, sorry for not including that in the original post. Here is a sample program that creates multiple threads, each with its own connection. It also uses the library I mentioned before, SQLAPI++. I've run something similar to this (tweaking the number of thread pools, threads, and iterations per thread, and adding mutexes) a hundred times, and almost every single time it crashes with this error within 15 minutes. I have not noticed any particular pattern in when it crashes (it seems rather random), but it's always in the same code path. The closed-source software in which I originally discovered this generally crashes after around 5 minutes.
#include <SQLAPI.h> // main SQLAPI++ header
#include <algorithm>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>
using namespace std;
void connect_task(int pool_id, int thread_id) {
    for (int j = 0; j < 1; j++) {
        SAConnection con;
        try {
            con.Connect("Server=someserver;Database=somedb;Driver={ODBC Driver 17 for SQL Server}", "someuser", "somepass", SA_ODBC_Client);
            auto random_wait = std::chrono::milliseconds(rand() % 100);
            this_thread::sleep_for(random_wait);
            if (con.isConnected()) {
                con.Disconnect();
            }
        }
        catch (SAException& x) {
            try {
                con.Rollback();
            }
            catch (SAException&) {
            }
            cout << "Error text: " << x.ErrText().GetMultiByteChars() << endl;
        }
        catch (...) {
            cout << "unknown exception" << endl;
        }
    }
}

int main(int argc, char* argv[])
{
    setlocale(LC_ALL, "");
    for (int j = 0; j < 1000; j++) {
        vector<std::thread> threads;
        for (int i = 0; i < 1000; i++) {
            threads.emplace_back(connect_task, j, i);
            auto random_wait = std::chrono::milliseconds(rand() % 100);
            this_thread::sleep_for(random_wait);
        }
        for (auto& t : threads) {
            t.join();
        }
    }
    return 0;
}
Ok, I can reproduce it. Looks like it's a race condition once the DM starts to expire connections from the pool. Will see what I can work out.
OK, so I think I can see the problem. It's one where I think I need some opinions. The basic issue is that SQLAPI issues multiple calls to SQLAllocEnv and then does pooling. As it stands, the DM tries to match multiple envs with pooling, but it's basically failing. The failure starts once the pool entries start to time out. Once they time out, the DM starts to disconnect them, but the disconnect is getting mixed up with the various env handles. There is a lot of code in there to try and allow pooling, multiple envs, and multiple drivers to be mixed and matched. Windows seems to avoid the problem by forcing a shared environment handle when pooling is used.
I can add shared handles, but the DM also tries very hard to avoid people reporting that there is still memory allocated and in use when the program exits, so it's all back to counting usages.
There is a lot of code around pooling that looks a lot more complex than it probably needs to be. We have things like pooling timeouts and usage counts that I think Windows avoids. I think a lot of this can be simplified if we implement shared env handles when pooling is enabled.
Part of the reported problem is a combination of release_env being called while connections are reaching the pooling timeout value.
What do folk think about this, and any suggestions on direction of travel?
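To make the direction concrete, here is a minimal sketch of a shared environment handle with usage counting, the scheme being discussed; all names are invented for illustration and none of this is unixODBC's actual code:

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical sketch only: every "SQLAllocEnv" returns the same shared
// environment, and a usage count decides when it is really destroyed, so
// nothing is left allocated at exit once the last user releases it.
struct Env {
    int use_count = 0;
};

class SharedEnvRegistry {
    std::mutex m;
    std::unique_ptr<Env> shared;  // the single env handed to every caller
public:
    Env* alloc() {
        std::lock_guard<std::mutex> g(m);
        if (!shared) shared = std::make_unique<Env>();
        shared->use_count++;
        return shared.get();
    }
    // Returns true only when the last user releases the env; until then a
    // pool-expiry thread that still holds it can never see it destroyed.
    bool release(Env* e) {
        std::lock_guard<std::mutex> g(m);
        assert(shared && e == shared.get());
        if (--shared->use_count == 0) {
            shared.reset();
            return true;
        }
        return false;
    }
};
```

Two callers that each allocate and release would share one Env, and only the second release tears it down. The trade-off mentioned above remains: with a truly shared env, something must still keep the count from hitting zero while pooled connections are meant to survive, which is where the "memory still in use at exit" complaint comes from.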
Hello, I helped locate a memory leak in unixODBC about a year ago while using my multithreaded server with ODBC connection pooling. I am no longer using this multithreaded server with connection pooling, but I recall that I had no problem using the ODBC API following the Microsoft ODBC docs' advice for connection pooling. I had a stress test in those days, so I could check with Valgrind that the memory leak was gone, and there was no error on exiting the program after running thousands of requests and using the pool heavily.
You must set SQL_ATTR_CONNECTION_POOLING and share a single hEnv among threads; in my case I used a global variable hEnv:
rc = SQLSetEnvAttr( NULL, SQL_ATTR_CONNECTION_POOLING,
                    (SQLPOINTER)SQL_CP_ONE_PER_HENV, 0 );
if ( rc != SQL_SUCCESS )
    std::cerr << "[ODBC] Failed to set connection pooling option" << "\n";

rc = SQLAllocHandle( SQL_HANDLE_ENV, SQL_NULL_HANDLE, &henv );
if ( rc != SQL_SUCCESS ) {
    std::cerr << "[ODBC] Failed to allocate shared environment" << "\n";
    throw std::runtime_error("Cannot initialize ODBC");
}
I don't know if this sheds some light on what may be happening with that SQLAPI library, but I did not have any problem using the ODBC API with threads, with each thread managing its own connection using the shared hEnv; this hEnv was freed on program exit.
Regards, Martin
Yep, in this case the lib is using multiple envs.
I remember that the only limitation relative to the ODBC docs was that unixODBC does not support SQL_CP_ONE_PER_DRIVER; the constant does not exist in the unixODBC headers.
Ok, so I have been looking at this and have found and fixed a couple of problems in the DM that the test code shows. I have not put that back in git yet, but it's on the ftp site now if the OP wants to try it.
However, I think there is also a race condition in SQLAPI that is causing fatal problems the DM is unlikely to be able to catch. I see the application, via the lib, calling SQLFreeHandle on an env handle, and then using the same (now freed) handle to call SQLAllocHandle and SQLDriverConnect. It's possible that it gets away with this (more by luck than judgement) on Windows because of the shared env handle used with pooling. I have a snippet of a DM log that shows the problem happening:
[ODBC][1701236][1684403048.206901][SQLFreeHandle.c][220] Entry: Handle Type = 1 Input Handle = 0x7f7014003a50
[ODBC][1701236][1684403048.207035][SQLAllocHandle.c][395] Entry: Handle Type = 2 Input Handle = 0x7f7014003a50
[ODBC][1701236][1684403048.207059][SQLAllocHandle.c][511] Exit:[SQL_SUCCESS] Output Handle = 0x7f6ffc000d40
[ODBC][1701236][1684403048.207093][SQLDriverConnect.c][751] Entry: Connection = 0x7f6ffc000d40
I am assuming that, because of the threaded nature of the application, the code in SQLFreeHandle has not yet got to the point of releasing the handle, so the handle validation code hasn't had a chance to mark the handle as invalid and the misuse is not caught on the following calls. But I think it's a question to ask the SQLAPI folk. As it's a closed-source commercial lib, I guess it's best that the OP asks them the question in the first instance.
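As an illustration of the kind of check being described (this is a toy model with invented names, not the DM's actual implementation), a handle can carry a magic word that is cleared as the very first step of freeing, so later entry points fail validation instead of touching freed memory:

```cpp
#include <cassert>
#include <cstdint>

// Toy model of driver-manager handle validation. Real ODBC defines these
// return codes in sql.h; they are redefined here to keep the sketch
// self-contained.
enum SqlRet { SQL_SUCCESS = 0, SQL_INVALID_HANDLE = -2 };

constexpr std::uint32_t HENV_MAGIC = 0x0DBCE547u;  // arbitrary marker value

struct EnvHandle {
    std::uint32_t magic = HENV_MAGIC;
};

SqlRet free_env(EnvHandle* h) {
    if (!h || h->magic != HENV_MAGIC) return SQL_INVALID_HANDLE;
    h->magic = 0;  // invalidate first, before any slow teardown, so a
                   // racing thread fails validation rather than operating
                   // on a half-freed handle
    // ... pool drain / driver unload would happen here ...
    return SQL_SUCCESS;
}

SqlRet alloc_connection_on(EnvHandle* h) {
    if (!h || h->magic != HENV_MAGIC) return SQL_INVALID_HANDLE;
    return SQL_SUCCESS;  // a real DM would hand back a connection handle
}
```

In the race described above, SQLFreeHandle had not yet reached the invalidation step when the next call arrived; clearing the marker before the teardown work narrows (though cannot fully close) that window, which is why the DM can at best report the misuse rather than make it safe.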
Thank you very much for the quick responses. I can raise these concerns with the maintainer of SQLAPI.
I'll try the updated version later today.
Thanks. The version on ftp now will catch this condition and report to stderr that it's happened before returning SQL_INVALID_HANDLE.
That's great, I'll make sure to turn on tracing to try to spot it.
It will still catch and display the error without tracing, though it will also happen with tracing on. Part of the problem is that, this being a race condition, adding tracing may alter the order these events happen in and when they happen.
I've tried it, and it no longer crashes, but it seems as if connection pooling isn't really active. Before the fix, "connecting" to the database usually took 0 ms since the connection was already alive; now, with the same settings, connecting seems to take 7-13 ms, which is the same speed I saw before enabling connection pooling.
I feel like the changes you described should not have a serious performance impact by themselves, so it looks as if a new connection is being established each time. Any idea why that would happen after these changes?
I have the same config as in the original post, and I have it both in /etc/odbcinst.ini and /usr/local/etc/odbcinst.ini.
What I see the test application (via the lib) doing is: it creates an environment, then creates a couple of connections on that env, then disconnects them so they return to the pool, and then releases the environment. What I have changed is that releasing the application environment now releases its pooled connections. The app then creates a new application env and a few connections, closes them, and releases that env. So there is some pooling going on, but releasing the environment clears that collection of connections out.
Now, it may be that pooled connections could be safely shared between application environments, but the change I made prevents this from happening at the moment, as it was partly this that was causing the crash.
I suspect (but can't know, as it's closed source) that SQLAPI gets away with mixing and matching application environments on Windows because the environment there is shared. I did try adding shared environments with usage counting, but the lib runs the usage count down to zero, so the DM would release the env just as it does now and there is no real advantage.
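The behaviour change described here can be pictured with a toy pool keyed by application environment (invented names, not the DM's code): releasing an environment drops that environment's share of the pool, so no expiry pass can later close a connection whose env is already gone:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct PooledConn { int id; };

// Idle connections keyed by the application environment that created them.
class Pool {
    std::map<int, std::vector<PooledConn>> by_env;
public:
    void park(int env_id, PooledConn c) { by_env[env_id].push_back(c); }

    // Reuse an idle connection from this env's share of the pool, if any.
    bool take(int env_id, PooledConn& out) {
        auto it = by_env.find(env_id);
        if (it == by_env.end() || it->second.empty()) return false;
        out = it->second.back();
        it->second.pop_back();
        return true;
    }

    // release_env: a pooled connection never outlives its environment,
    // which also means connections are not shared across environments.
    void release_env(int env_id) { by_env.erase(env_id); }

    std::size_t idle(int env_id) const {
        auto it = by_env.find(env_id);
        return it == by_env.end() ? 0 : it->second.size();
    }
};
```

This also explains the performance report above: a library that allocates a fresh env per batch of connections gets no reuse, because each release_env empties the pool it just filled.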
Try the one there now (on ftp); this implements a shared env handle when pooling is used. It does mean that the env and pooled connections will still be there on program exit, but it can't release these if the shared env is in place, as there is no real end point.
I will try it on Monday morning as I'm on holiday right now. I can avoid the usage count going down to zero by maintaining an extra SAConnection instance as a keep-alive for the pool. That worked with the sample code at least.
The new version you uploaded works as well. No extra connection object required on my end to keep the pool alive.
Ok, so the latest version creates a single env if pooling is in use, which I think matches what happens on Windows. The only downside is that it's then not possible to close things down before exit. Given that the shared env behaviour is enabled by defining SHARED_POOLED_ENV in drivermanager.h, and that you have a solution to the same problem, my instinct is always to do the least possible to address a problem, so I think I will leave it undefined; you can either make your own build or use your extra connection object.
I think the main body of changes does, however, address the original problem that was causing the seg fault.
Just what change did you make to the test code to generate the extra connection? I would be interested in seeing if it fixes the use-after-release problem I saw in the SQLAPI lib. The shared env does hide this, but doesn't address the underlying issue. I am guessing the Windows shared env has hidden the problem from the lib maintainer up to now.
Will leave it a bit and then push back into git, unless anyone else has a view on this.
Just change the main function like so:
int main(int argc, char* argv[])
{
    setlocale(LC_ALL, "");
    SAConnection keepAlive;
    keepAlive.Connect("Server=someserver;Database=somedb;Driver={ODBC Driver 17 for SQL Server}", "someuser", "somepass", SA_ODBC_Client);
    for (int j = 0; j < 1000; j++) {
        vector<std::thread> threads;
        for (int i = 0; i < 1000; i++) {
            threads.emplace_back(connect_task, j, i);
            auto random_wait = std::chrono::milliseconds(rand() % 100);
            this_thread::sleep_for(random_wait);
        }
        for (auto& t : threads) {
            t.join();
        }
    }
    keepAlive.Disconnect();
    return 0;
}
Thanks.
I have pushed back the changes that prevent the seg fault with SQLAPI and this set of operations.
With the keep-alive connection changes in my company's software, we tried running a stress test with v2.3.11 and v2.3.12pre. Both ran without crashing, but I saw the memory leak mentioned before on 2.3.11, and that test was cut short due to the machine running out of memory. I also ran 2.3.12pre without the keep-alive changes and it did not crash. Both the leak and the crash seem to be completely resolved in 2.3.12pre. Thanks for the quick responses; the issue is resolved.
I am interested in this fix. Is there an estimate when v2.3.12 will be released? Thanks!
Not at the moment, but the code is there to be used.
I have connection pooling enabled with the Microsoft ODBC Driver for SQL Server. I have a multithreaded application where each thread except the main thread opens and closes connections in sequence. It seems that the same pooled connection is being closed twice in the pool search. This causes the program to crash in close_pooled_connection with the error displayed as "double free detected in tcache 2". Without connection pooling, there is no issue.
I'm using C++ with another library, SQLAPI++, but following the problem led me to the close_pooled_connection function here. I have tried isolating the connection code within my program with a mutex, since I originally thought it was a thread safety issue, but the issue seems to occur even with isolation.
My odbcinst.ini file:
I've tried unixODBC versions 2.3.9, 2.3.11 and 2.3.12pre; the issue is present in all of them. I'm on Ubuntu 22.04, but I tested on RHEL 9 as well; the issue is present on both.
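The odbcinst.ini contents did not survive in this thread. For readers wanting to reproduce the setup, a typical pooling-enabled configuration looks roughly like the following; the driver path and timeout value are illustrative, not taken from the report:

```ini
[ODBC]
Pooling = Yes

[ODBC Driver 17 for SQL Server]
Description = Microsoft ODBC Driver 17 for SQL Server
Driver      = /opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.10.so.2.1
CPTimeout   = 120
```

CPTimeout (in seconds) is what drives the pool-expiry path discussed in this thread; the crash appeared once pooled entries began to time out.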