Connection establishment errors introduced with `2.3.12`

RobinHolzingerQC commented 10 months ago

I am using SQLAlchemy (2.0) to maintain connections to an MS SQL Server instance in Python. Under the hood unixodbc is used as a driver when the scripts are run under Linux.

With the release of 2.3.12 I noticed some multiprocessed (Python multiprocessing) applications failing with sqlalchemy/unixodbc errors that did not appear to be deterministic. After further debugging, I found out that this problem is also occurring in single-threaded applications that try to establish many connections in a short amount of time via SqlAlchemy having pooling enabled.

After building unixodbc at different commits on main and integrating these builds into my Python environment I was able to narrow down the issue to Fix possible seg faults with SQLAPI and pooling. As this commit also has significant changes with regard to pooling it is very likely that this is indeed the root of my problems.

The error messages that I’m facing look like the following:

E       pyodbc.OperationalError: ('08S01', '[08S01] [Microsoft][ODBC Driver 18 for SQL Server]TCP Provider: Error code 0x2746 (10054) (SQLExecDirectW)')
...
>       raise exc.ResourceClosedError("This Connection is closed")
E       sqlalchemy.exc.ResourceClosedError: This Connection is closed

../../../<…>…/lib/python3.10/site-packages/sqlalchemy/engine/base.py:655: ResourceClosedError

Most of these errors have in common that they contain some TCP Provider error (mostly error code 0x2…).

I also drafted a small example showing the problems.

Before diving deeper into the inner workings of unixodbc and the specifics of commit 2a73009 I wanted to ask if anyone has an idea where this could be coming from. @lurcher Any idea?

In case anyone else is experiencing these issues, please let me know.

lurcher commented 10 months ago

Ok, so I have been trying to get to the bottom here. What is failing is the SSL connection fails, and it seems to be the case (at least from what I can see) that it fails when a connection is reused via the pool from one thread to another. The driver manager is fine with that, but the driver (both Easysoft and Microsoft) both seem to fail in the same way when the SSL connection is passed between threads (seems to fail with a invalid MAC in the TLS data).

Its not perfect, but I have altered the pooling code to not not pass connections between processes. After doing that it seems to be working fine with the test case. Its a shame that we don't seem to be able to do this safely, but its the way it seems to be.

The change is back in git and in unixODBC-2.3.13pre.tar.gz on the ftp site. Give that a go and see if it works for you.

if the MS folk have any further info about SSL and the driver please let me know.

lurcher commented 10 months ago

On 19/09/2023 15:49, Nick Gorham wrote:

Ok, so I have been trying to get to the bottom here. What is failing is the SSL connection fails, and it seems to be the case (at least from what I can see) that it fails when a connection is reused via the pool from one thread to another. The driver manager is fine with that, but the driver (both Easysoft and Microsoft) both seem to fail in the same way when the SSL connection is passed between threads (seems to fail with a invalid MAC in the TLS data).

Its not perfect, but I have altered the pooling code to not not pass connections between processes. After doing that it seems to be working fine with the test case. Its a shame that we don't seem to be able to do this safely, but its the way it seems to be.

The change is back in git and in unixODBC-2.3.13pre.tar.gz on the ftp site. Give that a go and see if it works for you.

if the MS folk have any further info about SSL and the driver please let me know.

Ok, think I now know whats going on. Seems that Python makes use of magic memory, I can see that the connections in the pool seem to live in different memory spaces, and because of this when the child end and the main process takes over (in the test application) the memory pool is rewound. This is leaking connections that were in the child pool, and also it can mean that a connection that was created in the main thread, may have been used in the child thread, and then when time is rewound (that's what it looks like to the process), that connection that was used is now back in the state of being unused again.

The driver can get away with that, as there were nothing in process, so it just carries along talking to the server on the same socket. However if SSL is in use, the replay detection in TLS kicks in as the server at the other end spots the fact that the client has gone backwards in time. Then when the server reports that back, the child sees that the server is also out of sequence. TLS shuts down the connection at this point and it fails as we have seen.

Bottom line is, I don't think the driver manager pooling is doing anything wrong here, its just not going to play with what python is doing. It worked in previous versions as the pooling was not sharing the connections as it is now, so in effect, it was working by luck.

RobinHolzingerQC commented 10 months ago

Thanks @lurcher for tackling this issue so fast!

To me, it looks as if bb7ed6e already solved a fair chunk of the TCP issues if was encountering. However, there are still a couple of ResourceClosed / TCP related failures in my application (not the minimal example on gist) when using the most recent commit on main.

v-chojas commented 10 months ago

Are you actually forking the process and trying to reuse connections that way? Yes, as @lurcher says, that will cause a bunch of problems. However, multiple threads within the same process taking their turn with a single connection handle (as long as it's not concurrent) shouldn't be a problem.

lurcher commented 10 months ago

On 20/09/2023 15:42, v-chojas wrote:

Are you actually forking the process and trying to reuse connections that way? Yes, as @lurcher https://github.com/lurcher says, that will cause a bunch of problems. However, multiple threads within the same process taking their turn with a single connection handle (as long as it's not concurrent) shouldn't be a problem.

Message ID: @.***>

I think that's the problem, behind the simple looking application, the process is forking the child.

lurcher commented 10 months ago

On 20/09/2023 15:55, Nick Gorham wrote:

On 20/09/2023 15:42, v-chojas wrote:

Are you actually forking the process and trying to reuse connections that way? Yes, as @lurcher https://github.com/lurcher says, that will cause a bunch of problems. However, multiple threads within the same process taking their turn with a single connection handle (as long as it's not concurrent) shouldn't be a problem.

Message ID: @.***>

I think that's the problem, behind the simple looking application, the process is forking the child.

I would have saved myself a lot of time if I had just realized it was forking in the first place :-(.

RobinHolzingerQC commented 10 months ago

I would have saved myself a lot of time if I had just realized it was forking in the first place :-(.

Sorry for the confusion, yes Python multiprocessing Pools fork processes. As Python does not have 'true' multithreading due to the GIL (Global Interpreter Lock), multiple processes are needed to truly parallelize in Python.

Are you actually forking the process and trying to reuse connections that way?

I am forking the processes (or rather Python does), but I do not really intend to reuse the connections by doing that. As I am creating new sqlalchemy engines for every process, I thought that the connections wouldn't be reused.

lurcher commented 10 months ago

On 20/09/2023 16:03, Nick Gorham wrote:

On 20/09/2023 15:55, Nick Gorham wrote:

On 20/09/2023 15:42, v-chojas wrote:

Are you actually forking the process and trying to reuse connections that way? Yes, as @lurcher https://github.com/lurcher says, that will cause a bunch of problems. However, multiple threads within the same process taking their turn with a single connection handle (as long as it's not concurrent) shouldn't be a problem.

Message ID: @.***>

I think that's the problem, behind the simple looking application, the process is forking the child.

I would have saved myself a lot of time if I had just realized it was forking in the first place :-(.

The problem is caused by a connection being opened and closed from the main process. That then gets copied over the fork each time it creates a child because it gets pushed back into the pool. This connection then gets reused in the child each time. Simple to see the problem once you realize whats going on.

RobinHolzingerQC commented 9 months ago

@lurcher Thanks for the clarification.

Did I understand correctly that currently there's no way of getting the pooling to work on Python and you therefore rolled back the changes that were made with 2.3.12 in that regard?

I am asking as I am still facing TCP 0x20 (ResourceClosedErrors) issues when running the most recent commit of unixobc's main branch (0x9b31cd2).

lurcher commented 9 months ago

On 26/09/2023 14:22, Robin Holzinger wrote:

@lurcher https://github.com/lurcher Thanks for the clarification.

Did I understand correctly that currently there's no way of getting the pooling to work on Python and you therefore rolled back the changes that were made with |2.3.12| in that regard?

I am asking as I am still facing |TCP 0x20 (ResourceClosedErrors)| issues when running the most recent commit of unixobc's the main branch (|0x9b31cd2|).

Pooling should (AFAIK) be fine in python now, as long as you only try and pool within the same process with the same DSN. What your code was doing was creating a pooled connection in the main code, then that pooled connection was being used in the child processes, which was breaking that connection especially when SSL was used, as it would see the reversion and throw the error your seeing.

I have not been making changes to 2.3.12, that's released and fixed. All the changes have been in 2.3.13pre. What I did at one point was add code that stopped the pooling working across threads and processes, this solved your problem as it effectively stopped the pooling. But I rolled that back (in 2.3.13pre) because it would have affected users pooing in threaded applications when there was no need for that.

What I suggested that I think will work for you is to use a different DSN (same details, but because the name is different it won't get mixed up in the pooling lists) in the main process and the clients to prevent them getting reused in child processes.

Other than that, the way Python is doing its multiprocessing is not compatible with connection pooing.

jabbera commented 9 months ago

I am forking the processes (or rather Python does), but I do not really intend to reuse the connections by doing that. As I am creating new sqlalchemy engines for every process, I thought that the connections wouldn't be reused.

Change your multiprocessing start type to spawn. It's much slower but, it will not copy memory and you should avoid this issue: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

greatvovan commented 4 months ago

Python does not have 'true' multithreading due to the GIL (Global Interpreter Lock), multiple processes are needed to truly parallelize in Python

Are you doing some heavy computations in your script? While you are correct for CPU-bound tasks, Python multithreading is fine to workaround IO-wait, that is common when working with databases.

lurcher / unixODBC

Connection establishment errors introduced with `2.3.12` #149