lurcher / unixODBC

The unixODBC Project's goal is to develop and promote unixODBC as the definitive standard for ODBC on non-MS-Windows platforms.
GNU Lesser General Public License v2.1

buffer overflow in search_for_pool #128

Closed jabbera closed 1 year ago

jabbera commented 1 year ago

I'm sorry if this is not the correct place to start with this issue, but this library happens to be at the bottom of the offending stack.

There appears to be some sort of incompatibility between pyodbc, unixODBC, connection pooling, and long connection strings.

Here is the stack trace when running under valgrind:

==1== *** memmove_chk: buffer overflow detected ***: program terminated
==1==    at 0x484E7CC: ??? (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1==    by 0x4853323: __memmove_chk (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1==    by 0x531B54B: search_for_pool (in /opt/conda/lib/libodbc.so.2.0.0)
==1==    by 0x533AB3E: SQLDriverConnectW (in /opt/conda/lib/libodbc.so.2.0.0)
==1==    by 0x52EC8E2: Connect (connection.cpp:114)
==1==    by 0x52EC8E2: Connection_New(_object*, bool, bool, long, bool, _object*, Object&) (connection.cpp:286)
==1==    by 0x52F6D7F: mod_connect(_object*, _object*, _object*) (pyodbcmodule.cpp:553)
==1==    by 0x4FE1A6: cfunction_call (methodobject.c:543)
==1==    by 0x4F7B8A: _PyObject_MakeTpCall (call.c:215)
==1==    by 0x4F428C: UnknownInlinedFun (abstract.h:112)
==1==    by 0x4F428C: UnknownInlinedFun (abstract.h:99)
==1==    by 0x4F428C: UnknownInlinedFun (abstract.h:123)
==1==    by 0x4F428C: UnknownInlinedFun (ceval.c:5891)
==1==    by 0x4F428C: _PyEval_EvalFrameDefault (ceval.c:4231)
==1==    by 0x594B71: UnknownInlinedFun (pycore_ceval.h:46)
==1==    by 0x594B71: _PyEval_Vector (ceval.c:5065)
==1==    by 0x594AB6: PyEval_EvalCode (ceval.c:1134)
==1==    by 0x5C6E56: run_eval_code_obj (pythonrun.c:1291)

You can reproduce this by running:

docker run jabberaa/spark-odbc-issue:latest

The repro source code is here:

https://github.com/jabbera/spark-odbc-issue

Happy to try any other techniques you can suggest to get to the bottom of the issue.

Thanks for your help.

jabbera commented 1 year ago

Trace:

[ODBC][8][1672885744.938200][__handles.c][499] Exit:[SQL_SUCCESS] Environment = 0x1f92a50
[ODBC][8][1672885744.938247][SQLSetEnvAttr.c][189] Entry: Environment = 0x1f92a50 Attribute = SQL_ATTR_ODBC_VERSION Value = 0x3 StrLen = 4
[ODBC][8][1672885744.938282][SQLSetEnvAttr.c][381] Exit:[SQL_SUCCESS]
[ODBC][8][1672885744.938294][SQLAllocHandle.c][395] Entry: Handle Type = 2 Input Handle = 0x1f92a50
UNICODE Using encoding ASCII 'UTF-8' and UNICODE 'UCS-2LE'

[ODBC][8][1672885744.938341][SQLAllocHandle.c][531] Exit:[SQL_SUCCESS] Output Handle = 0x1f85100
[ODBC][8][1672885744.939162][SQLDriverConnectW.c][298] Entry: Connection = 0x1f85100 Window Hdl = (nil) Str In = [Driver=a;HOST=THISDOESNTMATTERREALLY.azuredatabricks.net;PORT=443;SparkServerType=3;AuthMech=11;ThriftTransport=2;SSL=1;HTTPPat...][length = 2083 (SQL_NTS)] Str Out = (nil) Str Out Max = 0 Str Out Ptr = (nil) Completion = 0

jabbera commented 1 year ago

I compiled with AddressSanitizer:

=================================================================
==16==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61b000010928 at pc 0x7fcf01ee994e bp 0x7fff55ae80b0 sp 0x7fff55ae7860
WRITE of size 1976 at 0x61b000010928 thread T0

#0 0x7fcf01ee994d in __interceptor_memcpy /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827

#1 0x7fcefe2b66d4 in copy_nts (/opt/conda/lib/python3.10/site-packages/../../libodbc.so.2+0x566d4)
#2 0x7fcefe2b9852 in search_for_pool (/opt/conda/lib/python3.10/site-packages/../../libodbc.so.2+0x59852)
#3 0x7fcefe327419 in SQLDriverConnectW (/opt/conda/lib/python3.10/site-packages/../../libodbc.so.2+0xc7419)
#4 0x7fcefed298e2 in Connect src/connection.cpp:114
#5 0x7fcefed298e2 in Connection_New(_object*, bool, bool, long, bool, _object*, Object&) src/connection.cpp:286
#6 0x7fcefed33d7f in mod_connect src/pyodbcmodule.cpp:553
#7 0x4fe1a6 in cfunction_call /usr/local/src/conda/python-3.10.8/Objects/methodobject.c:543
#8 0x4f7b8a in _PyObject_MakeTpCall /usr/local/src/conda/python-3.10.8/Objects/call.c:215
#9 0x4f428c in _PyObject_VectorcallTstate /usr/local/src/conda/python-3.10.8/Include/cpython/abstract.h:112
#10 0x4f428c in _PyObject_VectorcallTstate /usr/local/src/conda/python-3.10.8/Include/cpython/abstract.h:99
#11 0x4f428c in PyObject_Vectorcall /usr/local/src/conda/python-3.10.8/Include/cpython/abstract.h:123
#12 0x4f428c in call_function /usr/local/src/conda/python-3.10.8/Python/ceval.c:5891
#13 0x4f428c in _PyEval_EvalFrameDefault /usr/local/src/conda/python-3.10.8/Python/ceval.c:4231
#14 0x594b71 in _PyEval_EvalFrame /usr/local/src/conda/python-3.10.8/Include/internal/pycore_ceval.h:46
#15 0x594b71 in _PyEval_Vector /usr/local/src/conda/python-3.10.8/Python/ceval.c:5065
#16 0x594ab6 in PyEval_EvalCode /usr/local/src/conda/python-3.10.8/Python/ceval.c:1134
#17 0x5c6e56 in run_eval_code_obj /usr/local/src/conda/python-3.10.8/Python/pythonrun.c:1291
#18 0x5c1d3f in run_mod /usr/local/src/conda/python-3.10.8/Python/pythonrun.c:1312
#19 0x45adf1 in pyrun_file /usr/local/src/conda/python-3.10.8/Python/pythonrun.c:1208
#20 0x5bc25e in _PyRun_SimpleFileObject /usr/local/src/conda/python-3.10.8/Python/pythonrun.c:456
#21 0x5bc062 in _PyRun_AnyFileObject /usr/local/src/conda/python-3.10.8/Python/pythonrun.c:90
#22 0x5b8e7c in pymain_run_file_obj /usr/local/src/conda/python-3.10.8/Modules/main.c:357
#23 0x5b8e7c in pymain_run_file /usr/local/src/conda/python-3.10.8/Modules/main.c:376
#24 0x5b8e7c in pymain_run_python /usr/local/src/conda/python-3.10.8/Modules/main.c:591
#25 0x5b8e7c in Py_RunMain /usr/local/src/conda/python-3.10.8/Modules/main.c:670
#26 0x587c28 in Py_BytesMain /usr/local/src/conda/python-3.10.8/Modules/main.c:1090
#27 0x7fcf01bb2d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#28 0x7fcf01bb2e3f in __libc_start_main_impl ../csu/libc-start.c:392
#29 0x587add  (/opt/conda/bin/python3.10+0x587add)
jabbera commented 1 year ago

Oof. Looks like there is an inherent limit of 1024 characters for the connection string that is assumed pretty much all over the place in the code. We are running into the same issue that was reported to Microsoft here: https://github.com/dotnet/runtime/issues/63630

After initially declining to fix the issue, they did fix it once the customer let them know it was for accessing Databricks via the Simba Spark ODBC driver. We have to attach a JWT to the connection string, and the token itself is well over 1024 characters.

lurcher commented 1 year ago

My first guess is it's this structure in driver_manager.h:

typedef struct connection_pool_head
{
    struct connection_pool_head *next;

    char    driver_connect_string[ 1024 ];
    int     dsn_length;
    char    server[ 128 ];
    int     server_length;
    char    user[ 128 ];
    int     user_length;
    char    password[ 128 ];
    int     password_length;

    int     num_entries;                /* always at least 1 */
    CPOOLENT *entries;
} CPOOLHEAD;

Could you try changing your local unixODBC copy to increase that and check whether it fixes your problem? If so, it may be worth making driver_connect_string a char * and allocating it rather than having a fixed buffer.
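The suggested change could look roughly like this (a minimal sketch; pool_head_new is a hypothetical helper name, and the real CPOOLHEAD has many more members than shown here):

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: driver_connect_string becomes a pointer sized to the actual
 * connection string instead of a fixed 1024-byte array. Field names
 * mirror the struct above; everything else is elided. */
typedef struct connection_pool_head {
    struct connection_pool_head *next;
    char *driver_connect_string;   /* was: char driver_connect_string[ 1024 ] */
    int   dsn_length;
} CPOOLHEAD;

/* Allocate a pool head and copy the connection string at its real length,
 * so a 2083-character string (as in the trace above) fits fine. */
static CPOOLHEAD *pool_head_new( const char *conn_str )
{
    CPOOLHEAD *head = calloc( 1, sizeof( *head ));
    if ( !head ) return NULL;

    size_t len = conn_str ? strlen( conn_str ) : 0;
    head -> driver_connect_string = malloc( len + 1 );
    if ( !head -> driver_connect_string ) {
        free( head );
        return NULL;
    }
    memcpy( head -> driver_connect_string, conn_str ? conn_str : "", len );
    head -> driver_connect_string[ len ] = '\0';
    head -> dsn_length = (int) len;
    return head;
}
```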

lurcher commented 1 year ago

Looking a bit further, there is also a limit of 1k in SQLDriverConnect when the connection string is generated by the GUI, but I don't think that's going to happen now, as the GUI is AFAIK no longer used.

jabbera commented 1 year ago

Moved conversation to here: https://github.com/lurcher/unixODBC/pull/129

lurcher commented 1 year ago

On 05/01/2023 12:49, Mike wrote:

Moved conversation to here: #129 https://github.com/lurcher/unixODBC/pull/129


I have some code to commit that should work dynamically. I will commit and you can check, should do the same as your patch

jabbera commented 1 year ago

This needs to be dynamically allocated as well.

https://github.com/lurcher/unixODBC/blob/46b117957c5402316e91797a01ed1b6a1183ac57/DriverManager/drivermanager.h#L384

lurcher commented 1 year ago

On 05/01/2023 13:13, Mike wrote:

This needs to be dynamically allocated as well.

https://github.com/lurcher/unixODBC/blob/46b117957c5402316e91797a01ed1b6a1183ac57/DriverManager/drivermanager.h#L384


Good catch. Though I am not sure that member is used now. Further changes pushed.

jabbera commented 1 year ago

Your fix didn't work for me. My compiler must not work the same way yours does. The behavior of calloc is implementation-defined when the size is 0, per the few sites I read; I'd assume that's the issue. It's been ages since I've done this stuff.

I had to do calloc(1, <size of allocation>) or swap the params. The first form seems to be the standard used in other parts of the code.
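To spell out the argument-order point: calloc(nmemb, size) zeroes nmemb * size bytes, so with the arguments swapped — calloc(strlen(s) + 1, 0) — the total request is 0 bytes and any later memcpy into it overflows the heap. A minimal sketch of the corrected pattern (dup_connect_string is a hypothetical helper name, mirroring only the allocation in the diff below):

```c
#include <stdlib.h>
#include <string.h>

/* Allocate one member of strlen(s) + 1 bytes, all zeroed, so the NUL
 * terminator is already in place before the copy. The broken form,
 * calloc(strlen(s) + 1, 0), would request zero usable bytes. */
static char *dup_connect_string( const char *s )
{
    char *out = calloc( 1, strlen( s ) + 1 );  /* not calloc( strlen(s)+1, 0 ) */
    if ( out != NULL )
        memcpy( out, s, strlen( s ));          /* terminator left from calloc */
    return out;
}
```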

Second, the additional member discussed is most certainly used. Here is the stack from AddressSanitizer without the second commit you made:

    #0 0x7f98922710dd in __set_local_attribute (/opt/conda/lib/python3.10/site-packages/../../libodbc.so.2+0x1110dd)
    #1 0x7f9892272d43 in __set_local_attributes (/opt/conda/lib/python3.10/site-packages/../../libodbc.so.2+0x112d43)
    #2 0x7f98921acc01 in __connect_part_one (/opt/conda/lib/python3.10/site-packages/../../libodbc.so.2+0x4cc01)
    #3 0x7f9892228465 in SQLDriverConnectW (/opt/conda/lib/python3.10/site-packages/../../libodbc.so.2+0xc8465)
    #4 0x7f9892b648e2 in Connect src/connection.cpp:114
    #5 0x7f9892b648e2 in Connection_New(_object*, bool, bool, long, bool, _object*, Object&) src/connection.cpp:286
    #6 0x7f9892b6ed7f in mod_connect src/pyodbcmodule.cpp:553

Here is the diff I needed to get me going based on current master:

diff --git a/DriverManager/SQLConnect.c b/DriverManager/SQLConnect.c
index 76f3f23..aa793ab 100644
--- a/DriverManager/SQLConnect.c
+++ b/DriverManager/SQLConnect.c
@@ -3625,13 +3625,13 @@ disconnect_and_remove:
             copy_nts( newhead -> user, user_name, &newhead -> user_length, name_length2 );
             copy_nts( newhead -> password, authentication, &newhead -> password_length, name_length3 );
             if ( connect_string == NULL ) {
-                newhead -> _driver_connect_string = calloc( 1, 0 );
+                newhead -> _driver_connect_string = calloc( 1, 1 );
             }
             else if ( connect_string_length < 0 ) {
-                newhead -> _driver_connect_string = calloc( strlen( connect_string ) + 1, 0 );
+                newhead -> _driver_connect_string = calloc( 1, strlen( connect_string ) + 1 );
             }
             else {
-                newhead -> _driver_connect_string = calloc( connect_string_length + 1, 0 );
+                newhead -> _driver_connect_string = calloc( 1, connect_string_length );
             }
             copy_nts( newhead -> _driver_connect_string, connect_string, &newhead -> dsn_length, connect_string_length );

Note: I removed the +1 in the case of a defined connection_string_length because copy_nts uses memcpy for that; no additional NUL terminator is needed.
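For reference, the SQL_NTS vs. explicit-length behaviour being relied on here can be sketched like this (a hypothetical reimplementation for illustration; the real copy_nts in unixODBC may differ in detail, and SQL_NTS is the value defined by the ODBC headers):

```c
#include <string.h>

#define SQL_NTS (-3)  /* ODBC: source string is null-terminated */

/* Sketch of a copy_nts-style helper: with SQL_NTS (or any negative
 * length) the source is treated as a C string and its NUL is copied via
 * strlen; with an explicit non-negative length, exactly that many bytes
 * are memcpy'd and no terminator is appended -- which is why the +1 was
 * dropped from the explicit-length allocation above. */
static void copy_nts( char *dst, const char *src, int *out_len, int src_len )
{
    if ( !src ) {
        *out_len = 0;
        return;
    }
    if ( src_len < 0 ) {                       /* SQL_NTS: copy incl. NUL */
        size_t n = strlen( src );
        memcpy( dst, src, n + 1 );
        *out_len = (int) n;
    }
    else {                                     /* explicit length: raw copy */
        memcpy( dst, src, (size_t) src_len );
        *out_len = src_len;
    }
}
```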

lurcher commented 1 year ago

On 05/01/2023 14:24, Mike wrote:

Your fix didn't work for me. My compiler must not work the same way yours does. The definition of calloc is implementation specific if size is 0 per the few sites I read. I'd assume that's the issue. It's been ages since I've done this stuff.

That was just stupid code from me, fix committed. Sorry.

jabbera commented 1 year ago

That fixes my issue! Thanks. Any idea when you will be cutting a release? I notice you don't do them very often.

lurcher commented 1 year ago

On 05/01/2023 14:45, Mike wrote:

That fixes my issue! Thanks. Any idea when you will be cutting a release? I notice you don't do them very often.

I prefer stability over lots of releases. It takes some time for the distros to update anyway. So no real plans but always up for being convinced.

jabbera commented 1 year ago

It takes some time for the distros to update anyway.

This is exactly my concern. Databricks is really pushing their SQL offering as a convenient way to access the output of Spark jobs (it's what got me going down this path). Most enterprise users are going to want to use the OAuth-token-based interface as opposed to a PAT/secret, since it's way easier to manage and support. The tokens themselves are over 1k in length. If it takes months for this to make its way into the distros and into conda feeds, releasing it only once a critical mass starts running into it will mean even longer delays and ugly workarounds.

Selfishly, I don't know how I'm going to solve this problem for the general case. If one of our quants/data scientists installs our internal access package, it pulls in unixODBC from conda. I don't exactly know how I'm going to override the packaged version of unixODBC with a custom build.

lurcher commented 1 year ago

On 05/01/2023 15:58, Mike wrote:

|It takes some time for the distros to update anyway. |

This is exactly my concern. Databricks is really pushing their SQL offering as a convenient way to access the output of spark processes that are generated from the spark jobs. (It's what got me going down this path). Most enterprise users are going to want to use the oauth token based interface as opposed to a PAT/secret since it's way easier to manage and support. If it takes months for this to make it's way into the distros and into conda feeds, releasing it once critical mass starts running into it will mean even longer delays and ugly workarounds.

Selfishly I don't know how I'm going to solve this problem for the general case. if one of our quants/data scientists installs our internal package it pulls in unixodbc from conda. I don't exactly know how I'm going to override the packaged versions of unixodbc with the a custom build.


Does it have any SQLSetConnectAttr option to pass in the token instead of the connection string?

jabbera commented 1 year ago

It looks like yes, but only if you initially passed a token in the connection string. I also don't know how to do this from pyodbc, which I'd assume would be one of the most common use cases. It's also not really documented anywhere: https://learn.microsoft.com/en-us/azure/databricks/integrations/jdbc-odbc-bi#--authentication-parameters

jabbera commented 1 year ago

FWIW, I looked through the pyodbc code, and even if that were a supported method (removing Auth_AccessToken causes the driver to throw, saying it's required), I don't see how I could do this in the pyodbc space.

lurcher commented 1 year ago

On 05/01/2023 17:07, Mike wrote:

FWIW I looked through the pyodbc code and even if that was a supported method (Removing the Auth_AccessToken causes the driver to throw saying it's required) I don't see how I could do this in the pyodbc space.


I just went and reminded myself how we (Easysoft) did it with OAuth and Spark. We allow the access token to be in the DSN as well as the connection string. But I didn't notice the problem, because the access tokens we had were only 20 characters or so.

jabbera commented 1 year ago

That (a DSN in odbc.ini) also wouldn't solve my issue. We need to make sure users in a shared compute environment are using their own credentials. Our shared library abstracts this from our users so they can basically do:

ourdatabricks.read(select blah)

Under the hood it gets an OAuth token for the user and crafts a custom connection string for them.

lurcher commented 1 year ago

On 05/01/2023 17:24, Mike wrote:

That also wouldn't solve my issue. We need to make sure users in a shared compute environment are using their own credentials. Our shared library extracts this from our users so they can basically do:

ourdatabricks.read(select blah)

under the hood it gets an oauth token for the user and crafts a custom connection string for them.

Fair enough. Sounds like a job for Kerberos, but anyhow, the fix is now in git.

jabbera commented 1 year ago

My life was so much easier before oauth!