I added polars and used pyodbc with `polars.read_database(...)`, and interestingly it seems to handle the string columns correctly on both LUW & mainframe Db2.
I can confirm there are no `NULL`s and no empty strings. They look like this: `500 Sep 2023 - FWK3`.
What operating system are you executing arrow-odbc on? I would have expected the non-Windows systems to emit the following warning:
> Ignoring indicators, because we expect the ODBC driver of your database to return garbage memory. We can not distinguish between empty strings and NULL. Everything is empty.
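For context on what that warning gives up: ODBC drivers report a per-cell length indicator, and it is the indicator (not the payload) that separates NULL from an empty string. A minimal sketch of that semantics (illustration only, not arrow-odbc source code):

```python
# Sketch of ODBC length-indicator semantics: an indicator of -1
# (SQL_NULL_DATA) marks NULL, 0 marks an empty string, and a positive
# value is the payload length in bytes. If the indicators are garbage
# and must be ignored, these three cases become indistinguishable.
SQL_NULL_DATA = -1

def interpret(indicator: int, payload: str):
    """Map one buffer cell to its logical value using its indicator."""
    if indicator == SQL_NULL_DATA:
        return None  # NULL in the database
    return payload[:indicator]

print([interpret(5, "hello"), interpret(0, ""), interpret(-1, "")])
# → ['hello', '', None]
```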
I am seeing that the name of your database system is now `DB2`. In another instance I thought it to be `DB2/LINUX` — did something change? One reason we could see this error is that I only activate the Db2 workarounds if the system identifies as `DB2/LINUX`. Just `DB2` avoids activating the workaround. It might very well be that, because of the broken indicators, some columns are now interpreted as NULL.
Found it: in this issue your database system name is different: https://github.com/pacman82/odbc-api/issues/398
Do you have any insight into why that might be?
Other than that, it is probably little effort on my side to just activate the workarounds for both odbc2parquet and arrow-odbc whenever the name starts with `DB2`.
I'm executing arrow-odbc on Linux/x86_64. But the error only happens on the mainframe z/OS version of Db2; I don't seem to have the problem with the LUW version of Db2, if that helps.
Let me know what other kind of meta information you need and I can provide it, as long as it doesn't leak any sensitive information.
arrow-odbc 2.1.5 has been released. It activates the Db2 workaround for any database whose name starts with `DB2`. This should make it work on platforms besides Linux.
Please tell me if this fixes your problem?
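The prefix check described above can be sketched as follows (a sketch of the described logic, not the actual odbc-api code):

```python
# Sketch of the workaround activation described above: treat any DBMS name
# starting with "DB2" (e.g. "DB2", "DB2/LINUX") as needing the quirk.
def needs_db2_workaround(dbms_name: str) -> bool:
    return dbms_name.upper().startswith("DB2")

print(needs_db2_workaround("DB2/LINUX"))             # → True
print(needs_db2_workaround("DB2"))                   # → True
print(needs_db2_workaround("Microsoft SQL Server"))  # → False
```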
@pacman82 would it be possible to selectively enable or disable these quirks? While the `libdb2.so` driver has incorrect data with null fields, the `libdb2o.so` driver correctly distinguishes between empty strings and nulls.
I guess it would. However, I wonder if we could auto-detect this. What database management system name is logged when using `libdb2o.so`? As I am not familiar with Db2: what is the difference between these?
The problem with querying the database for the DBMS name is that it depends on what the database returns, not the driver. So the issue with checking for "DB2", "DB2 Linux", "DB2 AIX", or any other combination is that it is reported by the database, not the driver.
The closest I got to auto-detection was calling `SQLDrivers` and `SQLDataSources`, pulling the data from the connection string and looking for either a "Driver" or "Driver64" line.
This also led to an issue I found with `odbc_api::environment::Environment` and the `drivers` function only returning the first attribute, but I can post more about that later when I have a moment to type it all out.
Thanks for the explanation. I did not realize it was solely the database itself controlling that string. For me this means I would actually walk back on auto-detection and provide users a way to specify these quirks themselves.
@keraion Could you enlighten me a bit as to why there is a `libdb2.so` and a `libdb2o.so`?
> @pacman82 would it be possible to selectively enable or disable these quirks? While the `libdb2.so` driver has incorrect data with null fields, the `libdb2o.so` driver correctly distinguishes between empty strings and nulls.
arrow-odbc 3.0.0 has been released. It allows you to set the parameter `driver_returns_memory_garbage_for_indicators` explicitly; it won't try to autodetect.

Best, Markus
@rupurt Does setting this parameter explicitly to `True` solve the issue for you?
From what I can tell, `libdb2.so` is a 32-bit driver, so it tries to return everything with 32-bit sizes. But since `odbc-api` correctly reads data based on the system's "bitness", it jumps over data and doubles up on types like `isize`.
Example query:

```sql
with t1 (col1) as (values 'hello', 'bonjour', '', NULL)
select cast(col1 as varchar(50)) from t1;
```
`libdb2.so`:

```
DEBUG - ODBC driver reported for column 0. Relational type: Varchar { length: Some(50) }; Nullability: Nullable; Name: '1';
DEBUG - SQLColAttribute called with attribute 'ConciseType' for column '1' reported 12.
DEBUG - SQLColAttribute called with attribute 'DisplaySize' for column '1' reported 50.
DEBUG - Relational type of column 0: Varchar { length: Some(50) }
INFO - Column '1'
Bytes used per row: 209
INFO - Total memory usage per row for single transit buffer: 209
DEBUG - Indicator Size:30064771077
DEBUG - Indicator Size as 32-bit mask:5
DEBUG - Indicator Size:-4294967296
DEBUG - Indicator Size as 32-bit mask:0
thread '<unnamed>' panicked at ...
```
`libdb2o.so`:

```
DEBUG - ODBC driver reported for column 0. Relational type: Varchar { length: Some(50) }; Nullability: Nullable; Name: '1';
DEBUG - SQLColAttribute called with attribute 'ConciseType' for column '1' reported 12.
DEBUG - SQLColAttribute called with attribute 'DisplaySize' for column '1' reported 50.
DEBUG - Relational type of column 0: Varchar { length: Some(50) }
INFO - Column '1'
Bytes used per row: 209
INFO - Total memory usage per row for single transit buffer: 209
DEBUG - Indicator Size:5
DEBUG - Indicator Size as 32-bit mask:5
DEBUG - Indicator Size:7
DEBUG - Indicator Size as 32-bit mask:7
DEBUG - Indicator Size:0
DEBUG - Indicator Size as 32-bit mask:0
DEBUG - Indicator Size:-1
DEBUG - Indicator Size as 32-bit mask:4294967295
DEBUG - StringArray
[
  "hello",
  "bonjour",
  "",
  null,
]
```
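The garbage values in the `libdb2.so` log are consistent with exactly this mismatch: two adjacent 32-bit indicators misread as one 64-bit integer on a little-endian system. A small sketch (illustration only, not driver code) reproduces the logged numbers:

```python
# Pack the four correct 32-bit indicators (5, 7, 0, -1 for "hello",
# "bonjour", "", NULL) the way a 32-bit driver would write them, then
# misread the buffer as two 64-bit values, as a 64-bit build would.
import struct

indicators_32 = [5, 7, 0, -1]
buffer = struct.pack("<4i", *indicators_32)  # what the driver wrote
misread = struct.unpack("<2q", buffer)       # what a 64-bit read sees
print(misread)                               # → (30064771077, -4294967296)
print([v & 0xFFFFFFFF for v in misread])     # low 32 bits → [5, 0]
```

These match the logged `Indicator Size` and `32-bit mask` values for `libdb2.so`, and explain why every second indicator (7 and -1) disappears.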
More docs on `libdb2.so` vs `libdb2o.so`:
https://www.ibm.com/docs/en/db2/11.5?topic=environment-setting-up-odbc-linux-unix
https://www.ibm.com/docs/en/db2/11.5?topic=unix-sample-configurations
> The odbcinst.ini file is in the bin subdirectory of the unixODBC driver manager installation path. The Driver entry for each database section must point to the location where the Db2 libraries are located. There are multiple library files that are associated with the Db2 ODBC driver. You must specify a Db2 library based on your application environment.
>
> - For a 32-bit unixODBC driver manager on the AIX® platform, you must specify the libdb2.a library for the Driver entry.
> - For a 32-bit unixODBC driver manager on Linux and UNIX platforms other than AIX, you must specify the libdb2.so library for the Driver entry.
> - For a 64-bit unixODBC driver manager on the AIX platform, you must specify the libdb2o.a library for the Driver entry.
> - For a 64-bit unixODBC driver manager on Linux and UNIX platforms other than AIX, you must specify the libdb2o.so library for the Driver entry.
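Per the quoted docs, a 64-bit unixODBC setup on Linux would point its Driver entry at `libdb2o.so`. A sketch of such an odbcinst.ini section; the install path is a placeholder, not a verified default:

```ini
; Hypothetical odbcinst.ini entry for a 64-bit unixODBC driver manager on
; Linux; adjust the path to wherever your Db2 libraries are installed.
[Db2]
Description = IBM Db2 ODBC driver (64-bit)
Driver = /opt/ibm/db2/V11.5/lib64/libdb2o.so
```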
@keraion Thanks a lot for the insight! This explains a lot. I would argue, though, that this is not a matter of interpretation: IMHO the 64-bit length is the right one and the others just messed up. @rupurt Would all your problems go away if you used `libdb2o.so` as the driver?

I need to reflect a bit on whether this allows me to come up with a fundamentally better workaround. This can bite in so many ways.
@pacman82 @keraion I can confirm that switching to `libdb2o.so` on Linux & `libdb2o.dylib` on macOS fixes this issue against both the LUW & mainframe versions of Db2. The flag you added for garbage data also seems to work (it doesn't raise an error anymore).

Sorry for all the trouble!
All good, we are all learning all the time.
Haha for sure. I've learned so much about low level database driver performance. Priceless :smile:
You've been amazing btw @pacman82. Really really appreciate it!
This is a new issue based on comments in https://github.com/pacman82/arrow-odbc-py/issues/68#issuecomment-1910894395
Given a Db2 table with a column that has empty strings
And the column does not allow `NULL`s
When I try to read the table into a `pyarrow` record batch
Then it should be mapped correctly
But it currently causes a panic

Table definition