pacman82 / arrow-odbc-py

Read Apache Arrow batches from ODBC data sources in Python
MIT License
57 stars 5 forks source link

Db2 VARCHAR column with empty string causes panic #78

Closed rupurt closed 9 months ago

rupurt commented 9 months ago

This is a new issue based on comments in https://github.com/pacman82/arrow-odbc-py/issues/68?notification_referrer_id=NT_kwDOAApjVbE4Nzc5NTYwNDI4OjY4MDc4OQ#issuecomment-1910894395

Given a Db2 table with a column that has empty strings And the column does not allow NULL's When I try to read the table into a pyarrow record batch Then it should be mapped correctly But it currently causes a panic

> evac odbc sql -c 'SELECT NUM FROM MYSCHEMA.MYTABLE' -v
DEBUG - ODBC Environment created.
DEBUG - SQLAllocHandle allocated connection (Dbc) handle '0x21976e0'
DEBUG - Database managment system name as reported by ODBC: DB2
DEBUG - ODBC driver reported for column 0. Relational type: Varchar { length: Some(20) }; Nullability: NoNulls; Name: 'NUM';
DEBUG - SQLColAttribute called with attribute 'ConciseType' for column '1' reported 12.
DEBUG - SQLColAttribute called with attribute 'DisplaySize' for column '1' reported 20.
DEBUG - Relational type of column 0: Varchar { length: Some(20) }
INFO - Column 'NUM'
Bytes used per row: 89
INFO - Total memory usage per row for single transit buffer: 89
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-odbc-6.1.0/src/reader/to_record_batch.rs:103:85:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Column 'NUM' is declared as non-nullable but contains null values")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted

Table definition

CREATE TABLE "MYSCHEMA"."MYTABLE" (
          "ID" INTEGER NOT NULL WITH DEFAULT  , 
          "NUM" VARCHAR(20 OCTETS) NOT NULL WITH DEFAULT  , 
          "DTE" DATE NOT NULL WITH DEFAULT '9999-09-09'
) ;
rupurt commented 9 months ago

I added polars and used pyodbc with polars.read_database(...) and interestingly it seems to handle the string columns correctly on both LUW & Mainframe Db2.

I can confirm there are no NULL's and no empty strings. They look like this 500 Sep 2023 - FWK3

pacman82 commented 9 months ago

What operating system are you executing arrow-odbc on? I would have expected the non-windows systems to emit the following warning:

Ignoring indicators, because we expect the ODBC driver of your database to return garbage memory. We can not distiguish between empty strings and NULL. Everything is empty.

I am seeing that the name of your Database system is now DB2. In another instance I thought it to be DB2/LINUX did something change?

One reason we could see this error, is that I only activate the db2 workarounds if the system identifies as DB2/LINUX. Just using DB2 avoids activating the workaround. It might very well be, that because of the broken indicators, some columns are now interpreted as NULL.

pacman82 commented 9 months ago

Found it: Here in this issue your database system name is different: https://github.com/pacman82/odbc-api/issues/398

Do you have any insight why that might be?

Other than that, probably little effort on my side to just activate the workarounds for both odbc2parquet and arrow-odbc whenever it starts with DB2.

rupurt commented 9 months ago

I'm executing arrow-odbc on Linux/x86_64. But the error is only happening on the mainframe z/OS version of Db2. I don't seem to have the problem with the LUW version of Db2 if that helps.

Let me know what other kind of meta information you need and I can provide it as long as it doesn't leak any sensitive information.

pacman82 commented 9 months ago

arrow-odbc 2.1.5 has been released. It would activate the db2 workaround for any database which name start with db2. This should make it work on platforms besides linux

pacman82 commented 9 months ago

Please tell me if this fixes your problem?

keraion commented 9 months ago

@pacman82 would it be possible to selectively enable or disable these quirks? While the libdb2.so driver has incorrect data with null fields, the libdb2o.so driver correctly indicates between empty strings and nulls.

pacman82 commented 9 months ago

I guess it would. However I wonder, if we could auto-detect this? What database managment system name is logged if using libdb2o.so? As I am not familiar with db2, what is the difference between these?

keraion commented 9 months ago

The problem with querying the database for the dbms name is that it depends on the what the database returns, not the driver.

The issue with checking for "DB2", "DB2 Linux", "DB2 AIX" or any other combination is that it is reported by the database, not the driver.

The closest thing I was able to produce auto-detection was calling SQLDrivers and SQLDataSources pulling the data from the connection string looking for either a "Driver" or "Driver64" line.

This also led to an issue I found with odbc_api::environment::Environment and the drivers function only returning the first attribute, but I can post more about that later when I have a moment to type all that out.

pacman82 commented 9 months ago

Thanks for the explanation. I did not realize it was solely the Database itself controlling that string. For me this means that I would actually walk back on auto detecting and provide users a way to specify these quirks themselves.

pacman82 commented 9 months ago

@keraion Could you enlighten me a bit why there is a libdb2.so and a libdb2o.so?

pacman82 commented 9 months ago

@pacman82 would it be possible to selectively enable or disable these quirks? While the libdb2.so driver has incorrect data with null fields, the libdb2o.so driver correctly indicates between empty strings and nulls.

arrow-odbc 3.0.0 it allows you to set the parameter driver_returns_memory_garbage_for_indicators explicitly. It won't try to autodetect.

Best, Markus

pacman82 commented 9 months ago

@rupurt Does setting this parameter explicitly to True solve this issue for you?

keraion commented 9 months ago

From what I can tell with libdb2.so, it is a 32-bit driver so it is trying to return everything with the 32-bit sizes, but with odbc-api correctly trying to read data in based on the system's "bitness" it's jumping over data and doubling up on types like isize.

Example query:

with t1 (col1) as (values 'hello', 'bonjour', '', NULL) select cast(col1 as varchar(50)) from t1;

libdb2.so:

DEBUG - ODBC driver reported for column 0. Relational type: Varchar { length: Some(50) }; Nullability: Nullable; Name: '1';
DEBUG - SQLColAttribute called with attribute 'ConciseType' for column '1' reported 12.
DEBUG - SQLColAttribute called with attribute 'DisplaySize' for column '1' reported 50.
DEBUG - Relational type of column 0: Varchar { length: Some(50) }
INFO - Column '1'
Bytes used per row: 209
INFO - Total memory usage per row for single transit buffer: 209
DEBUG - Indicator Size:30064771077
DEBUG - Indicator Size as 32-bit mask:5
DEBUG - Indicator Size:-4294967296
DEBUG - Indicator Size as 32-bit mask:0
thread '<unnamed>' panicked at ...

libdb2o.so:

DEBUG - ODBC driver reported for column 0. Relational type: Varchar { length: Some(50) }; Nullability: Nullable; Name: '1';
DEBUG - SQLColAttribute called with attribute 'ConciseType' for column '1' reported 12.
DEBUG - SQLColAttribute called with attribute 'DisplaySize' for column '1' reported 50.
DEBUG - Relational type of column 0: Varchar { length: Some(50) }
INFO - Column '1'
Bytes used per row: 209
INFO - Total memory usage per row for single transit buffer: 209
DEBUG - Indicator Size:5
DEBUG - Indicator Size as 32-bit mask:5
DEBUG - Indicator Size:7
DEBUG - Indicator Size as 32-bit mask:7
DEBUG - Indicator Size:0
DEBUG - Indicator Size as 32-bit mask:0
DEBUG - Indicator Size:-1
DEBUG - Indicator Size as 32-bit mask:4294967295
DEBUG - StringArray
[
  "hello",
  "bonjour",
  "",
  null,
]
keraion commented 9 months ago

More Docs on libdb2.so vs libdb2o.so: https://www.ibm.com/docs/en/db2/11.5?topic=environment-setting-up-odbc-linux-unix

    • With the introduction of the 64-bit development environment, there have been a number of inconsistencies among vendors regarding the interpretation of the sizes of certain parameters. For example, the 64-bit Microsoft ODBC Driver Manager treats SQLHANDLE and SQLLEN as both 64-bits in length, whereas Data Direct Connect and open source ODBC driver managers treat SQLHANDLE as 64-bit, but SQLLEN as 32-bit. The developer must therefore pay careful attention to which version of the Db2 driver is required. Specify the appropriate Db2 driver in the data source stanza, according to the following information:
      |Type of application | Db2 driver to specify| |-- | --| |32-bit CLI | libdb2.*| |32-bit ODBC Driver Manager | libdb2.*| |64-bit CLI | libdb2.*| |64-bit ODBC Driver Manager | libdb2o.* (db2o.o for AIX)|
      Note: The file extension of the Db2 driver to specify depends on the operating system. The extensions are as follows:
      • .a - AIX
      • .so - Linux, HP-IPF
      • .sl - HP-PA

https://www.ibm.com/docs/en/db2/11.5?topic=unix-sample-configurations

The odbcinst.ini file is in the bin subdirectory of the unixODBC driver manager installation path. The Driver entry for each database section must point to the location where Db2 libraries are located. There are multiple library files that are associated with Db2 ODBC driver. You must specify a Db2 library based on your application environment.

  • For a 32-bit unixODBC driver manager on the AIX® platform, you must specify the libdb2.a library for the Driver entry.
  • For a 32-bit unixODBC driver manager on the Linux and UNIX platform other than AIX, you must specify the libdb2.so library for the Driver entry.
  • For a 64-bit unixODBC driver manager on the AIX platform, you must specify the db2o.o library for the Driver entry.
  • For a 64-bit unixODBC driver manager on the Linux and UNIX platform other than AIX, you must specify the libdb2o.so library for the Driver entry.
pacman82 commented 9 months ago

@keraion Thanks a lot for the insight! This explains a lot. I would argue though that this is not a matter of interpretation. IMHO the 64Bit Len is the right one and the others just messed up. @rupurt Would all your problems just go away if you used libdb2o.so as a driver?

I need to reflect a bit whether or not this allows me to come up with a fundamental better workaround. This can bite in so many ways.

rupurt commented 9 months ago

@pacman82 @keraion I can confirm switching to libdb2o.so on Linux & libdb2o.dylib on Mac fixes this issue against the LUW & Mainframe versions of Db2. The flag you added for garbage data also seems to work (doesn't raise an error anymore).

Sorry for all the trouble!

pacman82 commented 9 months ago

All good, we are all learning all the time.

rupurt commented 9 months ago

Haha for sure. I've learned so much about low level database driver performance. Priceless :smile:

You've been amazing btw @pacman82. Really really appreciate it!