pacman82 / arrow-odbc-py

Read Apache Arrow batches from ODBC data sources in Python
MIT License

Add option to use 64 bits for all floats? (difference in behavior compared to pyodbc) #73

Closed: chriscomeau79 closed this issue 9 months ago

chriscomeau79 commented 10 months ago

I noticed I had some Float64 values coming through in Float32 precision and narrowed it down to the wrong precision/scale being reported by the ODBC driver.

With this test query:

import arrow_odbc
import pyarrow as pa

arrow_odbc.log_to_stderr(3)
...
query = """select 
cast(123456789.123456789 as decimal(38, 20)) as original_decimal, 
cast('double', cast(123456789.123456789 as decimal(38, 20))) as cast_to_double
"""

reader = arrow_odbc.read_arrow_batches_from_odbc(
    connection_string="DSN=...", 
    query=query
)
t = pa.Table.from_batches(reader, reader.schema)
display(t)

(it's not specific to the syntax used here; the same thing happens with other variants and with double-precision columns in tables)

I see these types in the arrow-odbc debug output:

DEBUG - ODBC Environment created.
DEBUG - SQLAllocHandle allocated connection (Dbc) handle '0x1fc1c1c9b00'
WARN - State: 01000, Native error: 0, Message: [Microsoft][ODBC Driver Manager] The driver doesn't support the version of ODBC behavior that the application requested (see SQLSetEnvAttr).
DEBUG - Database managment system name as reported by ODBC: PostgreSQL
DEBUG - ODBC driver reported for column 0. Relational type: Numeric { precision: 38, scale: 20 }; Nullability: Nullable; Name: 'original_decimal';
DEBUG - ODBC driver reported for column 1. Relational type: Float { precision: 17 }; Nullability: Nullable; Name: 'cast_to_double';
INFO - Column 'original_decimal'
Bytes used per row: 49
INFO - Column 'cast_to_double'
Bytes used per row: 12
INFO - Total memory usage per row for single transit buffer: 61

It looks like the precision indicated by their driver is wrong: for a double it should be more like 53 (the bit width of an IEEE 754 double's significand), not the 17 we're getting, which looks like a count of significant decimal digits instead.

I get this result, where the 123456790 comes from the Float32 conversion:

pyarrow.Table
original_decimal: decimal128(38, 20)
cast_to_double: float
----
original_decimal: [[123456789.12345678900000000000]]
cast_to_double: [[123456790]]

If I provide a schema, it works:

schema = pa.schema([
    ('original_decimal', pa.decimal128(38, 20)),
    ('cast_to_double', 'double'),
])
reader = arrow_odbc.read_arrow_batches_from_odbc(
    connection_string="DSN=...",
    query=query,
    schema=schema
)
t = pa.Table.from_batches(reader, schema)
display(t)

INFO - Column 'original_decimal'
Bytes used per row: 49
INFO - Column 'cast_to_double'
Bytes used per row: 16
INFO - Total memory usage per row for single transit buffer: 65

pyarrow.Table
original_decimal: decimal128(38, 20)
cast_to_double: double
----
original_decimal: [[123456789.12345678900000000000]]
cast_to_double: [[123456789.12345679]]

The surprising part here is that pyodbc works:

import pyodbc

with pyodbc.connect(...) as conn:
    with conn.cursor() as cursor:
        rs = cursor.execute(query).fetchall()
        for c in cursor.description:
            print(c)
        print(rs)

('original_decimal', <class 'decimal.Decimal'>, None, 38, 38, 20, True)
('cast_to_double', <class 'float'>, None, 17, 17, 0, True)
[(Decimal('123456789.12345678900000000000'), 123456789.12345679)]

This is happening with the Denodo ODBC driver, which Denodo says is based on the PostgreSQL ODBC driver 09.05. I suspect the same thing happens with other databases as well. I'm guessing this kind of issue is widespread enough that pyodbc simply ignores the float precision reported by ODBC and always uses 64 bits.

To fix: maybe add an option to always use 64 bits for floats as a workaround? (Changing the default to match pyodbc's behavior would also be an option, but that could break existing code and would be less efficient for drivers that report the precision correctly.)
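
For reference, the explicit-schema workaround above can also be derived from the inferred schema instead of being written out by hand. This is only a sketch: it opens the statement twice, so the query gets executed a second time when the reader is reopened.

import pyarrow as pa
import arrow_odbc

# First pass: see which types arrow-odbc infers from the driver's metadata.
probe = arrow_odbc.read_arrow_batches_from_odbc(
    connection_string="DSN=...", query=query
)
# Widen every float32 field to float64; everything else stays as inferred.
patched = pa.schema(
    [
        pa.field(f.name, pa.float64(), f.nullable) if pa.types.is_float32(f.type) else f
        for f in probe.schema
    ]
)
del probe  # drop the probe reader; some drivers dislike two open statements

# Second pass: re-run the query with the patched schema.
reader = arrow_odbc.read_arrow_batches_from_odbc(
    connection_string="DSN=...", query=query, schema=patched
)
t = pa.Table.from_batches(reader, patched)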

chriscomeau79 commented 10 months ago

I noticed this message is also logged during the initial connection, but couldn't find which version was being requested vs available. This is on Windows:

WARN - State: 01000, Native error: 0, Message: [Microsoft][ODBC Driver Manager] The driver doesn't support the version of ODBC behavior that the application requested (see SQLSetEnvAttr)

pacman82 commented 10 months ago

Hello @chriscomeau79 ,

thanks for the extensive description of the issue. I will need a bit of time to look into it. Maybe I can reproduce the issue with PostgreSQL, but I can make no statement as to when I will get to that. I am also thinking about the solution a bit. I have a piece of code which can automatically activate workarounds based on database names. Since the database identifies itself as PostgreSQL (maybe a copy-paste error in the Denodo ODBC driver?), I could make this happen automatically, provided the issue can be reproduced. Alternatively, I could give users the option to manipulate the schema in some way. Need to think about this. Thanks again for the extensive description.

Best, Markus

chriscomeau79 commented 10 months ago

No problem, I have workarounds for this one, so it can stay low priority or just be closed so the notes are here in case it comes up for anyone else with another database. I'd be very surprised if it's reproducible with Postgres rather than being something about Denodo's implementation.

pacman82 commented 10 months ago

So Postgres does report a precision of 0 for columns of type DOUBLE PRECISION; it has no plain DOUBLE type. FYI: I am not sure the precision of DOUBLE is defined in SQL at all. The ODBC documentation states that the ODBC DOUBLE data type is usually written as DOUBLE PRECISION in SQL.

Knowing that arrow-odbc cannot sensibly auto-detect and adapt to this behavior, I want the user to be able to provide a function that maps the inferred schema to a new one. That way users can rely on the auto-detection for generic applications and still apply workarounds where needed.
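
Such a mapping would only need to be a small piece of pyarrow code on the user side. Just to illustrate the idea (this is not the final API), for the query from the first post it could be as small as:

import pyarrow as pa

# Illustration only: take the schema arrow-odbc inferred and return the one to use.
def fix_schema(inferred: pa.Schema) -> pa.Schema:
    # Override the one column whose precision the driver misreports.
    idx = inferred.get_field_index("cast_to_double")
    return inferred.set(idx, pa.field("cast_to_double", pa.float64()))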

Best Markus

chriscomeau79 commented 10 months ago

Yup, I double-checked, and that cast('double', 1.0) style in the original post is just a shorthand Denodo added, so forget about 'double' there; it still ends up as double precision. Here's a simpler example that works in both Denodo and Postgres:

select cast(12345678.12345678 as double precision) as test_float64, 
cast(12345678.12345678 as real) as test_float32;

Log output for running that one:

DEBUG - ODBC Environment created.
DEBUG - SQLAllocHandle allocated connection (Dbc) handle '0x26171709470'
WARN - State: 01000, Native error: 0, Message: [Microsoft][ODBC Driver Manager] The driver doesn't support the version of ODBC behavior that the application requested (see SQLSetEnvAttr).
DEBUG - Database managment system name as reported by ODBC: PostgreSQL
DEBUG - ODBC driver reported for column 0. Relational type: Float { precision: 17 }; Nullability: Nullable; Name: 'test_float64';
DEBUG - ODBC driver reported for column 1. Relational type: Real; Nullability: Nullable; Name: 'test_float32';
INFO - Column 'test_float64'
Bytes used per row: 12
INFO - Column 'test_float32'
Bytes used per row: 12
INFO - Total memory usage per row for single transit buffer: 24

pacman82 commented 9 months ago

Hello @chriscomeau79, I just released arrow-odbc 4.0.0. It allows you to specify a callable for mapping Arrow schemas. You might be especially interested in the example demonstrated by this integration test:

https://github.com/pacman82/arrow-odbc-py/blob/fb7a7d9f6d22b723559854d76b84724756b5f601/tests/test_arrow_odbc.py#L694
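
Roughly, the idea is to pass a callable that receives the inferred PyArrow schema and returns the schema that should actually be used. The sketch below assumes this is the map_schema argument exercised in that test, so check the test for the exact name and signature.

import pyarrow as pa
import arrow_odbc

def force_float64(inferred: pa.Schema) -> pa.Schema:
    # Treat every reported float as 64 bit, regardless of the precision the driver claims.
    return pa.schema(
        [
            pa.field(f.name, pa.float64(), f.nullable) if pa.types.is_floating(f.type) else f
            for f in inferred
        ]
    )

reader = arrow_odbc.read_arrow_batches_from_odbc(
    connection_string="DSN=...",  # placeholder, as in the examples above
    query="select cast(12345678.12345678 as double precision) as test_float64",
    map_schema=force_float64,  # parameter name assumed; see the linked test
)
t = pa.Table.from_batches(reader, reader.schema)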

chriscomeau79 commented 9 months ago

Will give it a shot, thanks!