pacman82 / arrow-odbc-py

Read Apache Arrow batches from ODBC data sources in Python
MIT License

Strange issue with strange characters on linux #86

Open aersam opened 8 months ago

aersam commented 8 months ago

With this code:

from arrow_odbc import read_arrow_batches_from_odbc

reader = read_arrow_batches_from_odbc(
    query=sql,
    connection_string=self.connection_string,
    max_binary_size=20000,
    max_text_size=20000,
)
print(sql)
print(reader.schema)

I get this output:


SELECT [User - iD] AS [User_-_iD], [FirstName] AS [FirstName], [LastName] AS [LastName], [Age] AS [Age], [companyid] AS [companyid], CAST([time stämp] AS BIGINT) AS [time_stämp], CAST(GETUTCDATE() AS datetime2(6)) AS __timestamp, CAST(0 AS BIT) AS __is_deleted, CAST(1 AS BIT) AS __is_full_load FROM [dbo].[user]

User_-_iD: int64 not null
FirstName: string
LastName: string
Age: decimal128(15, 3)
companyid: string not null
time_stäm: int64
__timestamp: timestamp[us]
__is_deleted: bool
__is_full_load: bool

Please note the name of the [time_stämp] column in the schema: it is reported as time_stäm, with the final p missing.

aersam commented 8 months ago

Only happens on Linux. Could also be a bug with MSFT's ODBC Driver, but I guess MSFT does not do bugs 😉

pacman82 commented 8 months ago

This does not happen on Windows, because on Windows arrow-odbc-py is compiled to use UTF-16 encoding to talk to the database. On Linux and OS X, however, the system locale is usually UTF-8, and therefore arrow-odbc-py assumes the narrow encoding returned from the database is UTF-8. If your system locale is configured not to be UTF-8, this assumption is wrong.

So my first piece of advice would be to check whether your system locale is UTF-8. Other narrow (8-bit) encodings are currently not supported by arrow-odbc-py.
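For reference, a quick way to inspect the locale-derived encoding from Python (a minimal sketch; `locale.getpreferredencoding` reflects the encoding most narrow-string code on the system will assume):

```python
import locale

# Encoding derived from the environment (LANG / LC_ALL).
# On a correctly configured Linux system this should be "UTF-8".
encoding = locale.getpreferredencoding(False)
print(encoding)

if encoding.upper().replace("-", "") != "UTF8":
    print("warning: non-UTF-8 locale; narrow ODBC strings may be misdecoded")
```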

aersam commented 8 months ago

Ok, I'll try. For completeness, here's the link to my failing GH action: https://github.com/bmsuisse/odbc2deltalake/actions/runs/8538422686

pacman82 commented 8 months ago

Looking at this, it is more likely that binary and character sizes are somehow being confused, though I do not yet know where. The system locale is likely innocent. Would you kindly enable debug logging and run it again? At the beginning it should log the database column names and their types, even before conversion to Arrow. I would be interested in seeing that.

pacman82 commented 8 months ago

You can log to standard error using: https://arrow-odbc.readthedocs.io/en/latest/arrow_odbc.html#arrow_odbc.log_to_stderr

aersam commented 8 months ago

Here we go:

DEBUG SQLAllocHandle allocated connection (Dbc) handle '0x55c451f30050'
WARN State: 01S00, Native error: 0, Message: [Microsoft][ODBC Driver 18 for SQL Server]Invalid connection string attribute
WARN State: 01000, Native error: 5701, Message: [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Changed database context to 'db_to_delta_test'.
WARN State: 01000, Native error: 5703, Message: [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Changed language setting to us_english.
DEBUG Database managment system name as reported by ODBC: Microsoft SQL Server
DEBUG ODBC driver reported for column 0. Relational type: BigInt; Nullability: NoNulls; Name: 'User_-_iD';
DEBUG ODBC driver reported for column 1. Relational type: Varchar { length: Some(100) }; Nullability: Nullable; Name: 'FirstName';
DEBUG ODBC driver reported for column 2. Relational type: WVarchar { length: None }; Nullability: Nullable; Name: 'LastName';
DEBUG ODBC driver reported for column 3. Relational type: Decimal { precision: 15, scale: 3 }; Nullability: Nullable; Name: 'Age';
DEBUG ODBC driver reported for column 4. Relational type: Varchar { length: Some(10) }; Nullability: NoNulls; Name: 'companyid';
DEBUG ODBC driver reported for column 5. Relational type: BigInt; Nullability: Nullable; Name: 'time_stäm';
DEBUG ODBC driver reported for column 6. Relational type: Timestamp { precision: 6 }; Nullability: Nullable; Name: '__timestamp';
DEBUG ODBC driver reported for column 7. Relational type: Bit; Nullability: Nullable; Name: '__is_deleted';
DEBUG ODBC driver reported for column 8. Relational type: Bit; Nullability: Nullable; Name: '__is_full_load';
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '2' reported 12.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '2' reported 100.
DEBUG Relational type of column 1: Varchar { length: Some(100) }
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '3' reported -9.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '3' reported 0.
DEBUG Relational type of column 2: WVarchar { length: None }
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '3' reported 0.
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '5' reported 12.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '5' reported 10.
DEBUG Relational type of column 4: Varchar { length: Some(10) }
INFO Column 'User_-_iD'
Bytes used per row: 8
INFO Column 'FirstName'
Bytes used per row: 409
INFO Column 'LastName'
Bytes used per row: 20009
INFO Column 'Age'
Bytes used per row: 26
INFO Column 'companyid'
Bytes used per row: 49
INFO Column 'time_stäm'
Bytes used per row: 16
INFO Column '__timestamp'
Bytes used per row: 24
INFO Column '__is_deleted'
Bytes used per row: 9
INFO Column '__is_full_load'
Bytes used per row: 9
INFO Total memory usage per row for single transit buffer: 20559
SELECT [User - iD] AS [User_-_iD], [FirstName] AS [FirstName], [LastName] AS [LastName], [Age] AS [Age], [companyid] AS [companyid], CAST([time stämp] AS BIGINT) AS [time_stämp], CAST(GETUTCDATE() AS datetime2(6)) AS __timestamp, CAST(0 AS BIT) AS __is_deleted, CAST(1 AS BIT) AS __is_full_load FROM [dbo].[user]
User_-_iD: int64 not null
FirstName: string
LastName: string
Age: decimal128(15, 3)
companyid: string not null
time_stäm: int64
__timestamp: timestamp[us]
__is_deleted: bool
__is_full_load: bool
DEBUG SQLFreeHandle dropped 0x55c451ef1b20 of type Stmt.
DEBUG SQLFreeHandle dropped 0x55c451f30050 of type Dbc.
pacman82 commented 7 months ago

I could reproduce the bug.

Only happens on Linux. Could also be a bug with MSFT's ODBC Driver, but I guess MSFT does not do bugs 😉

Actually, MSFT has one of the more solid ODBC drivers, I would say. I also admire many of the things they have achieved in terms of engineering. Yet this time, I think it is on them. It is not a configuration issue of the client locale. The umlaut ä is rendered correctly, yet the string is one letter short. My guess is that this has to do with the fact that ä is two bytes in UTF-8, yet the text length is returned in characters, not bytes. Anyhow, this bug needs to be fixed by Microsoft.
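The suspected truncation can be simulated in pure Python (a sketch of the hypothesized driver behavior, not arrow-odbc-py code): the driver reports the name length in characters, but then copies only that many bytes of the UTF-8 representation.

```python
name = "time_stämp"            # 10 characters
raw = name.encode("utf-8")     # 11 bytes: "ä" occupies two

# Suspected bug: only len(name) *bytes* are copied into the buffer,
# although that number was counted in *characters*.
copied = raw[: len(name)]      # first 10 bytes -> the final "p" is cut off
print(copied.decode("utf-8"))  # -> "time_stäm", exactly as in the schema above
```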

The Windows version of arrow-odbc-py is not affected, because on Windows I chose to use the wide UTF-16 version of the ODBC interface. I am happy to help writing and detailing the issue (though I also suspect that a knowledgeable MS employee would be able to reproduce it from this thread alone), but I cannot figure out how to report a bug to MS.

Best, Markus

pacman82 commented 7 months ago

As a workaround, I'll try supplying larger buffers than the drivers request ...

pacman82 commented 7 months ago

To be fair to Microsoft, the function in question, SQLDescribeCol, is documented to return the length in characters, not bytes: https://learn.microsoft.com/en-us/sql/odbc/reference/syntax/sqldescribecol-function?view=sql-server-ver16. Yet it is clearly intended to be used for determining buffer lengths (usually expressed in bytes). The driver may be "fine" and the ODBC standard broken here. Anyhow, it seems in the narrow case I'll start multiplying its values by 3, and in the wide case by 2; this should put it on the safe side even in worst-case scenarios. I need to fix this upstream in odbc-api, though.
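The over-allocation idea can be sketched in Python (an illustration of the planned workaround, not the actual odbc-api code; the function names are made up). The factor 3 covers the narrow case, since any BMP character needs at most three UTF-8 bytes; the factor 2 covers the wide case, where each reported UTF-16 code unit is two bytes.

```python
def narrow_buffer_bytes(reported_chars: int) -> int:
    # Narrow (UTF-8) case: a character reported by the driver may need
    # up to 3 bytes in UTF-8, so over-allocate by factor 3,
    # plus one byte for the terminating NUL.
    return reported_chars * 3 + 1

def wide_buffer_bytes(reported_chars: int) -> int:
    # Wide (UTF-16) case: each code unit is 2 bytes,
    # plus two bytes for the terminating NUL.
    return reported_chars * 2 + 2

# "time_stämp" is reported as 10 characters; 11 UTF-8 bytes are needed,
# and the over-allocated buffer of 31 bytes comfortably holds them.
print(narrow_buffer_bytes(10))  # -> 31
```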

aersam commented 7 months ago

Hm, interesting. Do you think I could reproduce this behavior if I wrote it in C#/.NET Core using MSFT-only libs?

pacman82 commented 7 months ago

Maybe not; it depends on whether ADO relies on ODBC, and also on how wasteful the authors of those libs are with memory. I could avoid a lot of trouble if I just allocated a 4 kB buffer for each column name you want to know. In that case the name would need to be very long and contain special characters.

A minimal reproducing example in plain C, maybe based on an example for SQLDescribeCol, would be feasible though. Just watch the string length output argument for a column whose name differs in byte and character length.

pacman82 commented 7 months ago

NameLengthPtr [Output] Pointer to a buffer in which to return the total number of characters (excluding the null termination) available to return in ColumnName. If the number of characters available to return is greater than or equal to BufferLength, the column name in ColumnName is truncated to BufferLength minus the length of a null-termination character.

This formulation alone in the documentation is problematic.

pacman82 commented 7 months ago

It confuses binary buffer length and character length. Very likely related to the bug in question.

pacman82 commented 7 months ago

Definitely this. If you want to see it fail under Windows too, you can use a character in the column name which is 4 bytes in UTF-16 instead of 2, e.g. 𐐏. MariaDB has the same bug; PostgreSQL is fine, though.
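The arithmetic for 𐐏 can be confirmed from Python: it is a single character outside the Basic Multilingual Plane, so it occupies two UTF-16 code units (four bytes), and a length reported in characters under-allocates even on the wide UTF-16 path.

```python
ch = "𐐏"  # the non-BMP character from the comment above

print(len(ch))                      # 1 character
print(len(ch.encode("utf-16-le")))  # 4 bytes = 2 UTF-16 code units
print(len(ch.encode("utf-8")))      # 4 bytes in UTF-8 as well
```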

pacman82 commented 7 months ago

Sadly, allocating larger buffers is not a suitable workaround. The drivers copy only as many bytes as the reported character length into the application-provided buffers. The "good" news is that this makes the bug more relevant to fix.

Sorry, I cannot help here. I really tried.

aersam commented 7 months ago

Thank you for all your efforts! I'll try creating a repo in C# and getting this to the correct guys at Redmond - it's an interesting bug, after all 🙂

pacman82 commented 7 months ago

Probably not even needed; simply stating that SQLDescribeCol truncates column names whose binary length exceeds their character length is a precise description, which should enable them to reproduce it.

pacman82 commented 7 months ago

Remark: Using SQLColAttribute instead of SQLDescribeCol might work.