pacman82 / arrow-odbc-py

Read Apache Arrow batches from ODBC data sources in Python
MIT License

Strange issue with strange characters on linux #86

Open aersam opened 8 months ago

aersam commented 8 months ago

With this code:

from arrow_odbc import read_arrow_batches_from_odbc

reader = read_arrow_batches_from_odbc(
    query=sql,
    connection_string=self.connection_string,
    max_binary_size=20000,
    max_text_size=20000,
)
print(sql)
print(reader.schema)

I get this output:


SELECT [User - iD] AS [User_-_iD], [FirstName] AS [FirstName], [LastName] AS [LastName], [Age] AS [Age], [companyid] AS [companyid], CAST([time stämp] AS BIGINT) AS [time_stämp], CAST(GETUTCDATE() AS datetime2(6)) AS __timestamp, CAST(0 AS BIT) AS __is_deleted, CAST(1 AS BIT) AS __is_full_load FROM [dbo].[user]

User_-_iD: int64 not null
FirstName: string
LastName: string
Age: decimal128(15, 3)
companyid: string not null
time_stäm: int64
__timestamp: timestamp[us]
__is_deleted: bool
__is_full_load: bool

Please note the name of the [time_stämp] column in the schema: it is reported as time_stäm, with the final p missing.

aersam commented 8 months ago

Only happens on Linux. Could also be a bug with MSFT's ODBC Driver, but I guess MSFT does not do bugs 😉

pacman82 commented 8 months ago

This does not happen on Windows, because on Windows arrow-odbc-py is compiled to use UTF-16 encoding to talk to the database. On Linux and OS X, however, the system locale is usually UTF-8, and therefore arrow-odbc-py assumes the narrow encoding returned from the database is UTF-8. If your system locale is configured not to be UTF-8, this assumption is wrong.

So my first piece of advice would be to check whether your system locale is UTF-8. Other narrow (8-bit) encodings are currently not supported by arrow-odbc-py.
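For reference, a quick way to inspect the locale-derived encoding from Python (a minimal sketch; `locale.getpreferredencoding` reflects the encoding most narrow-string code on the system will assume):

```python
import locale

# Encoding derived from the environment (LANG / LC_ALL).
# On a correctly configured Linux system this should be "UTF-8".
encoding = locale.getpreferredencoding(False)
print(encoding)

if encoding.upper().replace("-", "") != "UTF8":
    print("warning: non-UTF-8 locale; narrow ODBC strings may be misdecoded")
```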

aersam commented 8 months ago

Ok, I'll try. For completeness, here's the link to my failing GH action: https://github.com/bmsuisse/odbc2deltalake/actions/runs/8538422686

pacman82 commented 8 months ago

Looking at this, it is more likely that binary and character sizes are somehow being confused, though I do not yet know where. The system locale is likely innocent. Would you kindly enable debug logging and run it again? At the beginning it should log the database column names and their types, even before conversion to Arrow. I would be interested in seeing that.

pacman82 commented 8 months ago

You can log to standard error using: https://arrow-odbc.readthedocs.io/en/latest/arrow_odbc.html#arrow_odbc.log_to_stderr

aersam commented 8 months ago

Here we go:

DEBUG SQLAllocHandle allocated connection (Dbc) handle '0x55c451f30050'
WARN State: 01S00, Native error: 0, Message: [Microsoft][ODBC Driver 18 for SQL Server]Invalid connection string attribute
WARN State: 01000, Native error: 5701, Message: [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Changed database context to 'db_to_delta_test'.
WARN State: 01000, Native error: 5703, Message: [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Changed language setting to us_english.
DEBUG Database managment system name as reported by ODBC: Microsoft SQL Server
DEBUG ODBC driver reported for column 0. Relational type: BigInt; Nullability: NoNulls; Name: 'User_-_iD';
DEBUG ODBC driver reported for column 1. Relational type: Varchar { length: Some(100) }; Nullability: Nullable; Name: 'FirstName';
DEBUG ODBC driver reported for column 2. Relational type: WVarchar { length: None }; Nullability: Nullable; Name: 'LastName';
DEBUG ODBC driver reported for column 3. Relational type: Decimal { precision: 15, scale: 3 }; Nullability: Nullable; Name: 'Age';
DEBUG ODBC driver reported for column 4. Relational type: Varchar { length: Some(10) }; Nullability: NoNulls; Name: 'companyid';
DEBUG ODBC driver reported for column 5. Relational type: BigInt; Nullability: Nullable; Name: 'time_stäm';
DEBUG ODBC driver reported for column 6. Relational type: Timestamp { precision: 6 }; Nullability: Nullable; Name: '__timestamp';
DEBUG ODBC driver reported for column 7. Relational type: Bit; Nullability: Nullable; Name: '__is_deleted';
DEBUG ODBC driver reported for column 8. Relational type: Bit; Nullability: Nullable; Name: '__is_full_load';
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '2' reported 12.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '2' reported 100.
DEBUG Relational type of column 1: Varchar { length: Some(100) }
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '3' reported -9.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '3' reported 0.
DEBUG Relational type of column 2: WVarchar { length: None }
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '3' reported 0.
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '5' reported 12.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '5' reported 10.
DEBUG Relational type of column 4: Varchar { length: Some(10) }
INFO Column 'User_-_iD'
Bytes used per row: 8
INFO Column 'FirstName'
Bytes used per row: 409
INFO Column 'LastName'
Bytes used per row: 20009
INFO Column 'Age'
Bytes used per row: 26
INFO Column 'companyid'
Bytes used per row: 49
INFO Column 'time_stäm'
Bytes used per row: 16
INFO Column '__timestamp'
Bytes used per row: 24
INFO Column '__is_deleted'
Bytes used per row: 9
INFO Column '__is_full_load'
Bytes used per row: 9
INFO Total memory usage per row for single transit buffer: 20559
SELECT [User - iD] AS [User_-_iD], [FirstName] AS [FirstName], [LastName] AS [LastName], [Age] AS [Age], [companyid] AS [companyid], CAST([time stämp] AS BIGINT) AS [time_stämp], CAST(GETUTCDATE() AS datetime2(6)) AS __timestamp, CAST(0 AS BIT) AS __is_deleted, CAST(1 AS BIT) AS __is_full_load FROM [dbo].[user]
User_-_iD: int64 not null
FirstName: string
LastName: string
Age: decimal128(15, 3)
companyid: string not null
time_stäm: int64
__timestamp: timestamp[us]
__is_deleted: bool
__is_full_load: bool
DEBUG SQLFreeHandle dropped 0x55c451ef1b20 of type Stmt.
DEBUG SQLFreeHandle dropped 0x55c451f30050 of type Dbc.
pacman82 commented 7 months ago

I could reproduce the bug.

Only happens on Linux. Could also be a bug with MSFT's ODBC Driver, but I guess MSFT does not do bugs 😉

Actually, MSFT has one of the more solid ODBC drivers, I would say. I also admire many of the things they have achieved in terms of engineering. Yet this time, I think it is on them. It is not a configuration issue of the client locale. The umlaut ä is rendered correctly, yet the string is one letter short. My guess is that this has to do with the fact that ä is two bytes in UTF-8, yet the text length is returned in characters, not bytes. Anyhow, this bug needs to be fixed by Microsoft.
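The suspected truncation can be simulated in pure Python (a sketch of the hypothesized driver behavior, not arrow-odbc-py code): the driver reports the name length in characters, but then copies only that many bytes of the UTF-8 representation.

```python
name = "time_stämp"            # 10 characters
raw = name.encode("utf-8")     # 11 bytes: "ä" occupies two

# Suspected bug: only len(name) *bytes* are copied into the buffer,
# although that number was counted in *characters*.
copied = raw[: len(name)]      # first 10 bytes -> the final "p" is cut off
print(copied.decode("utf-8"))  # -> "time_stäm", exactly as in the schema above
```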

The Windows version of arrow-odbc-py is not affected, because on Windows I chose to use the wide UTF-16 version of the ODBC interface. I am happy to help writing and detailing the issue (though I also suspect that a knowledgeable MS employee would be able to reproduce it from this thread alone), but I cannot figure out how to report a bug to MS.

Best, Markus

pacman82 commented 7 months ago

As a workaround, I'll try supplying larger buffers than the drivers request ...

pacman82 commented 7 months ago

To be fair to Microsoft, the function in question, SQLDescribeCol, is documented to return the length in characters, not bytes: https://learn.microsoft.com/en-us/sql/odbc/reference/syntax/sqldescribecol-function?view=sql-server-ver16. Yet it is clearly intended to be used for determining buffer lengths (usually expressed in bytes). The driver may be "fine" and the ODBC standard broken here. Anyhow, it seems in the narrow case I'll start multiplying its values by 3, and in the wide case by 2; this should put it on the safe side even in worst-case scenarios. I need to fix this upstream in odbc-api, though.
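The over-allocation idea can be sketched in Python (an illustration of the planned workaround, not the actual odbc-api code; the function names are made up). The factor 3 covers the narrow case, since any BMP character needs at most three UTF-8 bytes; the factor 2 covers the wide case, where each reported UTF-16 code unit is two bytes.

```python
def narrow_buffer_bytes(reported_chars: int) -> int:
    # Narrow (UTF-8) case: a character reported by the driver may need
    # up to 3 bytes in UTF-8, so over-allocate by factor 3,
    # plus one byte for the terminating NUL.
    return reported_chars * 3 + 1

def wide_buffer_bytes(reported_chars: int) -> int:
    # Wide (UTF-16) case: each code unit is 2 bytes,
    # plus two bytes for the terminating NUL.
    return reported_chars * 2 + 2

# "time_stämp" is reported as 10 characters; 11 UTF-8 bytes are needed,
# and the over-allocated buffer of 31 bytes comfortably holds them.
print(narrow_buffer_bytes(10))  # -> 31
```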

aersam commented 7 months ago

Hm, interesting. Do you think I could reproduce this behavior if I wrote it in C#/.NET Core using MSFT-only libs?

pacman82 commented 7 months ago

Maybe not; it depends on whether ADO relies on ODBC, and also on how wasteful the authors of those libs are with memory. I could avoid a lot of trouble if I just allocated a 4 kB buffer for each column name you want to know. In that case the name would need to be very long and contain special characters.

A minimal reproducing example in plain C, maybe based on an example for SQLDescribeCol, would be feasible though. Just watch the string length output argument for a column whose name differs in byte and character length.

pacman82 commented 7 months ago

NameLengthPtr [Output] Pointer to a buffer in which to return the total number of characters (excluding the null termination) available to return in ColumnName. If the number of characters available to return is greater than or equal to BufferLength, the column name in ColumnName is truncated to BufferLength minus the length of a null-termination character.

This formulation alone in the documentation is problematic.

pacman82 commented 7 months ago

It confuses binary buffer length and character length. Very likely related to the bug in question.

pacman82 commented 7 months ago

Definitely this. If you want to see it fail under Windows too, you can use a character in the column name which is 4 bytes in UTF-16 instead of 2, e.g. 𐐏. MariaDB has the same bug; PostgreSQL is fine, though.
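The arithmetic for 𐐏 can be confirmed from Python: it is a single character outside the Basic Multilingual Plane, so it occupies two UTF-16 code units (four bytes), and a length reported in characters under-allocates even on the wide UTF-16 path.

```python
ch = "𐐏"  # the non-BMP character from the comment above

print(len(ch))                      # 1 character
print(len(ch.encode("utf-16-le")))  # 4 bytes = 2 UTF-16 code units
print(len(ch.encode("utf-8")))      # 4 bytes in UTF-8 as well
```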

pacman82 commented 7 months ago

Sadly, allocating larger buffers is not a suitable workaround. The drivers copy only as many bytes as the reported character length into the application-provided buffers. The "good" news is that this makes the bug more relevant to fix.

Sorry, I cannot help here. I really tried.

aersam commented 7 months ago

Thank you for all your efforts! I'll try creating a repo in C# and getting this to the correct guys at Redmond - it's an interesting bug, after all 🙂

pacman82 commented 7 months ago

Probably not even needed; simply stating that SQLDescribeCol truncates column names whose binary length exceeds their character length is a precise description, which should enable them to reproduce it.

pacman82 commented 7 months ago

Remark: Using SQLColAttribute instead of SQLDescribeCol might work.