mkleehammer / pyodbc

Python ODBC bridge
https://github.com/mkleehammer/pyodbc/wiki
MIT No Attribution
2.93k stars 561 forks source link

Problem with utf-8 translation on fields longer than 4096 bytes (Netezza ODBC) #1022

Closed setinam closed 2 years ago

setinam commented 2 years ago

Environment

Issue

When extracting data from Netezza, we encountered a problem with large NVARCHAR fields (size 10000) that have content larger that 4096. We get an error: utf8 codec cant decode byte 0xc5 in position 4094 invalid continuation byte

position 4094 is repeated on multiple values in the field and also on different tables with similar size and content.

v-chojas commented 2 years ago

Can you post an ODBC trace?

setinam commented 2 years ago

Hope this part helps:

[ODBC][2788023][1643963658.727724][SQLFetch.c][162]
        Entry:
            Statement = 0x56282469b880
[ODBC][2788023][1643963658.727730][SQLFetch.c][352]
        Exit:[SQL_SUCCESS]
[ODBC][2788023][1643963658.727745][SQLGetData.c][237]
        Entry:
            Statement = 0x56282469b880
            Column Number = 1
            Target Type = 4 SQL_INTEGER
            Buffer Length = 4
            Target Value = 0x7fff129018f8
            StrLen Or Ind = 0x7fff12901900
[ODBC][2788023][1643963658.727754][SQLGetData.c][545]
        Exit:[SQL_SUCCESS]                
            Buffer = [<integer_value>]                
            Strlen Or Ind = 0x7fff12901900 -> 4
[ODBC][2788023][1643963658.727760][SQLGetData.c][237]
        Entry:
            Statement = 0x56282469b880
            Column Number = 2
            Target Type = 1 SQL_CHAR
            Buffer Length = 4096
            Target Value = 0x5628246c5410
            StrLen Or Ind = 0x7fff12901850
[ODBC][2788023][1643963658.727785][SQLGetData.c][545]
        Exit:[SQL_SUCCESS_WITH_INFO]                
            Buffer = [<PART OF VALUE>]                
            Strlen Or Ind = 0x7fff12901850 -> 4291
[ODBC][2788023][1643963658.727791][SQLGetData.c][237]
        Entry:
            Statement = 0x56282469b880
            Column Number = 2
            Target Type = 1 SQL_CHAR
            Buffer Length = 197
            Target Value = 0x5628246c640f
            StrLen Or Ind = 0x7fff12901850
[ODBC][2788023][1643963658.727797][SQLGetData.c][545]
        Exit:[SQL_SUCCESS]                
            Buffer = [<REST OF VALUE>]                
            Strlen Or Ind = 0x7fff12901850 -> 194
[ODBC][2788023][1643963658.731692][SQLFreeStmt.c][144]
        Entry:
            Statement = 0x56282469b880
            Option = 0
[ODBC][2788023][1643963658.731715][SQLFreeStmt.c][266]
        Exit:[SQL_SUCCESS]
[ODBC][2788023][1643963658.731723][SQLFreeHandle.c][387]
        Entry:
            Handle Type = 3
            Input Handle = 0x56282469b880
[ODBC][2788023][1643963658.731733][SQLFreeHandle.c][490]
        Exit:[SQL_SUCCESS]
[ODBC][2788023][1643963658.731740][SQLEndTran.c][417]
        Entry:                
            Connection = 0x562824670cc0                
            Completion Type = 1
[ODBC][2788023][1643963658.731750][SQLGetInfo.c][236]
        Entry:
            Connection = 0x562824670cc0
            Info Type = SQL_CURSOR_COMMIT_BEHAVIOR (23)
            Info Value = 0x562824672138
            Buffer Length = 2
            StrLen = 0x7fff12901b20
[ODBC][2788023][1643963658.731758][SQLGetInfo.c][236]
        Entry:
            Connection = 0x562824670cc0
            Info Type = SQL_CURSOR_ROLLBACK_BEHAVIOR (24)
            Info Value = 0x56282467213a
            Buffer Length = 2
            StrLen = 0x7fff12901b20
[ODBC][2788023][1643963658.731764][SQLEndTran.c][566]
        Exit:[SQL_SUCCESS]
[ODBC][2788023][1643963658.731769][SQLDisconnect.c][208]
        Entry:
            Connection = 0x562824670cc0
[ODBC][2788023][1643963658.731814][SQLDisconnect.c][379]
        Exit:[SQL_SUCCESS]
[ODBC][2788023][1643963658.731821][SQLFreeHandle.c][290]
        Entry:
            Handle Type = 2
            Input Handle = 0x562824670cc0
[ODBC][2788023][1643963658.731838][SQLFreeHandle.c][339]
        Exit:[SQL_SUCCESS]
gordthompson commented 2 years ago
[ODBC][2788023][1643963658.727760][SQLGetData.c][237]
        Entry:
            Statement = 0x56282469b880
            Column Number = 2
            Target Type = 1 SQL_CHAR
            Buffer Length = 4096
            Target Value = 0x5628246c5410
            StrLen Or Ind = 0x7fff12901850
[ODBC][2788023][1643963658.727785][SQLGetData.c][545]
        Exit:[SQL_SUCCESS_WITH_INFO]                
            Buffer = [<PART OF VALUE>]                
            Strlen Or Ind = 0x7fff12901850 -> 4291
[ODBC][2788023][1643963658.727791][SQLGetData.c][237]
        Entry:
            Statement = 0x56282469b880
            Column Number = 2
            Target Type = 1 SQL_CHAR
            Buffer Length = 197
            Target Value = 0x5628246c640f
            StrLen Or Ind = 0x7fff12901850
[ODBC][2788023][1643963658.727797][SQLGetData.c][545]
        Exit:[SQL_SUCCESS]                
            Buffer = [<REST OF VALUE>]                
            Strlen Or Ind = 0x7fff12901850 -> 194

The driver is returning the first 4096 bytes (characters, I presume) in response to the first SQLGetData call and then sending the remaining characters (~197 or so) in response to the second SQLGetData call but there is apparently some extra "junk" in there that does not correspond to proper UTF-8 encoding.

v-makouz commented 2 years ago

It seems like it cuts in the middle of a UTF-8 character and for some reason doesn't resume reading. Is it possible to isolate the issue by trying a different ODBC driver?

setinam commented 2 years ago

Closed by mistake. We have another database on the same box and will try to find a similar content.

setinam commented 2 years ago

@gordthompson the error is triggered in first part as it says: utf8 codec cant decode byte 0xc5 in position 4094 invalid continuation byte

we use: UnicodeTranslationOption=utf8 CharacterTranslationOption=all

in odbcinst.ini for Netezza as there is a problem with utf16 and conn.setdecoding(mod.SQL_CHAR, encoding='utf-8') conn.setdecoding(mod.SQL_WCHAR, encoding='utf-8') conn.setdecoding(mod.SQL_WMETADATA, encoding='utf-8') conn.setencoding(encoding='utf-8') on ODBC connection.

so I would lean to @v-makouz suggestion

setinam commented 2 years ago

Tried similar thing with MSSQL: [ODBC][3450941][1644589431.394384][SQLFetch.c][162] Entry: Statement = 0x55db85be18b0 [ODBC][3450941][1644589431.394395][SQLFetch.c][352] Exit:[SQL_SUCCESS] [ODBC][3450941][1644589431.394402][SQLGetData.c][237] Entry: Statement = 0x55db85be18b0 Column Number = 1 Target Type = -8 SQL_WCHAR Buffer Length = 4096 Target Value = 0x55db85bf6910 StrLen Or Ind = 0x7ffd008d89f0 [ODBC][3450941][1644589431.394443][SQLGetData.c][545] Exit:[SQL_SUCCESS_WITH_INFO]
Buffer =
Strlen Or Ind = 0x7ffd008d89f0 -> 18274 [ODBC][3450941][1644589431.394454][SQLGetData.c][237] Entry: Statement = 0x55db85be18b0 Column Number = 1 Target Type = -8 SQL_WCHAR Buffer Length = 14182 Target Value = 0x55db85bfac8e StrLen Or Ind = 0x7ffd008d89f0 [ODBC][3450941][1644589431.414401][SQLGetData.c][545] Exit:[SQL_SUCCESS]
Buffer =
Strlen Or Ind = 0x7ffd008d89f0 -> 14180 [ODBC][3450941][1644589431.414436][SQLGetData.c][237] Entry: Statement = 0x55db85be18b0 Column Number = 2 Target Type = -25 SQL_C_SBIGINT Buffer Length = 8 Target Value = 0x7ffd008d8a98 StrLen Or Ind = 0x7ffd008d8aa0 [ODBC][3450941][1644589431.414451][SQLGetData.c][545] Exit:[SQL_SUCCESS]
Buffer = [Data...]
Strlen Or Ind = 0x7ffd008d8aa0 -> 8 [ODBC][3450941][1644589431.414458][SQLFetch.c][162] Entry: Statement = 0x55db85be18b0 [ODBC][3450941][1644589431.414464][SQLFetch.c][352] Exit:[SQL_NO_DATA] [ODBC][3450941][1644589431.414738][SQLFetch.c][162] Entry: Statement = 0x55db85be18b0 [ODBC][3450941][1644589431.414746][SQLFetch.c][352] Exit:[SQL_NO_DATA] [ODBC][3450941][1644589431.417488][SQLFreeStmt.c][144] Entry: Statement = 0x55db85be18b0 Option = 0 [ODBC][3450941][1644589431.417509][SQLFreeStmt.c][266] Exit:[SQL_SUCCESS] [ODBC][3450941][1644589431.417524][SQLFreeHandle.c][387] Entry: Handle Type = 3 Input Handle = 0x55db85be18b0 [ODBC][3450941][1644589431.417539][SQLFreeHandle.c][490] Exit:[SQL_SUCCESS] [ODBC][3450941][1644589431.417546][SQLDisconnect.c][208] Entry: Connection = 0x55db85ac0ed0 [ODBC][3450941][1644589431.418363][SQLDisconnect.c][379] Exit:[SQL_SUCCESS] [ODBC][3450941][1644589431.418383][SQLFreeHandle.c][290] Entry: Handle Type = 2 Input Handle = 0x55db85ac0ed0 [ODBC][3450941][1644589431.418393][SQLFreeHandle.c][339] Exit:[SQL_SUCCESS]

setinam commented 2 years ago

On Netezza we tested and if field is split into multiple columns of length 4000 in query the error is not triggered

gordthompson commented 2 years ago

Can you retrieve a long column value from Netezza if you connect to the database using the isql utility?

setinam commented 2 years ago

isql does not trigger the error. Full data is retrieved

v-chojas commented 2 years ago

According to the code in src/getdata.cpp pyODBC doesn't do encoding conversion until after the entire cell is fetched, so this seems like a bug in the driver that's causing it to return corrupted data.