pacman82 / arrow-odbc

Fill Apache Arrow record batches from an ODBC data source in Rust.
MIT License

Add 'narrow' Feature to arrow-odbc to Control Unicode Handling in odbc-api Dependency #109

Open yjhong79 opened 1 week ago

yjhong79 commented 1 week ago

When using arrow-odbc, we are unable to retrieve column names containing Unicode characters from Oracle databases if the narrow feature is enabled in the odbc-api crate (which arrow-odbc depends on). This is because the narrow feature in odbc-api uses UTF-8 encoding, but Oracle's ODBC driver expects UTF-16 encoding for proper Unicode support.

Please add a narrow feature flag to arrow-odbc that allows users to enable or disable the narrow feature in the odbc-api dependency. This will give users control over the encoding used and ensure that Unicode column names are retrieved correctly from databases like Oracle that require UTF-16 encoding.
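If such a flag were added, the forwarding itself would be a one-line Cargo feature. A minimal sketch of what the request amounts to (the `narrow` feature shown here is the proposal, not an existing arrow-odbc feature):

```toml
# Hypothetical addition to arrow-odbc's Cargo.toml: a "narrow" feature
# that simply forwards to the equivalent feature of the odbc-api dependency.
[features]
narrow = ["odbc-api/narrow"]
```

Downstream users could then opt in or out of the encoding behavior from their own Cargo.toml without patching either crate.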


pacman82 commented 1 week ago

Just to check: while narrow is active, the ODBC standard suggests that the client-side encoding is used. On Windows this is usually something like Latin-1. On Linux most drivers seem to ignore the standard and just use UTF-8. On the other hand, many Linux drivers struggle to implement UTF-16 correctly.

I wonder if this time the Oracle driver sticks to the standard. Could you test whether the system locale suggests using UTF-8? Is there something you can pass in the connection string to set the client encoding?

I am sympathetic to your cause though and will investigate how to best forward the decision to the user, while keeping the defaults intact.

yjhong79 commented 1 week ago

I have tested several SQL tools, including DBeaver, unixODBC's isql and iusql, Rust's odbc-api, C++'s nanoodbc, and others.

Based on my test results, it appears that Oracle does not perform character encoding conversion when creating tables or fetching column names that contain characters outside the ASCII range.

When I create a column named "컬럼" (Korean for "column") in DBeaver and execute SELECT "컬럼" FROM TEST_TABLE, it works correctly. However, other tools fail to recognize "컬럼". DBeaver uses JDBC, which works with UTF-16 encoding.

Conversely, if I create the column with UTF-8 encoding using other tools, DBeaver cannot recognize "컬럼".

The hexadecimal representation of the column name in each encoding is as follows:

- UTF-8: `ec bb ac eb 9f bc`
- UTF-16: `ceec b7fc`
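The two byte sequences above can be reproduced with plain Rust string conversions, with no ODBC involved; this is only a sanity check of the encodings:

```rust
fn main() {
    let name = "컬럼"; // Korean for "column"

    // UTF-8 byte sequence, as a narrow (SQLCHAR) buffer would carry it.
    let utf8: Vec<u8> = name.bytes().collect();
    assert_eq!(utf8, [0xec, 0xbb, 0xac, 0xeb, 0x9f, 0xbc]);

    // UTF-16 code units, as a wide (SQLWCHAR) buffer would carry them.
    let utf16: Vec<u16> = name.encode_utf16().collect();
    assert_eq!(utf16, [0xceec, 0xb7fc]);

    println!("utf8:  {:02x?}", utf8);
    println!("utf16: {:04x?}", utf16);
}
```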

In DBeaver, the execution results when creating a table vary depending on the feature used (screenshot not shown).

When creating a table with UTF-8 encoding using iusql, the execution results are as follows (screenshot not shown).

This error occurs only when retrieving column names via SQLDescribeCol. When Unicode data is included in table rows, it is handled correctly.

Since tools using JDBC are commonly used and they handle UTF-16-encoded column names (such as Korean ones), I believe it would be beneficial to be able to enable or disable the narrow feature in odbc-api so that such column names are retrieved correctly.

pacman82 commented 6 hours ago

Hello @yjhong79 ,

arrow-odbc 14.0.0 is released. It supports a wide feature flag, which can be used on non-Windows platforms to opt into the wide UTF-16 variants of the ODBC function calls.
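For reference, enabling the new flag from a downstream crate would look like this in Cargo.toml (version number taken from the release mentioned above):

```toml
[dependencies]
arrow-odbc = { version = "14.0.0", features = ["wide"] }
```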

Let me know if it works for you.

Best, Markus