pacman82 / arrow-odbc

Fill Apache Arrow record batches from an ODBC data source in Rust.
MIT License

Add 'narrow' Feature to arrow-odbc to Control Unicode Handling in odbc-api Dependency #109

Open yjhong79 opened 1 week ago

yjhong79 commented 1 week ago

When using arrow-odbc, we are unable to retrieve column names containing Unicode characters from Oracle databases if the narrow feature is enabled in the odbc-api crate (which arrow-odbc depends on). This is because the narrow feature in odbc-api uses UTF-8 encoding, but Oracle's ODBC driver expects UTF-16 encoding for proper Unicode support.

Please add a narrow feature flag to arrow-odbc that allows users to enable or disable the narrow feature in the odbc-api dependency. This will give users control over the encoding used and ensure that Unicode column names are retrieved correctly from databases like Oracle that require UTF-16 encoding.
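If such a flag were added, the forwarding itself would be a one-line Cargo feature. A minimal sketch of what the request amounts to (the `narrow` feature shown here is the proposal, not an existing arrow-odbc feature):

```toml
# Hypothetical addition to arrow-odbc's Cargo.toml: a "narrow" feature
# that simply forwards to the equivalent feature of the odbc-api dependency.
[features]
narrow = ["odbc-api/narrow"]
```

Downstream users could then opt in or out of the encoding behavior from their own Cargo.toml without patching either crate.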


pacman82 commented 1 week ago

Just to check: while narrow is active, the ODBC standard suggests that the client-side encoding is used. On Windows this is usually something like Latin-1. On Linux most drivers seem to ignore the standard and just use UTF-8. On the other hand, many Linux drivers struggle to implement UTF-16 correctly.

I wonder if this time the Oracle driver sticks to the standard. Could you test whether the system locale suggests using UTF-8? Is there something you can pass in the connection string to set the client encoding?

I am sympathetic to your cause though and will investigate how to best forward the decision to the user, while keeping the defaults intact.

yjhong79 commented 1 week ago

I have tested several SQL tools, including DBeaver, unixODBC's isql and iusql, Rust's odbc-api, C++'s nanoodbc, and others.

Based on my test results, it appears that Oracle does not perform character encoding conversion when creating tables or fetching column names that contain characters outside the ASCII range.

When I create a column named "컬럼" (Korean for "column") in DBeaver and execute SELECT "컬럼" FROM TEST_TABLE, it works correctly. However, other tools fail to recognize "컬럼". DBeaver uses JDBC, which works with UTF-16 encoding.

Conversely, if I create the column with UTF-8 encoding using other tools, DBeaver cannot recognize "컬럼".

The hexadecimal representation of the column name in each encoding is as follows:

- UTF-8: `ec bb ac eb 9f bc`
- UTF-16: `ceec b7fc`
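The two byte sequences above can be reproduced with plain Rust string conversions, with no ODBC involved; this is only a sanity check of the encodings:

```rust
fn main() {
    let name = "컬럼"; // Korean for "column"

    // UTF-8 byte sequence, as a narrow (SQLCHAR) buffer would carry it.
    let utf8: Vec<u8> = name.bytes().collect();
    assert_eq!(utf8, [0xec, 0xbb, 0xac, 0xeb, 0x9f, 0xbc]);

    // UTF-16 code units, as a wide (SQLWCHAR) buffer would carry them.
    let utf16: Vec<u16> = name.encode_utf16().collect();
    assert_eq!(utf16, [0xceec, 0xb7fc]);

    println!("utf8:  {:02x?}", utf8);
    println!("utf16: {:04x?}", utf16);
}
```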

In DBeaver, the execution results when creating a table vary depending on the feature used (screenshot not shown).

When creating a table with UTF-8 encoding using iusql, the execution results are as follows (screenshot not shown).

This error occurs only when retrieving column names via SQLDescribeCol. When Unicode data is included in table rows, it is handled correctly.

Since tools using JDBC are commonly used and they handle UTF-16-encoded column names (such as Korean ones), I believe it would be beneficial to be able to enable or disable the narrow feature in odbc-api so that such column names are retrieved correctly.

pacman82 commented 6 hours ago

Hello @yjhong79 ,

arrow-odbc 14.0.0 is released. It supports a wide feature flag, which can be used on non-Windows platforms to opt into the wide UTF-16 variants of the ODBC function calls.
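For reference, enabling the new flag from a downstream crate would look like this in Cargo.toml (version number taken from the release mentioned above):

```toml
[dependencies]
arrow-odbc = { version = "14.0.0", features = ["wide"] }
```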

Let me know if it works for you.

Best, Markus