pacman82 / arrow-odbc-py

Read Apache Arrow batches from ODBC data sources in Python
MIT License
60 stars 5 forks source link

Column Name UTF8 Error #70

Closed Narquadah closed 11 months ago

Narquadah commented 11 months ago

Hello,

While reading data from an MS SQL database, an error throws that can't be caught.

arrow-odbc-4.1.1/src/schema.rs:50:14:
Column name must be representable in utf8: FromUtf8Error { bytes: [66, 117, 99, 104, 117, 110, 103, 115, 115, 99, 104, 108, 252, 115, 115, 101, 108], error: Utf8Error { valid_up_to: 12, error_len: Some(1) } }
stack backtrace:
   0:        0x11a50c138 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h3c8b3da4c3ca3a14
   1:        0x11a4c95d0 - core::fmt::write::hfb70cbdb2260ac51
   2:        0x11a502454 - std::io::Write::write_fmt::hc13c5ba5d088bd95
   3:        0x11a50bedc - std::sys_common::backtrace::print::hb0798cc2b68a4b36
   4:        0x11a4fd454 - std::panicking::default_hook::{{closure}}::hb6b09a0c32b10ee5
   5:        0x11a4fe244 - std::panicking::rust_panic_with_hook::h491fddbcf07c6736
   6:        0x11a50c48c - std::panicking::begin_panic_handler::{{closure}}::ha2bc72305b00ceb6
   7:        0x11a50c3f8 - std::sys_common::backtrace::__rust_end_short_backtrace::hb9c89d964676cd3d
   8:        0x11a4fd8d4 - _rust_begin_unwind
   9:        0x11a515198 - core::panicking::panic_fmt::h5de4b603c189570c
  10:        0x11a515484 - core::result::unwrap_failed::h6c2edae44e6d47ca
  11:        0x11a4a75e8 - arrow_odbc::reader::odbc_reader::OdbcReaderBuilder::build::h36ee0dd824483534
  12:        0x11a4afa08 - _arrow_odbc_reader_make
  13:        0x19dada050 - <unknown>
  14:        0x19dae2adc - <unknown>
  15:        0x119d85850 - _cdata_call
  16:        0x103371de4 - __PyObject_MakeTpCall
  17:        0x103451704 - __PyEval_EvalFrameDefault
  18:        0x103448a8c - __PyEval_Vector
  19:        0x10337215c - __PyVectorcall_Call
  20:        0x103453240 - __PyEval_EvalFrameDefault
  21:        0x10338b1fc - _gen_send_ex2
  22:        0x103389d58 - _gen_iternext
  23:        0x10344b740 - __PyEval_EvalFrameDefault
  24:        0x10338b1fc - _gen_send_ex2
  25:        0x103389d58 - _gen_iternext
  26:        0x103450a68 - __PyEval_EvalFrameDefault
  27:        0x10338b1fc - _gen_send_ex2
  28:        0x103389d58 - _gen_iternext
  29:        0x10337f8d4 - _enum_next
  30:        0x103450a68 - __PyEval_EvalFrameDefault
  31:        0x10344895c - _PyEval_EvalCode
  32:        0x10344548c - _builtin_exec
  33:        0x1033bc66c - _cfunction_vectorcall_FASTCALL_KEYWORDS
  34:        0x103372284 - _PyObject_Vectorcall
  35:        0x103451704 - __PyEval_EvalFrameDefault
  36:        0x10344895c - _PyEval_EvalCode
  37:        0x10349bdfc - _run_mod
  38:        0x10349a2f0 - __PyRun_SimpleFileObject
  39:        0x103499d78 - __PyRun_AnyFileObject
  40:        0x1034b91c4 - _Py_RunMain
  41:        0x1034b9574 - _pymain_main
  42:        0x1034b9614 - _Py_BytesMain

It seems the column names contain German Umlaute (äöü) which shouldn't be a problem as they are UTF8 and other tools read them just fine.

Please let me know if I should open the issue in the upstream repo.

Any help would be appreciated!

Thank you!

pacman82 commented 11 months ago

Hello @Narquadah ,

thanks for opening the issue. It is fine to open the issue here, since this is the artifact you are using. Since there are different thing happening to the encoding depending on which platform you are on, I would be interested to learn if you are using windows, OS-X or Linux.

If you are using Linux, please check if your configured system locale indicates to use UTF-8.

We could also look at the byte sequence returned in the error:

66, 117, 99, 104, 117, 110, 103, 115, 115, 99, 104, 108, 252, 115, 115, 101, 108
B    u     c     h      u      n     g      s       s      c    h      l      ü       s       s       e       l

Problem seem (as you guessed correctly) the German ü. However it is encoded as 252. This is not a utf-8 encoding but could e.g. be an extended latin-1 ASCII page. It seems to me that the system is not configured to use a UTF-8 charset, but an extended ASCII suitable for the German region.

Best, Markus

Narquadah commented 11 months ago

Thanks for your quick reply. I am using a mac, and you are correct the encoding used by the database is Latin1_General_CI_AS.

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
pacman82 commented 11 months ago

You are welcome.

arrow-odbc only supports utf-8. Because utf-8 locals on windows are rare it would there use utf-16. To make your database both column names and payload work with arrow-odbc please change the database to use utf-8 encoding. Maybe you can specify something in the connection string?

pacman82 commented 11 months ago

To clarify, it does not matter what encoding the database uses internally to store stuff, but the encoding which matters is the client encoding. According to the ODBC standard that should be defined by the system local, but many drivers do their own thing and also allow this to be specified in the connection string. Anyhow, you need to get your database to return UTF-8 somehow. I currently do not have any plans to also support ASCII encoding.

Narquadah commented 11 months ago

Sadly, Microsoft does not offer such a solution. Would it be possible for me to catch the error? A normal try execpt does not seem to work.

pacman82 commented 11 months ago

It will be, but I must fix this first. Right now you have no way to catch the panic :-(.

I do not get it. ODBC should use the local specified in your system. I actually have MSSQL server in my test setup for many artefacts. Including some running under OS-X. Never ran into that error.

Alternatively I could try to compile an UTF-16 version for OS-X. This would work independent of specified encodings. Yet this would raise questions of how to distribute it. Or I would need to figure out a way to let the user choose at runtime.

pacman82 commented 11 months ago

Another workaround could be, to rename the column name in the query as in SELECT ... as MY_COLUMN_NAME_WITHOUT_UMLAUT ?

Narquadah commented 11 months ago

In this case, I am just filtering out the affected tables, as they are not actively used anymore. But as there are 800 something tables, it gets quite tedious. Thanks for the idea. :)

pacman82 commented 11 months ago

arrow-odbc 2.0.5 has been released. It features a bugfix, which would raise an exception, rather than panicking.

pacman82 commented 11 months ago

One last thought: Did you try changing the locale used by the python interpreter to something using UTF-8?

Stackoverflow question is for windows, but might just work on Linux:

https://stackoverflow.com/questions/955986/what-is-the-correct-way-to-set-pythons-locale-on-windows

pacman82 commented 11 months ago

Nothing actionable remains for me here. Closing.