Hello @Narquadah ,
thanks for opening the issue. It is fine to open it here, since this is the artifact you are using. Since different things happen to the encoding depending on which platform you are on, I would be interested to learn whether you are using Windows, OS-X or Linux.
If you are using Linux, please check if your configured system locale indicates to use UTF-8.
We could also look at the byte sequence returned in the error:
66, 117, 99, 104, 117, 110, 103, 115, 115, 99, 104, 108, 252, 115, 115, 101, 108
B u c h u n g s s c h l ü s s e l
The problem seems to be (as you guessed correctly) the German ü. However, it is encoded as 252. This is not a UTF-8 encoding, but could e.g. be Latin-1 or a similar extended ASCII code page. It seems to me that the system is not configured to use a UTF-8 charset, but an extended ASCII one suitable for the German region.
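To make the byte-level argument concrete, here is a minimal Python sketch using only the standard library. It decodes the byte values quoted from the error message; the variable names are mine and nothing here touches arrow-odbc itself.

```python
# Byte values copied from the error message quoted above.
raw = bytes([66, 117, 99, 104, 117, 110, 103, 115, 115, 99, 104, 108,
             252, 115, 115, 101, 108])

# Latin-1 maps every byte to exactly one character, so 252 (0xFC) becomes "ü".
as_latin1 = raw.decode("latin-1")
print(as_latin1)

# The same bytes are not valid UTF-8: 0xFC can never start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print("not valid UTF-8:", err)

# In UTF-8, "ü" takes two bytes instead of one.
utf8_bytes = list("ü".encode("utf-8"))
print(utf8_bytes)
```

So a client that expects UTF-8 would need to see the two bytes 195, 188 where the driver delivered the single byte 252.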
Best, Markus
Thanks for your quick reply.
I am using a Mac, and you are correct: the encoding used by the database is Latin1_General_CI_AS. My locale settings:
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
You are welcome.
arrow-odbc only supports UTF-8. Because UTF-8 locales are rare on Windows, it uses UTF-16 there. To make both the column names and the payload of your database work with arrow-odbc, please change the database to use UTF-8 encoding. Maybe you can specify something in the connection string?
To clarify: it does not matter which encoding the database uses internally for storage; what matters is the client encoding. According to the ODBC standard, that should be defined by the system locale, but many drivers do their own thing and also allow it to be specified in the connection string. Anyhow, you need to get your database to return UTF-8 somehow. I currently do not have any plans to also support extended ASCII encodings.
Sadly, Microsoft does not offer such a solution. Would it be possible for me to catch the error? A normal try/except does not seem to work.
It will be, but I must fix this first. Right now you have no way to catch the panic :-(.
I do not get it. ODBC should use the locale specified in your system. I actually have MS SQL Server in my test setup for many artifacts, including some running under OS-X, and I never ran into that error.
Alternatively, I could try to compile a UTF-16 version for OS-X. This would work independently of the specified encodings. Yet this would raise the question of how to distribute it. Or I would need to figure out a way to let the user choose at runtime.
Another workaround could be to rename the column in the query, as in SELECT ... AS MY_COLUMN_NAME_WITHOUT_UMLAUT?
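With hundreds of tables, writing such aliases by hand gets tedious, so here is a rough sketch of how they could be generated. Everything in it is an assumption for illustration: the helper names `ascii_alias` and `select_with_aliases`, the table name in the usage note, and the square-bracket quoting (SQL Server style, with no escaping of `]` inside names). I have not verified that a query text containing umlauts passes through the driver unchanged.

```python
import unicodedata

# Explicit replacements for German letters (illustrative, not exhaustive).
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue",
          "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def ascii_alias(name: str) -> str:
    """Return an ASCII-only stand-in for a column name."""
    replaced = "".join(GERMAN.get(ch, ch) for ch in name)
    # Strip remaining accents, then drop whatever is still non-ASCII.
    decomposed = unicodedata.normalize("NFKD", replaced)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def select_with_aliases(table: str, columns: list[str]) -> str:
    """Build a SELECT that renames every column to its ASCII alias."""
    parts = [f"[{col}] AS [{ascii_alias(col)}]" for col in columns]
    return f"SELECT {', '.join(parts)} FROM [{table}]"
```

For a hypothetical table `Buchungen`, `select_with_aliases("Buchungen", ["Buchungsschlüssel", "Betrag"])` yields a query whose result columns are named `Buchungsschluessel` and `Betrag`. Two different column names can collapse to the same alias, so a real script should check for duplicates.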
In this case, I am just filtering out the affected tables, as they are not actively used anymore. But as there are 800 something tables, it gets quite tedious. Thanks for the idea. :)
arrow-odbc 2.0.5 has been released. It features a bugfix: this situation now raises an exception rather than panicking.
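Once the failure surfaces as an ordinary Python exception, skipping the affected tables no longer has to be done by hand. A minimal sketch, assuming a caller-supplied function `read_table` (hypothetical, not part of arrow-odbc) that wraps whatever arrow-odbc call you use to read one table:

```python
from typing import Callable, Iterable


def read_all(tables: Iterable[str], read_table: Callable[[str], object]):
    """Read every table, collecting failures instead of stopping at the first."""
    loaded, failed = {}, {}
    for table in tables:
        try:
            loaded[table] = read_table(table)
        except Exception as err:
            # Per the release note above, 2.0.5 raises here instead of panicking.
            failed[table] = str(err)
    return loaded, failed
```

`failed` then maps each skipped table to its error message, which gives a list of the tables whose column names need attention. Catching the broad `Exception` is deliberate in this sketch; narrow it once you know which exception type the library raises.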
One last thought: did you try changing the locale used by the Python interpreter to something using UTF-8?
This Stack Overflow question is for Windows, but the approach might just work on Linux:
https://stackoverflow.com/questions/955986/what-is-the-correct-way-to-set-pythons-locale-on-windows
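For reference, a small sketch of what that could look like with the standard `locale` module. The helper name `try_set_locale` is mine, and `en_US.UTF-8` is only an example that must exist on the machine. Whether the ODBC driver actually picks up a locale set this way is an assumption I have not tested.

```python
import locale


def try_set_locale(name: str) -> bool:
    """Try to switch the process locale; report whether it worked."""
    try:
        locale.setlocale(locale.LC_ALL, name)
        return True
    except locale.Error:
        return False


# Call this before opening the ODBC connection.
if not try_set_locale("en_US.UTF-8"):
    # Calling setlocale without a locale argument only queries the current one.
    print("locale not available, keeping:", locale.setlocale(locale.LC_ALL))
print("preferred encoding:", locale.getpreferredencoding(False))
```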
Nothing actionable remains for me here. Closing.
Hello,
While reading data from an MS SQL database, an error is thrown that can't be caught.
It seems the column names contain German umlauts (äöü), which shouldn't be a problem, as they are UTF-8 and other tools read them just fine.
Please let me know if I should open the issue in the upstream repo.
Any help would be appreciated!
Thank you!