oracle / python-oracledb

Python driver for Oracle Database conforming to the Python DB API 2.0 specification. This is the renamed, new major release of cx_Oracle
https://oracle.github.io/python-oracledb
Other
308 stars 61 forks source link

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 0: invalid start byte #209

Closed shaik-jafar6 closed 9 months ago

shaik-jafar6 commented 11 months ago

I trying to query the table from oracle database. My table contains some chines language and german language characters in the data. when I am trying to read the data using python-oracledb i am getting below error.

Error: File "C:\Users\J\PycharmProjects\pythonProject\Static_Data.py", line 57, in data = cursor.fetchall() ^^^^^^^^^^^^^^^^^ File "C:\Users\J\AppData\Roaming\Python\Python311\site-packages\oracledb\cursor.py", line 457, in fetchall row = fetch_next_row(self) ^^^^^^^^^^^^^^^^^^^^ File "src\oracledb\impl/base/cursor.pyx", line 399, in oracledb.base_impl.BaseCursorImpl.fetch_next_row File "src\oracledb\impl/base/cursor.pyx", line 210, in oracledb.base_impl.BaseCursorImpl._create_row File "src\oracledb\impl/thick/var.pyx", line 134, in oracledb.thick_impl.ThickVarImpl._get_scalar_value File "src\oracledb\impl/thick/var.pyx", line 272, in oracledb.thick_impl.ThickVarImpl._transform_element_to_python File "src\oracledb\impl/thick/utils.pyx", line 284, in oracledb.thick_impl._convert_to_python UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 0: invalid start byte

Code: import oracledb oracledb.init_oracle_client() conn = oracledb.connect(user="####", password="####", dsn="####") cursor = conn.cursor() cursor.execute("SELECT * FROM Table")

data = cursor.fetchall()

data = [row.decode('utf-8'') for row in data] df = pd.DataFrame(data) print(df.head())

cursor.close() conn.close()

cjbj commented 11 months ago

Check https://python-oracledb.readthedocs.io/en/latest/user_guide/sql_execution.html#querying-corrupt-data

shaik-jafar6 commented 11 months ago

Check https://python-oracledb.readthedocs.io/en/latest/user_guide/sql_execution.html#querying-corrupt-data

Thank you for this. The issue still persists after adding the output handler because below is my data sample. Chinese characters are retrieving but there are some special characters in the name

FIRST_NAME 佳濱 �AIRUI �

I ran few queries in db below are the values NLS_CHARACTERSET = AL32UTF8 NLS_NCHAR_CHARACTERSET = AL16UTF16

anthony-tuininga commented 11 months ago

The problem is that the Chinese characters in your database are not stored in the UTF-8 encoding (in Oracle parlance that is AL32UTF8). How did they get inserted? If you use python-oracledb to insert the data it will then be able to retrieved properly, too. Try setting the environment variable NLS_LANG as follows:

set NLS_LANG=.AL32UTF8

Then run SQL*Plus and run that query. You should see similar "bad" results. Whoever is inserting the data is not inserting them correctly. You will need to find out what encoding they are actually using and then decode the data that way. You can use this approach to bypass the decode and do it yourself. Once the corrupted data is fixed and the source of the corrupt data is also fixed you can fetch normally again.

shaik-jafar6 commented 11 months ago

@anthony-tuininga The issue is not with chinese character i guess. It's due to some special character. To set the context I am querying data from oracle using script and writing it into a s3 bucket. Please guide me how to handle it from the code side to get as it is from database and dump into s3 as a file. As you can see in the below image i am able to read the data in the first line without any issues and the issue is with the row 2 and 3 which has square box character.

data
anthony-tuininga commented 11 months ago

You need to find out first how the data is being inserted into the database. If the environment variable NLS_LANG is set to .AL32UTF8 the database doesn't check the data that is beng inserted but accepts it as is. So if the client isn't actually encoding in UTF-8 but in some other encoding you will see the issues you are seeing. The solution is to find out what that encoding is so you can use the bypass_encoding parameter and then decode it yourself using the encoding it was actually encoded with! You can also fix the corrupted data in the database (once you fix the source of the corruption).

shaik-jafar6 commented 11 months ago

The data is being inserted into oracle from SAP application. I think the encoding set at database level is .AL32UTF8. so did tried the below code suggested by you but it still raises the same error.

`os.environ["NLS_LANG"] = ".AL32UTF8" oracledb.init_oracle_client() conn = oracledb.connect(user="*", password="", dsn="****") cursor = conn.cursor()

define output type handler

def return_strings_as_bytes(cursor, name, default_type, size, precision, scale): if default_type == oracledb.DB_TYPE_VARCHAR: return cursor.var(str, arraysize=cursor.arraysize, bypass_decode=True)

set output type handler on cursor before fetching data

with conn.cursor() as cursor: cursor.outputtypehandler = return_strings_as_bytes cursor.execute("SELECT first_name,last_name FROM ***.NAMAD ORDER BY FIRST_NAME DESC FETCH FIRST 3 ROWS ONLY") data = cursor.fetchall() print(data) print(data[0][1].decode("UTF-16"))`

Print data is the below one.
[(b'\xe4\xbd\xb3\xe6\xbf\xb1', b'\xe6\xb1\xa4'), (b'\xbfAIRUI', b'SONG'), (b'\xbf', b'CUI')]
anthony-tuininga commented 11 months ago

Setting the value of NLS_LANG with python-oracledb will not change anything. I was talking of doing so before running SQL*Plus (which will pay attention to the character set in that environment variable). That will show the same issues you are seeing in python-oracledb.

The data that is initially returned is the raw data stored in the database. You need to find out what the SAP application is doing. What encoding is it using when storing the data in the database? That is the encoding you will need to use to decode. Clearly it is not UTF-16! I'm not familiar enough with Chinese characters to be able to provide a reasonable guess as to the encoding.

anthony-tuininga commented 9 months ago

Closing - no further information provided