oracle / python-cx_Oracle

Python interface to Oracle Database now superseded by python-oracledb
https://oracle.github.io/python-cx_Oracle
Other
889 stars 361 forks source link

How to deal with encoding issues while fetching rows returned by callfunc method? #590

Closed iamdbychkov closed 3 years ago

iamdbychkov commented 3 years ago

Currently in a process of migration from python2.7 to python3.6 and cx_Oracle 7.3.0.

Today, while testing, I stumbled upon an UnicodeDecodeError while fetching data:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 1: invalid start byte

To fetch data from database I have to use cursor.callfunc method, which return type is cx_Oracle.CURSOR, and .fetchmany() to load in batches.

In my research I've found out that there are some invalid UTF-8 string in database.

The source of invalid strings is another application, which loads data from parsed files, which, in turn, may contain anything.

Since my application main purpose is to send fetched data over interner i've never been bothered by decoding bytes while app was running using python2.7.

But now it became a problem.

I've wanted to try to use outputtypehandler for VARCHAR column, but the return values of callfunc are not affected by this. I've also seen this article in documentation: https://cx-oracle.readthedocs.io/en/7.3/user_guide/globalization.html#character-set-example which says:

Because the ‘€’ symbol is not supported by the WE8ISO8859P1 character set, all ‘€’ characters are replaced by ‘¿’ in the cx_Oracle output

But for some reason, I've got no replacements, my guess it's also not an option for cursor.callfunc?

Client and database encoding is AMERICAN_AMERICA.AL32UTF8

How one can solve such issue? Is there a way to solve an issue on the client side, without enforcement of utf8 encoding for every string which enters database? (fixing all clients which can write to database)

anthony-tuininga commented 3 years ago

The only way that Oracle Database will return replacements for characters is when the database character set contains characters that cannot be represented in the client character set. Since your client character set is UTF-8, that will never happen because UTF-8 is capable of representing all characters (it is a universal character set).

What is your database character set? You can find out by running the query mentioned in the documentation. Note that if the client and database indicate they are using the same character set that the Oracle Database doesn't perform any checking of the data that is put into the database. This can allow invalid data to be stored in the database.

You can use the outputtypehandler option with cursor.callproc() and cursor.callfunc() calls that return a REF cursor. You can do this:

cursor = conn.cursor()
ref_cursor = conn.cursor()
ref_cursor.outputtypehandler = output_type_handler
cursor.callproc("some_procedure_with_ref_cursors", [ref_cursor])
rows = ref_cursor.fetchall()

You can also do this:

conn.outputtypehandler = output_type_handler
ref_cursor = conn.cursor()
cursor.callproc("some_procedure_with_ref_cursors", [ref_cursor])
rows = ref_cursor.fetchall()

If you put the output type handler on the connection, all cursors that were created by that connection will use it. Since you are returning a cursor, this is going to be the only approach you can use -- unless you use a variable and define it as the return type (more complex).

When building the variable you can use the encoding_errors option to tell Python what to do when it encounters corrupt data. See the documentation for an example.

Finally, if you use cx_Oracle 8.2 instead of cx_Oracle 7.3 you have another option available to you. When building the variable you can use the new parameter bypass_decode which will tell cx_Oracle to not bother decoding the text but return it directly to you as bytes -- the same as with Python 2.7. You can see an example in the documentation.

iamdbychkov commented 3 years ago

Thank you very much, Anthony!

Database and client's charsets are AL32UTF8.

Setting outputtypehandler on connection object indeed worked. bypass_decode is also an interesting option which I'm yet to try.

Besided solving some random person problem you also provided useful information about client-database encoding behaviour. And in general I can see you're a very active participant in responding to people's questions and resolving their issues. Role model. Thank you once again.

anthony-tuininga commented 3 years ago

You're very welcome and I'm happy I could be of help!