oracle / python-cx_Oracle

Python interface to Oracle Database now superseded by python-oracledb
https://oracle.github.io/python-cx_Oracle

Character garble problem #580

Closed Positever closed 3 years ago

Positever commented 3 years ago

Dear cx_Oracle developers, hello. Is there a function to distinguish between Chinese and English characters in a string? For example, "你好??" comes from the database, and the final output is all in Chinese: "你好？？"

Positever commented 3 years ago

encoding='gb18030'

cjbj commented 3 years ago

I don't understand what you want. Please properly complete the issue logging template questions.

Please share the SQL to create a (dummy) table with the data types you are querying. Include an INSERT statement to insert data. Also give a complete Python file that shows how you are connecting and querying the dummy table. Then explain what you are trying to achieve. Also give us the DB character set (see here).

Positever commented 3 years ago

Hello, my database character set is "ZHS16GBK". My Python file is roughly as follows:

    #!/usr/bin/python
    # -*- coding: gb18030 -*-

    ......
    con = cx_Oracle.connect(xxxx, xxxx, xxxx, encoding='gb18030')
    ......

Suppose the value of field A in the database is "你好??" (it contains the Chinese "hello" plus the English punctuation "??"). We want the content written to the result file to be the same as what is in the database, i.e. "你好??", but the content actually written to the file was "你好？？" (all in Chinese full-width).

cjbj commented 3 years ago

Please properly complete the issue logging template questions because information such as the version of cx_Oracle will let us help you better.

What version of Python are you using?

What's the column datatype (or datatypes)? Have you tried using encoding='UTF-8' (which is the default with cx_Oracle 8.0+)?

Positever commented 3 years ago

Dear developer, cjbj.

  1. Here's some information:

         platform.platform: Linux-3.10.0-957.el7.x86_64-with-glibc2.10
         sys.maxsize > 2**32: True
         platform.python_version: 3.8.5
         cx_Oracle.version: 7.3.0
         cx_Oracle.clientversion: (11, 2, 0, 4, 0)

  2. Briefly: my database character set is "ZHS16GBK" and my Python file is roughly as follows:

         #!/usr/bin/python
         # -*- coding: gb18030 -*-

         ......
         con = cx_Oracle.connect(xxxx, xxxx, xxxx, encoding='gb18030')
         ......

     Suppose the value of field A in the database is "你好??" (the Chinese "hello" plus the English punctuation "??"). We want the content written to the result file to match what is in the database, i.e. "你好??", but the content actually written to the file was "你好？？" (all in Chinese full-width).

  3. "Include a runnable Python script that shows the problem": for certain reasons I cannot provide a script that reproduces the problem at the moment; please understand, thank you!

Positever commented 3 years ago

Yes, I tried UTF-8, but UTF-8 is variable-length. However, I need GB18030 to convert only the Chinese part of the string, leaving the non-Chinese part unchanged.

swolicki commented 3 years ago

If I understand correctly, the issue you have is that English punctuation characters are changed from ASCII normal width to East Asian full-width: 0x3F for question mark becomes 0xA3BF for full-width question mark. Oracle Database does not normally do such conversion by itself unless instructed to do so through the use of the TO_MULTI_BYTE function. Without all relevant code of your application, it is difficult to find out where the problem might be. It can even exist in the GUI that you use to verify the written files.
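swolicki's byte values can be checked directly with Python's built-in codecs, independent of the database (a minimal illustration; no cx_Oracle involved):

```python
# ASCII question mark vs. East Asian full-width question mark in GB18030:
ascii_q = "?"           # U+003F
fullwidth_q = "\uff1f"  # '？', U+FF1F

print(ascii_q.encode("gb18030").hex())      # 3f
print(fullwidth_q.encode("gb18030").hex())  # a3bf

# So seeing A3 BF where 3F was expected means the character was changed
# to its full-width form (or replaced) somewhere along the way.
```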

Positever commented 3 years ago

Dear developer swolicki, thank you for the response. Today I found the cause: these problems are triggered by rare Han characters such as 𥖄 and 䶮. Please also enhance cx_Oracle's (or Python's) handling of such rare characters, thank you!

Positever commented 3 years ago

The hexadecimal value of the part where the problem arises is "96 39 B1 38".

swolicki commented 3 years ago

96 39 B1 38 is not a legal byte sequence in GBK. It is a GB18030 4-byte code. Therefore, you cannot store it in a ZHS16GBK database and expect it to be properly converted to any other character set, including GB18030. Oracle supports ZHS32GB18030 as a client-side character set only -- a ZHS32GB18030 database is not supported (even though you may be able to create one in older releases). Therefore, you need an AL32UTF8 database to store the problematic rare characters.
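This can be confirmed with Python's built-in codecs (a quick check; the character is the one from this thread):

```python
rare = "\U00025584"  # 𥖄, the rare Han character discussed in this thread

# GB18030 encodes it as the 4-byte sequence 96 39 B1 38:
assert rare.encode("gb18030") == bytes.fromhex("9639b138")

# But GBK (the repertoire of ZHS16GBK) has no code for it at all:
try:
    rare.encode("gbk")
except UnicodeEncodeError:
    print("not representable in GBK")
```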

Positever commented 3 years ago

Thank you very much for your reply, dear swolicki! I have experimented with a MySQL database using a UTF-8 character set via mysqlclient, and there is apparently no problem there. But when cx_Oracle is used to connect to a UTF-8 database, the problem still exists. What's wrong with cx_Oracle?

swolicki commented 3 years ago

First, you described a problem with ZHS16GBK and the gb18030 connection encoding. Now you are talking about a different problem with UTF-8, but without actual details. Please describe the problem in detail; we cannot help based on the information in your last post. UTF-8 uses a different encoding for the Han characters discussed.

Positever commented 3 years ago

It works well when the database character set is AL32UTF8; NVARCHAR2 can store a richer set of characters. I used these SQL statements: create table new_table(name1 nvarchar2(20)); insert into new_table values(n'𥖄'); The hexadecimal value in the GB18030 result file generated by cx_Oracle becomes 9639B138, which is what we were looking for. Thank you for your advice, dear swolicki!

cjbj commented 3 years ago

@Positever what exactly 'works well'?

If you post a complete script that connects and queries, we can give accurate answers to your questions.

Try upgrading cx_Oracle to 8, or setting the national character set connection parameter nencoding (not just the encoding parameter). The encoding and nencoding defaults changed in cx_Oracle 8.0; see point 5 in the release notes.

Read https://cx-oracle.readthedocs.io/en/latest/user_guide/globalization.html (or the cx_Oracle 7.3 version here).

I'd also strongly recommend upgrading your database since 11.2 is very old. It looks like you might have the XE edition. There is an 18c version available for download, and a 21c version will be released.

Positever commented 3 years ago

[screenshot] Dear developers, I would like to write the garbled characters into the file intact, without changing the character set of the ZHS16GBK database.

cjbj commented 3 years ago

@Positever please post a runnable example script. Thanks.

Positever commented 3 years ago

As we discussed before, the rare Han character '𥖄' looks like '??' in Oracle (ZHS16GBK) through PL/SQL Developer. However, if I click the cell and switch the tab to Hex, it shows the original data '96 39 B1 38', just like in the picture above.

As you know, my Python code using the cx_Oracle package can access the table in Oracle, but it returns 'A3 BF A3 BF' rather than '96 39 B1 38' (which we expect). We googled 'A3 BF': it is the hex value of the full-width Chinese question mark '？', which means the Python code doesn't read the original data; it sees the data as '？？'.

The question is: how can I access Oracle and get the original data '96 39 B1 38'?

I am uploading the python and sql scripts.

Thanks.

swolicki commented 3 years ago

As I said, 96 39 B1 38 is not a valid code in ZHS16GBK. When you query the data in GB18030 or UTF-8, the database does not know how to convert these codes properly. You may try encoding='gb18030', but I doubt it will help. You are storing GB18030 data mislabeled as GBK. As a character set conversion is required when running Python (which uses Unicode internally), the mislabeled data cannot be converted correctly, and you see what you see. PL/SQL Developer uses OCI in a pass-through configuration, so you may see the data correctly in the Hex tab. But this is cheating, not a correct configuration.
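The mislabeling effect described above can be reproduced with plain Python codecs (a sketch, no database required): the stored byte sequence is valid GB18030 but not valid GBK.

```python
raw = bytes.fromhex("9639b138")  # what is physically stored in the column

# Labeled correctly as GB18030, the bytes decode to the rare character:
print(raw.decode("gb18030"))  # 𥖄

# Labeled as GBK (which is what ZHS16GBK claims), decoding fails,
# so any conversion step has to substitute a replacement character:
try:
    raw.decode("gbk")
except UnicodeDecodeError:
    print("invalid GBK sequence")
```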

If you need to get the data into a text file untouched, set NLS_LANG=.ZHS16GBK (pass-through configuration) and spool the data to a file in SQL*Plus as a workaround. Alternatively, use the UTL_RAW.CAST_TO_RAW function in your query's SELECT clause to get your data as the RAW data type into a byte buffer, and spool the buffer in Python as bytes (to a binary file).
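The UTL_RAW alternative can be sketched in Python as follows. Since the query itself needs a live database, the fetched value is simulated here with the byte sequence from this thread; cx_Oracle returns Oracle RAW values as Python bytes, and the file name is just an example:

```python
# With a live connection, the fetch would look roughly like:
#     cursor.execute("SELECT UTL_RAW.CAST_TO_RAW(name1) FROM new_table")
#     raw, = cursor.fetchone()   # RAW arrives as Python bytes
# Simulated fetched value (the bytes discussed in this thread):
raw = bytes.fromhex("9639b138")

# Writing bytes to a binary file involves no character set conversion,
# so the stored encoding reaches the file untouched:
with open("dump.bin", "wb") as f:
    f.write(raw)

with open("dump.bin", "rb") as f:
    assert f.read().hex() == "9639b138"
```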

Positever commented 3 years ago

Thank you for the response! I used SQL*Plus spool to generate the file; luckily there is no garbling, so the problem seems to be solved.

swolicki commented 3 years ago

Great! Just remember that this is a workaround for your particular problem. The database should be migrated to AL32UTF8 to be able to use non-GBK characters with Python, Java, Node.js, .NET, or other Unicode-based technologies.

Positever commented 3 years ago

Okay, thank you!

cjbj commented 3 years ago

I'll close this now. Thanks @swolicki
