laurenz / oracle_fdw

PostgreSQL Foreign Data Wrapper for Oracle
http://laurenz.github.io/oracle_fdw/

Emoji and Chinese characters cannot be loaded into Oracle #684

Closed figoliu239 closed 3 months ago

figoliu239 commented 3 months ago

I am loading some Chinese character and emoji data from PostgreSQL to Oracle. Most Chinese characters are displayed properly, but the emojis and some Chinese characters become ? in Oracle.

For example, the string '😊😔於本週Google針2025座枱年歷' becomes '??於本週Google針2025座?年歷' in the Oracle DB.

I have tried changing the nls_lang option, but it still doesn't work.

ALTER FOREIGN DATA WRAPPER oracle_fdw OPTIONS (ADD nls_lang 'AMERICAN_AMERICA.ZHT16BIG5');

Database setting:

POSTGRES:

SHOW server_encoding;
 server_encoding
-----------------
 UTF8
(1 row)

ORACLE:

select * from V$NLS_PARAMETERS
where parameter in ('NLS_LANGUAGE','NLS_TERRITORY','NLS_CHARACTERSET','NLS_NCHAR_CHARACTERSET');
-----------------
NLS_LANGUAGE    AMERICAN
NLS_TERRITORY   AMERICA
NLS_CHARACTERSET    ZHT16BIG5
NLS_NCHAR_CHARACTERSET  AL16UTF16
mkgrgis commented 3 months ago

@figoliu239, do you have a standard character map document for ZHT16BIG5, like https://www.unicode.org/charts/PDF/ for Unicode? Did you make sure that ZHT16BIG5 contains the requested characters?

figoliu239 commented 3 months ago

Yes. I have a UTF-8 encoded text file with some Chinese and emoji characters. I load it into PostgreSQL with a Java batch insert and into Oracle with SQL*Loader, and I can view the correct characters in both databases.

But if I load it into PostgreSQL and then copy the data to Oracle through oracle_fdw, some characters turn into ?.

I am using oracle_fdw version 1.2, which is a pretty old version. Does the version matter?

laurenz commented 3 months ago

Version 1.2? Are you using some ancient PostgreSQL version that is not supported by recent versions of oracle_fdw? If not, you should definitely upgrade. I am not sure whether any of the many changes to oracle_fdw affect your problem; if anything, I'd suspect commit 77242d1fef.

Anyway, I suspect a different problem. The presence of the ? characters suggests that the problem lies in Oracle's character set conversion. Unlike PostgreSQL, which gives you an error if you try to insert a character that cannot be converted to the server encoding, Oracle silently converts such untranslatable characters to “replacement characters” such as ?.

If your PostgreSQL server encoding is UTF8, oracle_fdw will set the Oracle client character set to AL32UTF8. So I guess that Oracle has some kind of trouble converting the characters from AL32UTF8 to ZHT16BIG5.

You could verify that with the following experiment:

export NLS_LANG=AMERICAN_AMERICA.AL32UTF8
# connect to Oracle with sqlplus
sqlplus ...
SQL> INSERT INTO tab (col) VALUES ('😊😔於本週Google針2025座枱年歷');

Then check if you see the same behavior. If yes, that confirms that Oracle's character set conversion is at fault. Since PostgreSQL has no equivalent to the ZHT16BIG5 character set, your only option would be to report that as a bug to Oracle and hope that they fix (or improve) their character set conversion routines.
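To narrow down where the replacement happens, you could also inspect the bytes Oracle actually stored; a quick sketch, reusing the tab/col names from the experiment above:

-- DUMP with format 1016 prints the stored bytes in hex along with the column's
-- character set, so a ? that is really stored as 0x3F is easy to spot
SQL> SELECT col, DUMP(col, 1016) FROM tab;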

figoliu239 commented 3 months ago

My PostgreSQL version is 15.5, and I am using the default oracle_fdw.

When I check the version in pg_available_extensions, it shows that the default version is 1.2. But when I call SELECT oracle_diag();, it shows oracle_fdw 2.5.0, PostgreSQL 15.5, Oracle client 19.19.0.0.0, ORACLE_HOME=/data/instantclient_19_19.

So I am not sure whether the version is 1.2 or 2.5.

Besides, I also tried to run the INSERT statement in sqlplus and SQL Developer:

--create table
create table ofdm_dw.Testing_emoji(
seq number(10),
nstring nvarchar2(2000)
);

--run in linux
export NLS_LANG=AMERICAN_AMERICA.AL32UTF8
sqlplus -s /@MY_DB << EOF > /dev/null
INSERT INTO ofdm_dw.Testing_emoji
(seq,nstring)
VALUES
(1,'????於本週Google針2025座?年歷'
);
INSERT INTO ofdm_dw.Testing_emoji
(seq,nstring)
VALUES
(2,n'????於本週Google針2025座?年歷'
);
commit;
EOF

Result using sqlplus:

Row 1: ????於本週Google針2025座?年歷

Row 2: ????於本週Google針2025座?年歷

Result using sqldeveloper:

Row 1: ????於本週Google針2025座?年歷

Row 2: 😊😔於本週Google針2025座枱年歷

I think your guess is right. It seems there is some problem with sqlplus converting AL32UTF8 to ZHT16BIG5.

laurenz commented 3 months ago

Ah, I see. 1.2 is the extension version. So you aren't using ancient software, great.
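For what it's worth, the two numbers come from different places; a quick way to see both (nothing here is specific to your setup):

-- version of the extension's SQL script, which is what pg_available_extensions reports
SELECT extversion FROM pg_extension WHERE extname = 'oracle_fdw';

-- version of the compiled library, the Oracle client and ORACLE_HOME, as you already saw
SELECT oracle_diag();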

What confuses me is that in your comment above, the INSERT statement you show has the question marks in it.

The other possible source of such problems is the configuration of your shell. So, to be entirely certain that it is an Oracle problem, run the experiment with a local PostgreSQL table and see if you get the same result:

CREATE TEMP TABLE test(s text);
INSERT INTO test VALUES ('😊😔於本週Google針2025座枱年歷');
SELECT convert_to(s, 'UTF8') FROM test;

The result should be

\xf09f988af09f9894e696bce69cace980b1476f6f676c65e9879d32303235e5baa7e69eb1e5b9b4e6adb7

If you get the same result, we can be certain that the string arrives in your PostgreSQL database correctly, and the data loss must happen between the Oracle client and the Oracle server.
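If it does match, you could make the loss visible directly by round-tripping the string through the foreign table; a sketch, assuming a foreign table called emoji_ora mapped onto your Oracle table (the name is only illustrative):

INSERT INTO emoji_ora (nstring) VALUES ('😊😔於本週Google針2025座枱年歷');
-- reading the row back goes through the Oracle client again; if these bytes
-- differ from the \xf09f... value above, the characters were lost on the Oracle side
SELECT convert_to(nstring, 'UTF8') FROM emoji_ora;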

A potential escape route would be to create a PostgreSQL database with a different encoding like EUC_TW and hope that Oracle has better conversion routines from ZHT32EUC to ZHT16BIG5, but 😊 cannot be encoded in EUC_TW, so that wouldn't get us anywhere.

I hate to tell you that you'll have to complain to Oracle about this, because I know what the odds are of Oracle actually fixing it. Oracle has ways to customize character sets, but I didn't find anything about customizing the character set conversion between client and server.

The problem would disappear if the Oracle database had AL32UTF8 as database character set, but I guess that is not an option for you.

figoliu239 commented 3 months ago

Sorry for the confusion. The emojis are correct on my Linux server; I think there was an error when I copied the INSERT statement to GitHub.

To ensure the data is right, I also converted the string like this:

CREATE TEMP TABLE test(s text);
INSERT INTO test VALUES ('😊😔於本週Google針2025座枱年歷');
SELECT encode(convert_to(s, 'UTF8'), 'hex') FROM test;

The result is

f09f988af09f9894e696bce69cace980b1476f6f676c65e9879d32303235e5baa7e69eb1e5b9b4e6adb7

So we can be sure that the data in the INSERT statement and in PostgreSQL are both correct.

As for your recommended workaround, it is not possible because there is a lot of historical data in both the PostgreSQL and Oracle servers. The risk of converting the database character set is too high.

But there is good news: I can now insert the emojis into the Oracle DB. I found that there is an nchar option after checking the manual and old issues.

I set the nchar option to 'on' on the existing foreign server oracledb and then reloaded the data.

ALTER SERVER oracledb options(ADD nchar 'on');

I can now see the correct emoji data in the Oracle database through SQL Developer.
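For anyone who finds this later, a minimal sketch of the setup that works; the foreign table name and column mapping are only illustrative:

-- switch the existing foreign server to NCHAR handling
ALTER SERVER oracledb OPTIONS (ADD nchar 'on');

-- foreign table mapped onto the Oracle table created above
CREATE FOREIGN TABLE testing_emoji_ora (
    seq      numeric(10),
    nstring  text
) SERVER oracledb OPTIONS (schema 'OFDM_DW', table 'TESTING_EMOJI');

-- with nchar 'on', values for the NVARCHAR2 column are handled in the national
-- character set (AL16UTF16 here) instead of ZHT16BIG5, so the emoji survive
INSERT INTO testing_emoji_ora (seq, nstring)
VALUES (3, '😊😔於本週Google針2025座枱年歷');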

laurenz commented 3 months ago

Oh, interesting, and great that nchar = on makes it work for you.

I had briefly thought about that, but then I thought that it only applies to single-byte character sets. I guess I will have to adapt the documentation.

So we can close the issue, right?

laurenz commented 3 months ago

Is that an NVARCHAR2 data type on the Oracle side?

figoliu239 commented 3 months ago

Yeah thanks man. We can close the issue.

figoliu239 commented 3 months ago

Yes, the data type on the Oracle side is NVARCHAR2, because the field is a free-text field from the frontend. It may contain any character, for example Chinese and emoji.

laurenz commented 3 months ago

Thank you for your feedback and your active help in resolving this problem!