Open aleksandrs-ledovskis opened 5 months ago
Thanks. I didn't notice that the CESU-8 encoding was supported in ruby.
As for your issue, I suggest that you should use AL32UTF8 in ruby.
# Set NLS_LANG before any "require 'oci8'".
# When once oci8 is required, changing NLS_LANG has no effect.
ENV['NLS_LANG'] = 'american_america.AL32UTF8'
require 'oci8'
There is no difference as a result when 🙈
is inserted in the following two cases.
OCI8.encoding = Encoding::CESU_8
and UTF8 NLS languageOCI8.encoding = Encoding::UTF_8
and AL32UTF8 NLS languageIn the case 1, 🙈
is converted to CESU-8 in ruby-oci8 and sent to the DB.
In the case 2, 🙈
is sent to the DB and converted to CESU-8 in the DB.
However when 🙈
is fetched, there is difference.
In the case 1, 🙈
is fetched as CESU-8. You need to convert it to UTF-8 or so.
On the other hand, in the case 2, 🙈
is fetched as UTF-8. It is more usable.
Well, I tested above in Oracle Database Free 23c. I think that it is same in Oracle 19c but I haven't confirmed it.
In one our project multiple applications are sharing a single large Oracle 19c DB - this includes both Ruby and non-Ruby apps. Recently, we were faced with a curious bug where emoji character persisted in DB from non-Ruby application would be unprocessable (with
"source sequence is illegal/malformed utf-8"
error) when read back in Ruby (on Rails) application.After some debugging it was found that an emoji (🙈) that was saved from non-Ruby application is stored in DB as raw bytes
EDA0BDEDB988
. The same emoji when saved from Ruby application was written to DB instead asF09F9988
. Per helpful table it was then understood that Ruby application writes data in "UTF-8" encoding, but non-Ruby apps are using a "CESU-8" encoding.We took a look (
SELECT * FROM nls_database_parameters
) afterwards at a DB configuration of "CHARACTERSET" values:Obviously, in Oracle's infinite knowledge the "UTF8" here doesn't mean "UTF-8", but rather a weird "CESU-8" encoding form.
Which lead to a curious discovery that our Ruby application is likely using an incorrect connection encoding at OCI8 level. Per our understanding and tests, the
OCI8
gem assigns connection's encoding in "encoding-init.rb" bootup code. If we modified/used an initializer of our own that forcedOCI8.encoding = Encoding::CESU_8
, we could adequately read the original "CESU-8" encoded form of characters and new persistance from Ruby side would also follow the "CESU-8" conventions.In summary, I would like to know, if this line
https://github.com/kubo/ruby-oci8/blob/959482365e18d7f25796de39ebb3908cd674b5e5/lib/oci8/encoding.yml#L388
shouldn't rather read as
Ruby 2.7 added a "CESU-8" encoding, albeit with a slightest caveat concerning handling of
NULL
characters.Information on Ruby, "CESU-8" and Oracle DB is very sparse, so any discussion or suggestion would be helpful!