Failures with NCLOB text containing four byte Unicode emoji

zzzeek commented 2 years ago

Hi Anthony / Christopher -

I've narrowed down at least one of the failures I'm having re: LOB to the NCLOB datatype, and it seems to be independent of whether or not I'm using setinputsizes and whether or not I use an outputtype handler. What's most disturbing is that the failure is non-deterministic, failing only sometimes for large strings of text that have many random occurrences of characters in random orders.

This is using NLS_LANG=AMERICAN_AMERICA.AL32UTF8 so we would normally assume the full range of Unicode codepoints should round trip accurately.

The character in question is this: 🐍 , the Python snake (on some GUI elements it looks more like a duck, though if I look closely, it's still a snake).

Oracle server 18c, client version info:

>>> import sys
>>> import platform
>>> 
>>> print("platform.platform:", platform.platform())
platform.platform: Linux-5.14.9-200.fc34.x86_64-x86_64-with-glibc2.33
>>> print("sys.maxsize > 2**32:", sys.maxsize > 2**32)
sys.maxsize > 2**32: True
>>> print("platform.python_version:", platform.python_version())
platform.python_version: 3.10.0
>>> 
>>> import cx_Oracle
>>> print("cx_Oracle.version:", cx_Oracle.version)
cx_Oracle.version: 8.3.0
>>> print("cx_Oracle.clientversion:", cx_Oracle.clientversion())
cx_Oracle.clientversion: (19, 3, 0, 0, 0)

The test case below inserts and returns a row that contains the character embedded in a randomized field of characters. the larger the field of characters, the more chance that it fails. See sample run at the end.

import cx_Oracle
import random

conn = cx_Oracle.connect(user="scott", password="tiger", dsn="oracle18c")

def output_type_handler(cursor, name, default_type, size, precision, scale):
    if default_type in (
        cx_Oracle.CLOB,
        cx_Oracle.NCLOB,
    ):
        return cursor.var(cx_Oracle.LONG_STRING, size, cursor.arraysize)

# fetching LOB without the output type handler also doesn't help
conn.outputtypehandler = output_type_handler

def setup():
    cursor = conn.cursor()

    try:
        cursor.execute("drop table long_text")
    except:
        pass

    cursor.execute(
        """
    CREATE TABLE long_text (
        x INTEGER,
        y NCLOB,
        z INTEGER
    )
    """
    )

    cursor.close()

def run_test(datasize):

    cursor = conn.cursor()

    # the third character is the failure character
    word_seed = u"ab🐍’«cdefg"

    data = u" ".join(
        "".join(random.choice(word_seed) for j in range(150))
        for i in range(datasize)
    )

    # print(f"test data has {len(data.split('🐍'))} of the problematic character 🐍")

    # this actually doesn't help
    # cursor.setinputsizes(
    #     **{"x": cx_Oracle.NUMBER, "y": cx_Oracle.NCLOB, "z": cx_Oracle.NUMBER}
    # )

    cursor.execute(
        "INSERT INTO long_text (x, y, z) VALUES (:x, :y, :z)",
        {"x": 5, "y": data, "z": 10},
    )

    cursor.execute(
        "SELECT long_text.x, long_text.y, long_text.z FROM long_text"
    )
    row = cursor.fetchone()

    try:
        assert row[1] == data
    finally:
        cursor.close()
        conn.rollback()

def harness(datasize, num_runs):
    for i in range(num_runs):
        try:
            run_test(datasize)
        except AssertionError as err:
            print(f"datasize: {datasize} test run {i} failed")
        else:
            print(f"datasize: {datasize} test run {i} succeeded")

setup()

for datasize in (10, 25, 50, 100, 250):
    print(f"\ndatasize: {datasize}")
    harness(datasize, 10)

sample run

$ python test3.py 

datasize: 10
datasize: 10 test run 0 succeeded
datasize: 10 test run 1 succeeded
datasize: 10 test run 2 succeeded
datasize: 10 test run 3 succeeded
datasize: 10 test run 4 succeeded
datasize: 10 test run 5 succeeded
datasize: 10 test run 6 succeeded
datasize: 10 test run 7 succeeded
datasize: 10 test run 8 succeeded
datasize: 10 test run 9 succeeded

datasize: 25
datasize: 25 test run 0 succeeded
datasize: 25 test run 1 succeeded
datasize: 25 test run 2 succeeded
datasize: 25 test run 3 succeeded
datasize: 25 test run 4 succeeded
datasize: 25 test run 5 succeeded
datasize: 25 test run 6 succeeded
datasize: 25 test run 7 succeeded
datasize: 25 test run 8 succeeded
datasize: 25 test run 9 succeeded

datasize: 50
datasize: 50 test run 0 succeeded
datasize: 50 test run 1 succeeded
datasize: 50 test run 2 succeeded
datasize: 50 test run 3 succeeded
datasize: 50 test run 4 succeeded
datasize: 50 test run 5 succeeded
datasize: 50 test run 6 succeeded
datasize: 50 test run 7 succeeded
datasize: 50 test run 8 succeeded
datasize: 50 test run 9 succeeded

datasize: 100
datasize: 100 test run 0 failed
datasize: 100 test run 1 succeeded
datasize: 100 test run 2 succeeded
datasize: 100 test run 3 failed
datasize: 100 test run 4 succeeded
datasize: 100 test run 5 failed
datasize: 100 test run 6 succeeded
datasize: 100 test run 7 failed
datasize: 100 test run 8 failed
datasize: 100 test run 9 succeeded

datasize: 250
datasize: 250 test run 0 succeeded
datasize: 250 test run 1 failed
datasize: 250 test run 2 failed
datasize: 250 test run 3 failed
datasize: 250 test run 4 failed
datasize: 250 test run 5 failed
datasize: 250 test run 6 failed
datasize: 250 test run 7 failed
datasize: 250 test run 8 failed
datasize: 250 test run 9 failed

zzzeek commented 2 years ago

just to make sure I'm not doing anything silly I ported the above program to the mysqlclient driver and ran against MySQL with charset=utf8mb4, there's no failures on that platform, so this is definitely something on the Oracle side.

zzzeek commented 2 years ago

character set info reported by the server

>>> 
>>> cursor.execute("""
... SELECT value AS db_charset
... FROM nls_database_parameters
... WHERE parameter = 'NLS_CHARACTERSET'
... """)
<cx_Oracle.Cursor on <cx_Oracle.Connection to scott@oracle18c>>
>>> 
>>> cursor.fetchall()
[('AL32UTF8',)]
>>> cursor.execute("""
... SELECT value AS db_ncharset
... FROM nls_database_parameters
... WHERE parameter = 'NLS_NCHAR_CHARACTERSET'
... """)
<cx_Oracle.Cursor on <cx_Oracle.Connection to scott@oracle18c>>
>>> cursor.fetchall()
[('AL16UTF16',)]
>>> cursor.execute("""
... SELECT DISTINCT client_charset AS client_charset
... FROM v$session_connect_info
... WHERE sid = SYS_CONTEXT('USERENV', 'SID')
... """)
<cx_Oracle.Cursor on <cx_Oracle.Connection to scott@oracle18c>>
>>> cursor.fetchall()
[('AL32UTF8',)]

anthony-tuininga commented 2 years ago

I am able to replicate with the script you provided. It is not, in fact, intermittent, but very consistent. It depends on the data that is being supplied to the database -- it was intermittent for you simply because you are randomly creating the string to send to the database! I'll see what the problem is and report back once I figure it out. Thanks for supplying this to me!

zzzeek commented 2 years ago

Yes I figured there's some pattern in my random production of strings that causes the issue. looks like a fairly deep issue w/ the encoding.

cjbj commented 2 years ago

@anthony-tuininga has opened an Oracle bug against the DB itself. The number is 33691692

anthony-tuininga commented 2 years ago

To provide a little more information: the issue has nothing to do with insertng the data, only with fetching it using the output type handler.

zzzeek commented 2 years ago

OK for my script above, I can take the outputtypehandler away totally and it still fails in the same way; change the row[1] to row[1].read() and I get the same pattern of early success / late failures. can you try the script below?

import cx_Oracle
import random

conn = cx_Oracle.connect(user="scott", password="tiger", dsn="oracle18c")

def setup():
    cursor = conn.cursor()

    try:
        cursor.execute("drop table long_text")
    except:
        pass

    cursor.execute(
        """
    CREATE TABLE long_text (
        x INTEGER,
        y NCLOB,
        z INTEGER
    )
    """
    )

    cursor.close()

def run_test(datasize):

    cursor = conn.cursor()

    # the third character is the failure character
    word_seed = u"ab🐍’«cdefg"

    data = u" ".join(
        "".join(random.choice(word_seed) for j in range(150))
        for i in range(datasize)
    )

    cursor.execute(
        "INSERT INTO long_text (x, y, z) VALUES (:x, :y, :z)",
        {"x": 5, "y": data, "z": 10},
    )

    cursor.execute(
        "SELECT long_text.x, long_text.y, long_text.z FROM long_text"
    )
    row = cursor.fetchone()

    try:
        assert row[1].read() == data
    finally:
        cursor.close()
        conn.rollback()

def harness(datasize, num_runs):
    for i in range(num_runs):
        try:
            run_test(datasize)
        except AssertionError as err:
            print(f"datasize: {datasize} test run {i} failed")
        else:
            print(f"datasize: {datasize} test run {i} succeeded")

setup()

for datasize in (10, 25, 50, 100, 250):
    print(f"\ndatasize: {datasize}")
    harness(datasize, 10)

anthony-tuininga commented 2 years ago

I already tried that...and I don't get any failures that way! I am using 21.3 database, however, so not sure if this is something that may have been fixed? I'll try with an older database a little later and let you know.

zzzeek commented 2 years ago

im seeing it on 18c which is not too old

anthony-tuininga commented 2 years ago

Ok. I tried with a 12.1 database. It works fine, there. I did need to add this code:

cursor.setinputsizes(y=cx_Oracle.DB_TYPE_NVARCHAR)

That's because that database is using a single-byte character set, though. If the data type is NCLOB you should use that setinputsizes() call to avoid the double conversion (and possible data loss when the primary character set is a single-byte character set). For CLOB and BLOB no setinputsizes() call should be needed.

Since you're getting failures, can you show the first place in the string where things are failing?

zzzeek commented 2 years ago

are you able to try with an 18c database? it's the same test case, so I dont know what method you used to narrow down the pattern of characters that doesnt come back, I guess you could trap the error then compare the strings character by character?

cjbj commented 2 years ago

I just started 18c XE for @anthony-tuininga

(Also our doc on setinputsizes needs some love....)

anthony-tuininga commented 2 years ago

I am able to replicate with 18c. I will look into that as well and get back to you!

anthony-tuininga commented 2 years ago

@zzzeek, can you adjust your test case to add the following?

cursor.setinputsizes(y=cx_Oracle.DB_TYPE_NVARCHAR)

just prior to the execute that inserts the row into the database? That resolves the issue for me.

Similarly, for the output type handler, use this:

    if default_type in (oracledb.DB_TYPE_CLOB, oracledb.DB_TYPE_NCLOB):
        return cursor.var(oracledb.DB_TYPE_NVARCHAR, size, cursor.arraysize)

That resolves the fetch side. We'll look at improving the documentation and suggesting an improvement to SQLAlchemy to resolve these matters long-term -- please confirm that this resolves the issue for you, too! Thanks.

zzzeek commented 2 years ago

the second test works if I use DB_TYPE_NVARCHAR as well as if I use NCLOB. But DB_TYPE_NVARCHAR is the one that wont build up a separate statement handle, so I should use that one.

the original test is giving me a new error though:

 cx_Oracle.DatabaseError: DPI-1037: column at array position 0 fetched with error 1406

here's exactly how I am running it:

import cx_Oracle
import random

conn = cx_Oracle.connect(user="scott", password="tiger", dsn="oracle18c")

def output_type_handler(cursor, name, default_type, size, precision, scale):
    if default_type in (
        cx_Oracle.CLOB,
        cx_Oracle.NCLOB,
    ):
        return cursor.var(cx_Oracle.DB_TYPE_NVARCHAR, size, cursor.arraysize)

# fetching LOB without the output type handler also doesn't help
conn.outputtypehandler = output_type_handler

def setup():
    cursor = conn.cursor()

    try:
        cursor.execute("drop table long_text")
    except:
        pass

    cursor.execute(
        """
    CREATE TABLE long_text (
        x INTEGER,
        y NCLOB,
        z INTEGER
    )
    """
    )

    cursor.close()

def run_test(datasize):

    cursor = conn.cursor()

    # the third character is the failure character
    word_seed = u"ab🐍’«cdefg"

    data = u" ".join(
        "".join(random.choice(word_seed) for j in range(150))
        for i in range(datasize)
    )

    # print(f"test data has {len(data.split('🐍'))} of the problematic character 🐍")

    # this actually doesn't help
    cursor.setinputsizes(
        **{"x": cx_Oracle.NUMBER, "y": cx_Oracle.DB_TYPE_NVARCHAR, "z": cx_Oracle.NUMBER}
    )

    cursor.execute(
        "INSERT INTO long_text (x, y, z) VALUES (:x, :y, :z)",
        {"x": 5, "y": data, "z": 10},
    )

    cursor.execute(
        "SELECT long_text.x, long_text.y, long_text.z FROM long_text"
    )
    row = cursor.fetchone()

    try:
        assert row[1] == data
    finally:
        cursor.close()
        conn.rollback()

def harness(datasize, num_runs):
    for i in range(num_runs):
        try:
            run_test(datasize)
        except AssertionError as err:
            print(f"datasize: {datasize} test run {i} failed")
        else:
            print(f"datasize: {datasize} test run {i} succeeded")

setup()

for datasize in (10, 25, 50, 100, 250):
    print(f"\ndatasize: {datasize}")
    harness(datasize, 10)

anthony-tuininga commented 2 years ago

Replace the line in the output type handler that creates the variable with this one:

        return cursor.var(cx_Oracle.DB_TYPE_NVARCHAR, 131072, cursor.arraysize)

That appears to resolve the issue. The size has to be longer than 32767 in order for the "LONG" conversion to kick in. With that change your test works on the 18c database that Chris provided -- so in theory it should work for you, too. We'll look into making that a little less messy!

zzzeek commented 2 years ago

okey doke, that works, so this gives me a path to removing the BLOB setinputsize. 131072 is...number of units for a 1G size? it would be helpful if I could document the origin of this number ?

zzzeek commented 2 years ago

that is, I hope this number isn't a fixed buffer size I have to estimate based on data size, since we dont have that

anthony-tuininga commented 2 years ago

No, it just has be a number larger than 32767. Any number will do. :-) I don't like that myself so will see about making it a little less "disgusting". ;-) Thanks for confirming that the adjusted code works for you!

oracle / python-cx_Oracle

Failures with NCLOB text containing four byte Unicode emoji #596