ibmdbanalytics / ibmdbpy

A Pandas-like SQL-wrapper for in-database analytics with IBM Db2.
https://pythonhosted.org/ibmdbpy/
Other
23 stars 11 forks source link

Unicode, large columns, and BOOLEAN columns cause exceptions #40

Open theoharr opened 6 years ago

theoharr commented 6 years ago

My system:

CentOS Linux release 7.3.1611 (Core)
ibmdbpy==0.1.4
pyodbc==4.0.24
pypyodbc==1.3.4
db2wh-smp = ibmdashdb/local:v2.5.0-20180222-2218-local

I believe most of the limitations are actually due to problems in the odbc connector (pypyodbc). In any case, the problems I've identified are:

  1. Unicode characters are not returned
  2. Large fields cause an exception. For example, it appears any VARCHARS that are longer than 1024 chars hit this problem.
  3. BOOLEAN columns cause an exception

I set up this table to test against:

create table tjhtest (name VARCHAR(128), bigfield VARCHAR(4096), bool BOOLEAN);
insert into tjhtest (name, bigfield, bool) values('👻♩  ⓗ𝕖ℓlO  🍩★', 'this has unicode', True);
insert into tjhtest (name, bigfield, bool) values('this is a large (2kb) field', 'MMwv7rwkfsaWm4oIELFVZ44ZTEOCHdNNolRQuEP7dExp4evmIi4S6NtMI6fpFguxepo1hit6fklBTOP88OG4snaihhOA28n9J8nmgzfbGbb2Mho557u4lImfWH1D6D9B9NueSoj1Hmh0AJ8lMPOY824yKrXDK7929HjVEv7Ucyan9OkWN8wZHrYj5CbEqJYgw5izKypgzFGT2n5CjoT517UCXhPAj4S0zqllrQIsobNTS48bNtsLE1FADKAPSfHiTARNGaERUsrvAsnBBM4HaSmiY81amAaSPeYiBRuiC8wESXqkmUGTpr9hcKyr6eHVd2v1s3TyNe4w50doGDLHVNBgPQKOAhyC8O4uY9M8qeABi04ZYbiXekqIZHNvSrvmSZfXfEbjbRcqNgQEr8mld6gZGfq32aoMRuzHu2PF0a6bRNNUGrkrxJfAAIPARcu2bFDvVrPyPtAB9hAm2NBXaafjsvgcDBBZRxJnsx2wWdyhUtVErNN6hatX4VuLee8lSzKaS4HFb6hqlF8g3ebJHkNDbF3qCAXGtrFTB4yiybaHabMhzNZq5pXbOKnVIO0plIBqDiKeRIfNo6Hiy26nwlwm8afAwSsBt1DhqLfk4WZ4OG8kUtqdaRJOBeQJRqKvissTjOM3oF5gqHgiJybqjn8uwHyjZmJMgbuEQosVaBsQytdItWMBfpbKQXdctpg7kOmG1kFx6SV19yR2vLIHQFK8MBBm6UbajOgE59yHnLZUzbaQmq2xq2RLZ7S1FmRt4s3rNBYE33P62cGYcqf6khTnJ2sy2gmkCIOHK1akcYc2oi9GcmLGfHBR812AIF7RQ3yhLkQ2Q0EGRG9sV3A6xAk0FgtgS4cWhLzfKllX7RH4inXbwy4hBf3jywUYQuxH3Efm0f1GecJRgFdNqi1I3NvVeUnuGWmELZJqV3tJlrLdWdbx047tBQlDqBlOOPW4Wcv6JedmeHeguId6HumeiiDMqSkH94IG3djkeOIb8t4fL125rjwVxVeNp37CR7zNwwTIcVNxJj2fwNneZQ8j4hP22pHlCFdVDLPZOvbqsj9nddcOUZhdGnYTzMNcvuuxpPSeZSKdyP351p8dI6tgPGrSYkbdiGdOWxcBwg2nTWFzVD5vaH5Zn1LR8Fo4JXs7LP74PAhyqr125jjDFxQNB0SmoINvvyjPCS7wd8igZfF7BME7nodMsbjoJNx87BVxu7iSQJnB5Vck4DG3av5ZDlfmOsVD85a2r50P5m5g2dcaSWmkw1u1ux5zaRUESXG4nLh9yMdnG0veQ1wtXDpWAqg72AFDuG8wTH3SV5pzRaShinhP8bHBlH7Vp9JNi7oBTF0iq5UuSSjVVb7ld7bz7w7BkWYW4QSvINUp3RgdtCW1IrYGqyQODVPiOFTUZmHTG4oSMsCO0JywatEH41QqzpIyrpnLNcQLaV6mcFex2DqMjo2U4CkdQYGHUxLRqrLdt09AZ5BIHXGpIvMo3lFVCr2eqvpJfhT4WHRupkrAND8G80Vaver02xbJYgqQshkB4To9PJ59XvPLPR38KWKAWaS4iXB0yTiWwwwfxNIDqS2bW7eLqv0J7V7HCzbAOIqNG06BM1NleOIuvGnxTUutDBIhJwGCW0TaaxpiSsJGTarLMWv3VgenJfrFxE5xdW3dV14QFFNXLjzAxoY3OTgkBCIj3SKdE4tMcWgb7ACq0xcyQCopTNtnkokItjP8qKgezfA2KEUnA7r2QrFOSM27MEI3xlgsXOpQRujeGAuHxjAhGdsIc3LmhIj7wMtwA5kjrfVuUS724lrJoA7dOawzQgRiBTY55efKEvc2uGyzmMWb9oOj5ivLDMk4JA8TK9N1jTgNqr4nPzfugKyLBmeHNuWmOk3Cu0ecQBdeEOi4hzSEiLWFEWCs5UUmNznHhAaFwhO3EQFNnunnf3gAp8VnX3IEK4kcJEoDAf1d4m2ZSZeQp4sXjajc8TCZZTvfrtsPnz1kqhx4zY5GnA0cVFyfyQbGywzJIE94JbTBho4P2Bn7uC1WHchIii52gnb3xGXC', False);

If I query the table from the db2wh ui I get:

select * from tjhtest;

image

When I query the name, I get nothing:

query = "SELECT name from tjhtest"
idadb = IdaDataBase('dashdb',uid='bluadmin',pwd='<PW>',verbose=True)
print(idadb.ida_query(query))
> SELECT name from tjhtest
None

When I query the large field, I get an exception:

query = "SELECT bigfield from tjhtest"
idadb = IdaDataBase('dashdb',uid='bluadmin',pwd='<PW>',verbose=True)
print(idadb.ida_query(query))

> SELECT bigfield from tjhtest
Traceback (most recent call last):
  File "odbctest.py", line 51, in <module>
    print(idadb.ida_query(query))
  File "/usr/lib/python3.4/site-packages/ibmdbpy/base.py", line 762, in ida_query
    return sql.ida_query(self, query, silent, first_row_only, autocommit)
  File "/usr/lib/python3.4/site-packages/ibmdbpy/sql.py", line 81, in ida_query
    return _ida_query_ODBC(idadb, query, silent, first_row_only, autocommit)
  File "/usr/lib/python3.4/site-packages/ibmdbpy/sql.py", line 118, in _ida_query_ODBC
    result = read_sql(query, idadb._con)
  File "/usr/lib64/python3.4/site-packages/pandas/io/sql.py", line 400, in read_sql
    chunksize=chunksize)
  File "/usr/lib64/python3.4/site-packages/pandas/io/sql.py", line 1453, in read_query
    data = self._fetchall_as_list(cursor)
  File "/usr/lib64/python3.4/site-packages/pandas/io/sql.py", line 1462, in _fetchall_as_list
    result = cur.fetchall()
  File "/usr/lib/python3.4/site-packages/pypyodbc.py", line 1840, in fetchall
    row = self.fetchone()
  File "/usr/lib/python3.4/site-packages/pypyodbc.py", line 1921, in fetchone
    raw_value = ''.join(raw_data_parts)
TypeError: sequence item 0: expected str instance, bytes found

When I query the BOOLEAN, I also get an exception:

query = "SELECT bool from tjhtest"
idadb = IdaDataBase('dashdb',uid='bluadmin',pwd='<PW>',verbose=True)
print(idadb.ida_query(query))

> SELECT bool from tjhtest
Traceback (most recent call last):
  File "odbctest.py", line 51, in <module>
    print(idadb.ida_query(query))
  File "/usr/lib/python3.4/site-packages/ibmdbpy/base.py", line 762, in ida_query
    return sql.ida_query(self, query, silent, first_row_only, autocommit)
  File "/usr/lib/python3.4/site-packages/ibmdbpy/sql.py", line 81, in ida_query
    return _ida_query_ODBC(idadb, query, silent, first_row_only, autocommit)
  File "/usr/lib/python3.4/site-packages/ibmdbpy/sql.py", line 94, in _ida_query_ODBC
    cursor.execute(query)
  File "/usr/lib/python3.4/site-packages/pypyodbc.py", line 1626, in execute
    self.execdirect(query_string)
  File "/usr/lib/python3.4/site-packages/pypyodbc.py", line 1654, in execdirect
    self._UpdateDesc()
  File "/usr/lib/python3.4/site-packages/pypyodbc.py", line 1806, in _UpdateDesc
    self._CreateColBuf()
  File "/usr/lib/python3.4/site-packages/pypyodbc.py", line 1712, in _CreateColBuf
    target_type = SQL_data_type_dict[col_sql_data_type][2]
KeyError: 16

To get around this, I swapped out the odbc connector ... I replaced pypyodbc with pyodbc:

diff --git a/ibmdbpy/base.py b/ibmdbpy/base.py
index 2d39169..f6dc59d 100644
--- a/ibmdbpy/base.py
+++ b/ibmdbpy/base.py
@@ -180,9 +180,10 @@ class IdaDataBase(object):
             SQL_BLOB to SQL_LONGVARBINARY
             SQL_DBCLOB to SQL_WLONGVARCHAR
             """
-            import pypyodbc
+            import pyodbc
+            pyodbc.lowercase=True
             try :
-                self._con = pypyodbc.connect(self._connection_string)
+                self._con = pyodbc.connect(self._connection_string)
             except Exception as e:
                 raise IdaDataBaseError(e.value[1])

@@ -1462,9 +1463,10 @@ class IdaDataBase(object):
             self._check_connection()
         except IdaDataBaseError:
             if self._con_type == 'odbc':
-                import pypyodbc
+                import pyodbc
+                pyodbc.lowercase=True
                 try:
-                    self._con = pypyodbc.connect(self._connection_string)
+                    self._con = pyodbc.connect(self._connection_string)
                 except:
                     raise
                 else:

Once I made that change, the first two tests worked ... BOOLEAN still seems to have problems:

> SELECT name from tjhtest
0                  ♩  ⓗ𝕖ℓlO  ★
1    this is a large (2kb) field
Name: name, dtype: object

> SELECT bigfield from tjhtest
0                                     this has unicode
1    MMwv7rwkfsaWm4oIELFVZ44ZTEOCHdNNolRQuEP7dExp4e...
Name: bigfield, dtype: object

> SELECT bool from tjhtest
None
mineo commented 6 years ago

Some quick investigation:

The first query is only returning None because of bad code in ibmdbpy:

https://github.com/ibmdbanalytics/ibmdbpy/blob/d3ddbe895bf3955e9413c171c4ee990cf9ab1b73/ibmdbpy/sql.py#L100-L101

This catches any and all exceptions during result set retrieval and ignores it. Catching and logging the exception provides a bit more detail:

> SELECT name from tjhtest
'utf-16-le' codec can't decode bytes in position 0-1: unexpected end of data
None

Number 2 (the large field) is https://github.com/jiangwen365/pypyodbc/blob/2aea5719369405743a1420cc2570f1e62f3f4d6f/pypyodbc.py#L1937-L1938. Not sure what's wrong there.

Number 3 (the boolean) works on my setup:

root@b081fa76a550:/code/ibmdbpy/tests# python
Python 3.6.4 (default, Feb 15 2018, 12:56:09) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ibmdbpy import IdaDataBase; idadb = IdaDataBase([...] verbose=True)
>>> query = "SELECT bool from tjhtest"
>>> print(idadb.ida_query(query))
> SELECT bool from tjhtest
0    1
1    0

That's with Db2 Warehouse 2.11, ibmdbpy 0.1.5, pypyodbc 1.3.4 on Python 3.5.

theoharr commented 6 years ago

For Number 2 ... I'm pretty sure the problem is here

https://github.com/jiangwen365/pypyodbc/blob/master/pypyodbc.py#L632

I appears pypyodbc always allocates a 2K buffer for VARCHARs. I believe db2wh uses utf8 encoding by default ... so each char might be 2 bytes? This would explain why things seem to fail for any VARCHAR larger than 1024 characters.

theoharr commented 6 years ago

Hmm, I'm confused. I'm not sure why Number 3 works for you but fails for me. I upgraded my ibmdbpy version to 0.1.5...and my python version is very close. Maybe it has something to do with the db2wh version?

[moadmin@metaocean-tjh ~]$ pip freeze | egrep "ibmdbpy|pypy"
ibmdbpy==0.1.5
pypyodbc==1.3.4

It consistently fails for me:

[moadmin@metaocean-tjh ~]$ python3
Python 3.4.9 (default, Aug 14 2018, 21:28:57)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ibmdbpy import IdaDataBase;
>>> idadb = IdaDataBase('dashdb',uid='bluadmin',pwd='<PW>',verbose=True)
>>> query = "SELECT bool from tjhtest"
>>> print(idadb.ida_query(query))
> SELECT bool from tjhtest
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/ibmdbpy/base.py", line 762, in ida_query
    return sql.ida_query(self, query, silent, first_row_only, autocommit)
  File "/usr/lib/python3.4/site-packages/ibmdbpy/sql.py", line 81, in ida_query
    return _ida_query_ODBC(idadb, query, silent, first_row_only, autocommit)
  File "/usr/lib/python3.4/site-packages/ibmdbpy/sql.py", line 94, in _ida_query_ODBC
    cursor.execute(query)
  File "/usr/lib/python3.4/site-packages/pypyodbc.py", line 1626, in execute
    self.execdirect(query_string)
  File "/usr/lib/python3.4/site-packages/pypyodbc.py", line 1654, in execdirect
    self._UpdateDesc()
  File "/usr/lib/python3.4/site-packages/pypyodbc.py", line 1806, in _UpdateDesc
    self._CreateColBuf()
  File "/usr/lib/python3.4/site-packages/pypyodbc.py", line 1712, in _CreateColBuf
    target_type = SQL_data_type_dict[col_sql_data_type][2]
KeyError: 16

Can you run this query in your env too?

>>> query = "SELECT colname, typename, length  FROM syscat.columns where tabschema='BLUADMIN' AND tabname='TJHTEST' order by colno"
>>> print(idadb.ida_query(query))
> SELECT colname, typename, length  FROM syscat.columns where tabschema='BLUADMIN' AND tabname='TJHTEST' order by colno
    colname typename  length
0      NAME  VARCHAR     128
1  BIGFIELD  VARCHAR    4096
2      BOOL  BOOLEAN       1

I believe the problem is that pypyodbc doesn't have any support for the BOOLEAN type:

https://github.com/jiangwen365/pypyodbc/blob/master/pypyodbc.py#L617

mineo commented 6 years ago

I get the same output as you when querying the system catalog. Do you have access to a system with a more recent Db2 Warehouse container that you could test on? If not, please let me know off-Github and I can find you a system for a few hours.