gregory-marton opened 8 years ago
Same issue in pandas, so it's not (just) a csv reading issue. Also, when I remove the BOM, I get a related error, even when working with pandas:
UnicodeDecodeError Traceback (most recent call last)
So it's reading the column names in the first line just fine, and croaking when we try to use a column name containing the BOM in a query. I think it's a reasonable requirement for the moment (though still a bug in the long term) that column names be ASCII-only. Suggest replacing the header line with ASCII names?
If the content also needs to be ascii only, that's a bigger concern.
So it turns out there was nothing special to be encoded weirdly. I opened the csv in emacs, said M-x set-buffer-file-coding-system utf-8, M-x set-buffer-file-coding-system undecided-unix, and then tried again, and everything looked a lot happier. Sending the file back...
We actually ended up having to convert the file lossily all the way down to ASCII. I think this may be the "bigger concern" case. Do we have any tests around this?
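For reference, the lossy conversion described above can be sketched in a few lines of Python: decompose accented characters with NFKD normalization, then drop anything that still isn't ASCII (including the BOM). The function name and sample string here are just illustrative, not part of the actual import code.

```python
import unicodedata

def to_ascii_lossy(s):
    # NFKD splits e.g. 'ü' into 'u' + combining diaeresis; encoding with
    # errors='ignore' then drops the combining mark, the BOM, and any
    # other non-ASCII character outright.
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')

print(to_ascii_lossy('\ufeffArtist: D\u00fcrer'))  # -> 'Artist: Durer'
```

This is lossy by design: characters with no ASCII decomposition vanish silently, which is exactly the "bigger concern" case.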
This is very puzzling. I can't imagine how you could have gotten to that place without an earlier failure. Can you share the CSV file?
Github will not let me upload it, but I have sent it to you on flowdock, @riastradh-probcomp . It's artwork_data.csv.
Well, that was easy.
>>> bdb.sql_execute(u'create table foo(\ufeffquagga text)')
<bayeslite.bql.BayesDBCursor object at 0x7f891b8a79d0>
>>> bdb.sql_execute('pragma table_info(foo)').fetchall()
[(0, u'\ufeffquagga', u'text', 0, None, 0)]
Amazingly, it happens to work if you use '...' % (...) instead of '...'.format(...).
>>> bdb.sql_execute('select %s from foo' % (u'\ufeffquagga',)).fetchall()
[]
>>> bdb.sql_execute('select {} from foo'.format(u'\ufeffquagga',)).fetchall()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
Sadly, that's a non-solution for the future.
http://stackoverflow.com/questions/12382719/python-way-of-printing-with-format-or-percent-form
The really really really easy stupid way to work around this would be to just eliminate the broken BOM at the beginning of the file, which shouldn't be there anyway -- BOMs are intended to exist only in UTF-16 or UTF-32, not in UTF-8.
It would be nice if we could prevent anyone from accidentally creating a table with a non-US-ASCII column name, since most of the code is not set up to handle that. Unfortunately, as I demonstrated, this is always possible via sql_execute, and someone is likely to do it via sql_execute, e.g. in the CSV or pandas import code.
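A guard at the import boundary could be as simple as the following sketch; the function name and error message are hypothetical, not existing bayeslite API:

```python
def check_ascii_columns(columns):
    """Reject any column name outside US-ASCII with a clear error."""
    for name in columns:
        try:
            name.encode('ascii')
        except UnicodeEncodeError:
            raise ValueError('non-ASCII column name: %r' % (name,))

check_ascii_columns(['quagga', 'year'])  # passes silently
try:
    check_ascii_columns(['\ufeffquagga'])
except ValueError as e:
    print(e)  # -> non-ASCII column name: '\ufeffquagga'
```

Of course, as noted, this only helps at the CSV/pandas entry points; nothing stops a direct sql_execute from creating such a column anyway.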
Another approach would be to consistently use % instead of .format to suppress most manifestations of this issue, and hope that we don't run into, e.g., inconsistent normalization during encoding or decoding, and pretend that the problem doesn't exist for a while longer.
Another approach would be to decide how we plan to seriously support non-US-ASCII column names. Unfortunately, since sqlite3 is so permissive about column names, we probably can't just define a bijection between Unicode strings and sqlite3 column names as octet sequences.
Whatever resolution we choose, we should of course add some tests to exercise the intentionally supported cases and reasonable error messages for a reasonable class of intentionally (if only for the moment) unsupported cases like this one.
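A hypothetical test along these lines, using the stdlib sqlite3 module as a stand-in for the apsw-backed bayeslite connection (on Python 3, where all strings are unicode and the %-vs-.format distinction above disappears):

```python
import sqlite3

# A unicode (BOM-prefixed) column name round-trips through sqlite3
# when quoted as an identifier.
conn = sqlite3.connect(':memory:')
conn.execute('create table foo("\ufeffquagga" text)')
cols = [row[1] for row in conn.execute('pragma table_info(foo)')]
assert cols == ['\ufeffquagga']

# On Python 3 both interpolation styles produce the same query text.
q1 = 'select "%s" from foo' % (cols[0],)
q2 = 'select "{}" from foo'.format(cols[0])
assert q1 == q2
assert conn.execute(q1).fetchall() == []
```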
I think it is pretty unlikely that % will actually go away in any manifestation of Python that anyone cares about for the foreseeable future. That would break so much deployed code it is absurd.
It appears that apsw will reject non-UTF-8 SQL queries, although the sqlite3 module will not. Both of them will by default raise an exception if a query returns a value with TEXT affinity that is not a valid UTF-8 octet sequence. Since we're using apsw, not the sqlite3 module, I'm inclined to say that maybe we should just require all column names to be Unicode strings -- represented in sqlite3 by valid UTF-8 octet sequences -- and to do whatever work is needed to make sure that works consistently everywhere.
It is possible that someone might make a database, outside our purview, that causes this to fail -- but the same is true anyway in more ways than we can count. It is possible that someone made such a database using bayeslite before we switched to apsw, but I doubt it -- and in any case it would have caused failures with the sqlite3 module then anyway.
Recommendation, to go along with wanting to be py3-compatible: let pandas handle this, and stop supporting direct CSV reads?