gregory-marton opened 8 years ago
Same issue in pandas, so it's not (just) a csv reading issue. Also, when I remove the BOM, I get a related error, even when working with pandas:
UnicodeDecodeError Traceback (most recent call last)
So it's reading the column names in the first line just fine, and croaking when we try to use a column name containing the BOM in a query. I think it's a reasonable requirement for the moment (though still a bug in the long term) that column names be ASCII-only. Suggest replacing the header line with ASCII names?
If the content also needs to be ascii only, that's a bigger concern.
So it turns out there was nothing special to be encoded weirdly. I opened the csv in emacs, said M-x set-buffer-file-coding-system utf-8, M-x set-buffer-file-coding-system undecided-unix, and then tried again, and everything looked a lot happier. Sending the file back...
We actually ended up having to convert the file lossily all the way down to ASCII. I think this may be the "bigger concern" case. Do we have any tests around this?
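For reference, the lossy conversion described above can be sketched in a few lines of Python: decompose accented characters with NFKD normalization, then drop anything that still isn't ASCII (including the BOM). The function name and sample string here are just illustrative, not part of the actual import code.

```python
import unicodedata

def to_ascii_lossy(s):
    # NFKD splits e.g. 'ü' into 'u' + combining diaeresis; encoding with
    # errors='ignore' then drops the combining mark, the BOM, and any
    # other non-ASCII character outright.
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')

print(to_ascii_lossy('\ufeffArtist: D\u00fcrer'))  # -> 'Artist: Durer'
```

This is lossy by design: characters with no ASCII decomposition vanish silently, which is exactly the "bigger concern" case.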
This is very puzzling. I can't imagine how you could have gotten to that place without an earlier failure. Can you share the CSV file?
Github will not let me upload it, but I have sent it to you on flowdock, @riastradh-probcomp . It's artwork_data.csv.
Well, that was easy.
>>> bdb.sql_execute(u'create table foo(\ufeffquagga text)')
<bayeslite.bql.BayesDBCursor object at 0x7f891b8a79d0>
>>> bdb.sql_execute('pragma table_info(foo)').fetchall()
[(0, u'\ufeffquagga', u'text', 0, None, 0)]
Amazingly, it happens to work if you use '...' % (...) instead of '...'.format(...).
>>> bdb.sql_execute('select %s from foo' % (u'\ufeffquagga',)).fetchall()
[]
>>> bdb.sql_execute('select {} from foo'.format(u'\ufeffquagga',)).fetchall()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
Sadly, that's a non-solution for the future.
http://stackoverflow.com/questions/12382719/python-way-of-printing-with-format-or-percent-form
The really really really easy stupid way to work around this would be to just eliminate the broken BOM at the beginning of the file, which shouldn't be there anyway -- BOMs are intended to exist only in UTF-16 or UTF-32, not in UTF-8.
It would be nice if we could prevent anyone from accidentally creating a table with a non-US-ASCII column name, since most of the code is not set up to handle that. Unfortunately, as I demonstrated, this is always possible via sql_execute, and someone is likely to do it via sql_execute, e.g. in the CSV or pandas import code.
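A guard at the import boundary could be as simple as the following sketch; the function name and error message are hypothetical, not existing bayeslite API:

```python
def check_ascii_columns(columns):
    """Reject any column name outside US-ASCII with a clear error."""
    for name in columns:
        try:
            name.encode('ascii')
        except UnicodeEncodeError:
            raise ValueError('non-ASCII column name: %r' % (name,))

check_ascii_columns(['quagga', 'year'])  # passes silently
try:
    check_ascii_columns(['\ufeffquagga'])
except ValueError as e:
    print(e)  # -> non-ASCII column name: '\ufeffquagga'
```

Of course, as noted, this only helps at the CSV/pandas entry points; nothing stops a direct sql_execute from creating such a column anyway.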
Another approach would be to consistently use % instead of .format to suppress most manifestations of this issue, and hope that we don't run into, e.g., inconsistent normalization during encoding or decoding, and pretend that the problem doesn't exist for a while longer.
Another approach would be to decide how we plan to seriously support non-US-ASCII column names. Unfortunately, since sqlite3 is so permissive about column names, we probably can't just define a bijection between Unicode strings and sqlite3 column names as octet sequences.
Whatever resolution we choose, we should of course add some tests to exercise the intentionally supported cases and reasonable error messages for a reasonable class of intentionally (if only for the moment) unsupported cases like this one.
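A hypothetical test along these lines, using the stdlib sqlite3 module as a stand-in for the apsw-backed bayeslite connection (on Python 3, where all strings are unicode and the %-vs-.format distinction above disappears):

```python
import sqlite3

# A unicode (BOM-prefixed) column name round-trips through sqlite3
# when quoted as an identifier.
conn = sqlite3.connect(':memory:')
conn.execute('create table foo("\ufeffquagga" text)')
cols = [row[1] for row in conn.execute('pragma table_info(foo)')]
assert cols == ['\ufeffquagga']

# On Python 3 both interpolation styles produce the same query text.
q1 = 'select "%s" from foo' % (cols[0],)
q2 = 'select "{}" from foo'.format(cols[0])
assert q1 == q2
assert conn.execute(q1).fetchall() == []
```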
I think it is pretty unlikely that % will actually go away in any manifestation of Python that anyone cares about for the foreseeable future. That would break so much deployed code it is absurd.
It appears that apsw will reject non-UTF-8 SQL queries, although the sqlite3 module will not. Both of them will by default raise an exception if a query returns a value with TEXT affinity that is not a valid UTF-8 octet sequence. Since we're using apsw, not the sqlite3 module, I'm inclined to say that maybe we should just require all column names to be Unicode strings -- represented in sqlite3 by valid UTF-8 octet sequences -- and to do whatever work is needed to make sure that works consistently everywhere.
It is possible that someone might make a database, outside our purview, that causes this to fail -- but the same is true anyway in more ways than we can count. It is possible that someone made such a database using bayeslite before we switched to apsw, but I doubt it -- and in any case it would have caused failures with the sqlite3 module then anyway.
Recommendation, to go along with wanting to be py3-compatible: let pandas handle this, and stop supporting direct CSV reads?