lmaurits / BEASTling

A linguistics-focussed command line tool for generating BEAST XML files.
BSD 2-Clause "Simplified" License
20 stars 6 forks source link

Fix encoding issue in datareaders.sniffer #228

Closed Anaphory closed 4 years ago

Anaphory commented 5 years ago

Fixes #227

xrotwang commented 5 years ago

I recently came across a similar problem with the csv sniffer. It's really annoying how difficult cross-platform csv manipulation seems to be.

Anaphory commented 5 years ago

This was also affected by #215 so I pushed the same fix that helped there.

Anaphory commented 5 years ago

The latest commit fails in Python2.7, because the dialect

(Pdb) vars(self.kw['dialect'])
{'__module__': 'csv', 'lineterminator': '\r\n', 'skipinitialspace': False, 'quoting': 0, 'encoding': 'ascii', '_name': 'sniffed', 'delimiter': u',', 'quotechar': u'"', '__doc__': None, 'doublequote': False}

passed to csv.readerpython2.7/site-packages/csvw/dsv.py:

224  ->         self.reader = csv.reader(self.f, **self.kw)

contains unicode objects for delimiter and quotechar, and that is apparently not allowed. I don't know how to fix that.

xrotwang commented 5 years ago

csvw.dsv handles that somewhat. But I'd think at this point, dropping py2 may be more reasonable.

Am Fr., 23. Nov. 2018, 15:15 hat Gereon Kaiping notifications@github.com geschrieben:

The latest commit fails in Python2.7, because the dialect

(Pdb) vars(self.kw['dialect']) {'module': 'csv', 'lineterminator': '\r\n', 'skipinitialspace': False, 'quoting': 0, 'encoding': 'ascii', '_name': 'sniffed', 'delimiter': u',', 'quotechar': u'"', 'doc': None, 'doublequote': False}

passed to csv.reader – python2.7/site-packages/csvw/dsv.py:

224 -> self.reader = csv.reader(self.f, **self.kw)

contains unicode objects for delimiter and quotechar, and that is apparently not allowed. I don't know how to fix that.

— You are receiving this because your review was requested. Reply to this email directly, view it on GitHub https://github.com/lmaurits/BEASTling/pull/228#issuecomment-441251487, or mute the thread https://github.com/notifications/unsubscribe-auth/AA1HKHIKW2iV3nX47DHLfAm2bqJbAKo_ks5uyAMUgaJpZM4Ywk-e .

lmaurits commented 5 years ago

The dialect object we get back from the sniffer now has an encoding attribute, thanks to Gereon's recent commit, right? Can't we just use that to encode the delimiter and quote char before handing it to the reader? Or is the dialect immutable?

xrotwang commented 5 years ago

I think we don't actually need the encoding to be used - the relevant entries are supposed to be ASCII encoded strings anyway. But yes, the dialect object would be the place to encode these things for PY2.

Anaphory commented 5 years ago

Or is the dialect immutable?

If the dialect was immutable, I could not actually stuff an encoding into it :)

It should really just be converting some u',' to ',' in Py2 or something very similar, but I was not sure yet why it happened in the first place and where the best place to fix in in the most canonical way would be.

xrotwang commented 5 years ago

AFAICT this PR can be closed. @Anaphory agree?