roskakori / cutplace

validate data stored in CSV, PRN, ODS or Excel files
http://cutplace.readthedocs.org/
GNU Lesser General Public License v3.0

Opening a data file in UTF-16 with rows containing the € symbol results in a ValidationError #112

Closed michele-deus closed 8 years ago

michele-deus commented 8 years ago

An example error (cut and pasted from the HTML output of a Django view):

Validation Error: (R2875C1): cannot write data row: 'ascii' codec can't encode character u'\u20ac' in position 35: ordinal not in range(128); row=[u'B004', u'IDX', u'', u'HFRI Equity Index - NET (\u20ac)', u'BMK', u'Benchmark', u'P', u'Performance', u'AZ', u'20151230', u'1465.50016373', u'', u'EUR']

The code for opening the file is:

datafile = io.open(filename, mode='r', encoding=encoding, newline=line_delimiter)

The encoding is taken from the CID and is UTF-16. To validate, I write row by row with a cutplace.Writer to a BytesIO. For every row I do:

for line in datafile:
    line = line.strip(line_delimiter)
    raw_row = line.split(delimiter)
    row = []
    for e in raw_row:
        row.append(e.strip())
    try:
        writer.write_row(row)
    except cutplace.errors.CutplaceError as err:
        log.append((
            "at line %s: Validation Error: %s" % (row_nr, err),
            "(at line %s): %s" % (row_nr, line),
        ))
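The reading part of this loop can be sketched on its own, without cutplace, to confirm that the decoded rows already hold the € symbol as text before any writer sees them. This is only an illustration: the function name `read_rows`, the `;` delimiter and the sample data are assumptions, not taken from the report.

```python
import io
import os
import tempfile


def read_rows(path, encoding, delimiter=";", line_delimiter="\n"):
    """Yield (row number, list of stripped items) for each line of a text file."""
    # io.open decodes the bytes, so every item is text rather than encoded bytes.
    with io.open(path, mode="r", encoding=encoding, newline=line_delimiter) as datafile:
        for row_nr, line in enumerate(datafile, start=1):
            line = line.rstrip(line_delimiter)
            yield row_nr, [item.strip() for item in line.split(delimiter)]


# Hypothetical sample data, written as UTF-16 to mimic the file from the report.
sample = u"B004; IDX ;HFRI Equity Index - NET (\u20ac)\nB005;IDX;Other\n"
handle, sample_path = tempfile.mkstemp(suffix=".csv")
os.close(handle)
try:
    with io.open(sample_path, mode="w", encoding="utf-16", newline="\n") as sample_file:
        sample_file.write(sample)
    rows = list(read_rows(sample_path, "utf-16"))
finally:
    os.remove(sample_path)
```

Because the items are already decoded text, whatever receives them next has to accept text as well; a byte-oriented buffer would force an encoding step somewhere.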

roskakori commented 8 years ago

@michele-deus : What happens if you write to io.StringIO instead of io.BytesIO? Does it raise the same error?

michele-deus commented 8 years ago

I was writing to cStringIO. Now, writing to BytesIO says: 'unicode' does not have the buffer interface

Writing to io.StringIO works. So the problem must be related to cStringIO.
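The difference between the text and byte buffers can be reproduced without cutplace. The following is a minimal sketch, assuming Python 3 (where cStringIO no longer exists, so the exact error messages differ from the Python 2 ones quoted above); the variable names are only illustrative.

```python
import io

text = "HFRI Equity Index - NET (\u20ac)"

# io.StringIO is a text stream, so it stores the euro sign without encoding it.
text_buffer = io.StringIO()
text_buffer.write(text)

# io.BytesIO only accepts bytes; handing it text raises TypeError.
byte_buffer = io.BytesIO()
try:
    byte_buffer.write(text)
    bytes_rejected_text = False
except TypeError:
    bytes_rejected_text = True

# ASCII cannot represent the euro sign, which is the failure behind the
# "'ascii' codec can't encode character" message in the report.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly with a suitable codec lets the byte buffer take the data.
byte_buffer.write(text.encode("utf-16"))
```

The design point is to keep decoded text in a text stream, and to encode explicitly with a codec that covers the data whenever bytes are really needed.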

roskakori commented 8 years ago

Yes, in my experience, any StringIO that is not io.StringIO is a recipe for disaster.

I take it your immediate issue is solved?

Anyway, if I understand your use case correctly, you are using cutplace to validate data that are already in a nice and cosy Python list. You are validating by writing to a StringIO just because the API does not yet provide any sensible way to validate in-memory data and insists on a file-like object.

Am I correct? If so, it might be worthwhile to add some functionality to validate data outside of file-like objects.

michele-deus commented 8 years ago

Yep, I resolved my issue. In fact, it's a file I'm reading and then writing to the io.StringIO. I'm doing it this way because I prefer to validate line by line instead of validating the whole file, though that probably reflects my limited knowledge of your library.

roskakori commented 8 years ago

Thanks for the confirmation, closing this.

If you just want to validate and read a file line by line, use cutplace.rows, for example:

import cutplace

for row in cutplace.rows('some_cid.ods', 'some_data.csv'):
    pass  # ...or do something with `row`.