itkach / slob

Data store for Aard 2
GNU General Public License v3.0

Converting dictCC csv fails #6

Closed (Snaptags closed this issue 9 years ago)

Snaptags commented 9 years ago

I'm trying to create a slob file containing dictCC dictionary data. My "converter" looks like this:

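# -*- coding: utf-8 -*-
import csv
import sys

import slob

# OUTPUT_FILE and PLAIN_TEXT were not defined in the posted snippet;
# the values below are placeholders.
OUTPUT_FILE = 'dictcc.slob'
PLAIN_TEXT = 'text/plain; charset=utf-8'

with slob.create(OUTPUT_FILE) as w:
    with open(sys.argv[1], newline='') as csvfile:
        fieldnames = ['key', 'value']
        # Skip comment lines that start with '#'.
        dictreader = csv.DictReader(filter(lambda row: row[0] != '#', csvfile), delimiter='\t', quotechar='"', fieldnames=fieldnames, restkey='restkey')

        for row in dictreader:
            if row['key']:
                # Columns beyond 'value' are collected under 'restkey'.
                extra = ', '.join(row.get('restkey', []))
                w.add((str(row['value']) + extra).encode('utf-8'),
                      row['key'], content_type=PLAIN_TEXT)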

This works fine until I try to add a line with a German umlaut in the 'value' column; slob.py then raises a ValueError exception.

Maybe you could add more complex examples to the documentation, to enable Python beginners to use slob.add? :-)

Any ideas on how to solve the problem?

Snaptags commented 9 years ago

P.S. If I convert the input file to ANSI everything seems to work!
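
The P.S. above is consistent with the input file being decoded with a codec other than the one it was written in. The following is a minimal sketch of that effect, assuming "ANSI" here means Windows-1252 (cp1252), which the thread does not state. The sample word is purely illustrative, and the sketch does not claim to reproduce the exact ValueError from slob.py.

```python
import locale

# Codec that open() falls back to when no encoding argument is given.
default_codec = locale.getpreferredencoding(False)

word = "Mädchen"  # illustrative sample, not taken from the dictCC data
utf8_bytes = word.encode("utf-8")
ansi_bytes = word.encode("cp1252")  # assumption: "ANSI" means cp1252

# UTF-8 bytes decoded with a single-byte Windows codec turn into mojibake.
misread = utf8_bytes.decode("cp1252")

# Decoding with the codec the bytes were written in round-trips cleanly.
ansi_roundtrip = ansi_bytes.decode("cp1252")
utf8_roundtrip = utf8_bytes.decode("utf-8")

print("default codec:", default_codec)
print("misread:", repr(misread))
```

If the default codec reported on the machine is not UTF-8, a UTF-8 input file needs an explicit encoding argument when it is opened.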

itkach commented 9 years ago

From https://docs.python.org/3.4/library/csv.html?highlight=csv#module-csv :

Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
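
Applied to the converter from the question, the same idea might look like the sketch below. It assumes the dictCC export is UTF-8 encoded (the thread implies this but does not say so), and `read_dictcc` is a hypothetical helper name, not part of slob. The slob writer is left out so that the snippet runs with the standard library alone.

```python
import csv


def read_dictcc(path, encoding="utf-8"):
    """Yield (key, content_bytes) pairs from a tab-separated dictCC export.

    Hypothetical helper; assumes the file is encoded as `encoding`.
    """
    # The explicit encoding avoids relying on the system default codec.
    with open(path, newline="", encoding=encoding) as csvfile:
        # Drop comment lines that start with '#'.
        lines = (line for line in csvfile if not line.startswith("#"))
        reader = csv.DictReader(
            lines,
            delimiter="\t",
            quotechar='"',
            fieldnames=["key", "value"],
            restkey="restkey",
        )
        for row in reader:
            if row["key"]:
                # Columns beyond 'value' are collected under 'restkey'.
                extra = ", ".join(row.get("restkey") or [])
                content = (row["value"] or "") + extra
                yield row["key"], content.encode("utf-8")
```

Each yielded pair could then be passed to the writer the same way the original snippet does, i.e. `w.add(content, key, content_type=PLAIN_TEXT)` inside `with slob.create(OUTPUT_FILE) as w:`.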

Snaptags commented 9 years ago

Thanks!