psychoinformatics-de / datalad-tabby

DataLad extension package for the "tabby" dataset metadata specification

Load-tabby is locale-dependent w.r.t file encoding #112

Open mslw opened 1 year ago

mslw commented 1 year ago

load_tabby (more specifically, TabbyLoader) reads a file using:

with src.open(newline='') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')

This means that the file is read using the system default encoding (ref). This can cause problems, e.g. when reading an ISO-8859-1-encoded file (presumably generated on Windows) on a Linux machine (where UTF-8 is the most likely default).
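A minimal sketch of the locale-independent alternative: passing `encoding=` to `open()` explicitly (here assuming UTF-8 input) instead of relying on `locale.getpreferredencoding()`. The file name and contents are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Write a small TSV containing a non-ASCII character (µ) as UTF-8.
tmp = Path(tempfile.mkdtemp()) / "sample.tsv"
tmp.write_bytes("dose\t2 µg/gram of weight\n".encode("utf-8"))

# An explicit encoding makes the read independent of the
# locale-derived default that open() would otherwise use.
with tmp.open(newline="", encoding="utf-8") as tsvfile:
    rows = list(csv.reader(tsvfile, delimiter="\t"))

print(rows[0][1])  # → 2 µg/gram of weight
```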

Reproducing: I encountered the problem when loading a dataset tabby file that contained a phrase "2 µg/gram of weight" in its description, and was saved in ISO-8859-1 encoding (as reported by file -i). Loading crashed with:

  File "/home/mszczepanik/Documents/datalad-tabby/datalad_tabby/io/load.py", line 101, in _load_single
    for row_id, row in enumerate(reader):
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 298: invalid start byte
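The error can be reproduced without any file at all: ISO-8859-1 encodes µ as the single byte 0xb5, which is not a valid UTF-8 start byte.

```python
# "µ" is one byte (0xb5) in ISO-8859-1, but 0xb5 cannot begin a
# multi-byte sequence in UTF-8, hence the UnicodeDecodeError above.
data = "2 µg/gram of weight".encode("iso-8859-1")
assert b"\xb5" in data

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)  # → invalid start byte

# The same bytes decode fine once the right encoding is named:
assert data.decode("iso-8859-1") == "2 µg/gram of weight"
```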

This happened in a data submission / data curation context. Personally, I don't mind treating this as a user error: either saving in ISO-8859-1 to begin with, or not checking the file encoding before proceeding (in the end, I converted the file with iconv).

I am not sure what the fix would be here, if any. Adding an encoding parameter to the loader, and exposing it through the API, would complicate loading and would still require me to check the encoding upfront. Guesswork with tools like chardet or libmagic might be possible, but as far as I understand it can't be perfect either.
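One possible heuristic, sketched below with a hypothetical helper (`read_text_lenient` is not part of datalad-tabby): try a short list of candidate encodings in order. Note the caveat that makes this guesswork, as mentioned above: ISO-8859-1 maps every byte to some character, so using it as a fallback never raises, and a wrong guess passes silently.

```python
import tempfile
from pathlib import Path

def read_text_lenient(src, encodings=("utf-8", "iso-8859-1")):
    """Hypothetical helper: return the text of ``src`` decoded with the
    first encoding in ``encodings`` that succeeds.

    ISO-8859-1 as a last resort never fails (every byte is a valid
    character), so mis-detection cannot be ruled out.
    """
    data = Path(src).read_bytes()
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {src}")

# A file saved as ISO-8859-1 now loads without crashing:
tmp = Path(tempfile.mkdtemp()) / "desc.tsv"
tmp.write_bytes("2 µg/gram of weight".encode("iso-8859-1"))
print(read_text_lenient(tmp))  # → 2 µg/gram of weight
```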