psychoinformatics-de / datalad-tabby

DataLad extension package for the "tabby" dataset metadata specification

Guess encoding if default does not work #114

Closed: mslw closed this 11 months ago

mslw commented 11 months ago

If reading a TSV file with the default encoding fails, roll out a cannon (charset-normalizer) and try to guess the encoding to use.

By default, Path.open() uses locale.getencoding() when reading a file, which means we implicitly use UTF-8, at least on Linux. This would fail when reading files with non-ASCII characters prepared (with not-uncommon settings) on Windows. There is no perfect way to learn the encoding of a plain text file, but existing tools seem to do a good job.
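For context, the charset-normalizer API needed for this is small. A minimal sketch of a guessing helper (the name `guess_encoding` and the UTF-8 fallback are my own choices for illustration, not code from this PR):

```python
from pathlib import Path

from charset_normalizer import from_path


def guess_encoding(path: Path) -> str:
    """Return the most plausible encoding for a text file."""
    best = from_path(path).best()
    # best() returns None if charset-normalizer cannot make any guess;
    # fall back to utf-8 in that case (an arbitrary choice for this sketch)
    return best.encoding if best is not None else "utf-8"
```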

This PR refactors the tabby loader, introducing _parse_tsv_single and _parse_tsv_many functions that take an optional encoding argument, which allows us to use a guessed encoding (but only after the default fails). Closes #112
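The structure described could look roughly like the sketch below: the parse function accepts an optional encoding, and the caller first tries the default and only guesses on a UnicodeDecodeError. This is not the actual loader code; the row handling is heavily simplified, `_load_single` is a hypothetical caller name, and `guess_encoding()` is the helper sketched above.

```python
import csv
from pathlib import Path


def _parse_tsv_single(path: Path, encoding: str | None = None) -> dict:
    """Read a single-record tabby TSV file into a key -> value mapping (simplified)."""
    record = {}
    # encoding=None means "use the platform default", just like plain open()
    with path.open(newline="", encoding=encoding) as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue
            key, *values = row
            record[key] = values[0] if len(values) == 1 else values
    return record


def _load_single(path: Path) -> dict:
    # first attempt with the default (locale-determined) encoding ...
    try:
        return _parse_tsv_single(path)
    except UnicodeDecodeError:
        # ... and only guess if that fails
        return _parse_tsv_single(path, encoding=guess_encoding(path))
```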

mslw commented 11 months ago

Updated the try/except, agree that it looks cleaner.

Unfortunately, I just got a reminder that a guess is only a guess, and a dataset authors table is a hard case, as there may be only a single non-ASCII character in the entire file. Out of the two files I had (both with a German encoding), one got misclassified and an (intended) "ü" was misread. Maybe what I need is a manual specification (or cp1250/1252 as a second priority)...

mslw commented 11 months ago

I've added the possibility to specify the encoding as an argument to load_tabby. However, building on top of the previously proposed logic, this led to "if encoding is given, use it; else try the default, then guess", and the code seems fairly ugly to me.
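Paraphrased (not the actual code, and reusing the hypothetical helpers sketched earlier), the branching in question ends up looking roughly like this, which is where the awkwardness comes from:

```python
from pathlib import Path


def _load_single(path: Path, encoding: str | None = None) -> dict:
    if encoding is not None:
        # caller declared the encoding -- use it, no fallback
        return _parse_tsv_single(path, encoding=encoding)
    try:
        # otherwise try the platform default first ...
        return _parse_tsv_single(path)
    except UnicodeDecodeError:
        # ... and only then resort to guessing
        return _parse_tsv_single(path, encoding=guess_encoding(path))
```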

Given my 50/50 success with guessing on the two files that prompted this PR (presumably because there were too few non-ASCII characters for a good guess), I now think that an explicit declaration is more useful than guessing, so I'll close this PR and propose a new one that just adds the encoding argument.