ome / omero-scripts

Core OMERO Scripts
https://pypi.org/project/omero-scripts/

Adding Support for different CSV Encodings in Import_Scripts/Populate_Metadata.py #198

Open JulianHn opened 2 years ago

JulianHn commented 2 years ago

This draft adds support for CSV encodings other than utf-8 to populate_metadata.py. This is relevant e.g. when the Populate_Metadata.py script distributed with OMERO is used with a CSV file exported from Excel with default settings, which are cp1252 for US and iso-8859-1 for EU system locales.

It requires merging https://github.com/ome/omero-py/pull/325 in omero-py, which adds support for this in the imported omero.util.populate_roi module. If that gets merged, support for different encodings can be added simply by adding a new string input field to the script_params that contains the file encoding and defaults to utf-8 (i.e. legacy behaviour).
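For illustration, a minimal sketch of what such a script parameter could look like (the parameter name "File_Encoding" and the grouping value are assumptions for this sketch, not necessarily the final names used in the PR):

import omero.scripts as scripts

# Hypothetical extra input for Populate_Metadata.py: a free-text encoding
# name that is passed through to the file provider and defaults to utf-8
# so existing workflows keep their current behaviour.
encoding_param = scripts.String(
    "File_Encoding", optional=False, grouping="3",
    description="Character encoding of the CSV file,"
                " e.g. utf-8, cp1252, iso-8859-1",
    default="utf-8")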

Tests for cp1252 and iso-8859-1, the default encodings for US and EU Excel CSV exports, have been added as well.

// Julian

will-moore commented 2 years ago

Thanks for this contribution. I'm just wondering if there's a reliable way to make this easier for users, so that they don't have to know which encoding to choose (and maybe to give them a nice error message if they need to choose a different encoding). I've not tried this yet, so I haven't looked to see where it fails if you use the wrong encoding. But if the script can check whether the correct encoding was used immediately after reading the data, and then give the user a message (instead of failing later), that would be nice. The default behaviour (if a user doesn't pick an encoding) could even be to try a handful of the most common encodings, in order of prevalence (e.g. utf-8, latin-1, cp1252...), and use the first one that "works"?

Does this sound feasible?

JulianHn commented 2 years ago

Hey @will-moore,

There are certainly options to do that by importing chardet, which has a function for guessing the encoding. This is obviously not 100% guaranteed, but in my experience it works well for the most common encodings.

The problem with trying encodings is that, with the exception of .decode("utf-8"), I can't get the other encodings to reliably throw errors when the wrong encoding is used; they just return nonsensical strings. If this sounds reasonable, I can certainly add this to the modified DownloadingOriginalFileProvider and amend #325 over in omero-py
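To illustrate the asymmetry (not part of the PR, just showing standard Python behaviour):

>>> b'54\xe4m'.decode('utf-8')      # invalid UTF-8 sequence -> error
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 2: invalid continuation byte
>>> b'54\xe4m'.decode('latin-1')    # always "succeeds", right or wrong
'54äm'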

will-moore commented 2 years ago

I wasn't aware of chardet - looks handy. I'm not sure we'd want to make it a dependency of omero-py and force all users to install it (although it doesn't seem to have any dependencies of its own)? cc @joshmoore @chris-allan. If we don't want to make it required, then we could put the import in a try/except.

One option for users to choose "auto-detect" encoding would be to use encoding=None in def get_original_file_data(self, original_file, encoding="utf-8"), and then raise an Exception if chardet hasn't been imported?
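A sketch of that idea (written as a standalone function for illustration rather than the actual DownloadingOriginalFileProvider method; the behaviour for encoding=None is an assumption about how #325 could be extended):

try:
    import chardet
except ImportError:
    chardet = None

def decode_original_file(raw_bytes, encoding="utf-8"):
    # Sketch: encoding=None means "auto-detect"; that only works when the
    # optional chardet package is installed, otherwise fail loudly instead
    # of silently picking a possibly wrong codec.
    if encoding is None:
        if chardet is None:
            raise Exception(
                "Auto-detection requested but chardet is not installed")
        encoding = chardet.detect(raw_bytes)["encoding"]
    return raw_bytes.decode(encoding)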

JulianHn commented 2 years ago

We could also check in populate_metadata.py whether the module is available by try/except importing chardet, and modify the Description displayed when opening the script in OMERO.web/Insight, as is already done for the outdated metadata plugin detection. I guess that would make it even easier for users who are not familiar with reading the stderr/stdout outputs?

So it would be something like:

Check if chardet is available. If yes, set the encoding field to "None"/"Detect" etc.; if not, set the encoding field to "utf-8" (to ensure that the current behaviour is replicated) and display a warning in the description that for non-utf-8-encoded files an encoding has to be explicitly specified or chardet needs to be installed?
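As a rough sketch of that flow (variable names and the warning text are illustrative only):

try:
    import chardet
    HAS_CHARDET = True
except ImportError:
    HAS_CHARDET = False

# Sketch: adjust the script's encoding input and description depending on
# whether chardet is available on the server.
default_encoding = "Detect" if HAS_CHARDET else "utf-8"
description = "Populate metadata from a CSV file."
if not HAS_CHARDET:
    description += ("\nWarning: chardet is not installed on the server, so "
                    "non-utf-8 CSV files need an explicitly chosen encoding.")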

will-moore commented 2 years ago

That sounds like a good solution to me. But best to hear any vetoes from Josh or Chris before committing (and maybe Seb - who is away till Thursday).

chris-allan commented 2 years ago

Detecting encodings is notoriously error prone and time consuming. ^1 I'm not averse to adding chardet as a dependency, but it has to be used intelligently and sparingly in order not to cause more problems than doing no detection at all. Selecting an incorrect encoding and inserting subtly corrupt data into OMERO is worse than failing spectacularly early. At a minimum I think an implementation would not default to using chardet but rather add it as an option, and apply it to a reasonable (possibly configurable) number of rows rather than the whole file.
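For example, a sampling-based detection along these lines (the function name and default row count are assumptions):

import chardet

def detect_encoding_from_sample(raw_bytes, max_rows=1000):
    # Sketch: run chardet only over the first max_rows lines instead of the
    # whole (possibly multi-million-row) file.
    sample = b"\n".join(raw_bytes.splitlines()[:max_rows])
    result = chardet.detect(sample)
    return result["encoding"], result["confidence"]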

JulianHn commented 2 years ago

Hey @chris-allan ,

I agree with using it sparingly and probably even making it optional in the first place, as well as selecting a sensible number of rows.

The problem with "Selecting an incorrect encoding and inserting subtly corrupt data into OMERO is worse than failing spectacularly early" is that AFAICS there is no way to make wrong-encoding imports reliably fail spectacularly. UTF-8 decoding might fail if by chance an invalid byte combination is used, but decoding with latin-1 or cp1252 will always "work" (even if the result is nonsense), since those codecs do not throw errors. So the problem of importing subtly corrupt data into OMERO already exists, and guessing the encoding might at least be better than the current situation. An option for letting those imports fail in at least some cases might be to look for bytes that do not map to printable characters (e.g. 7F-9F in latin-1) or to check for the occurrence of some of the control bytes (00 to 1F) that should not be used in "normal" CSV files. But of course this also won't catch all cases.
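A sketch of such a sanity check (the byte ranges follow the suggestion above; this is a heuristic, not a guarantee):

def looks_suspicious_as_text(raw_bytes):
    # Sketch: bytes 0x7F-0x9F decode to control characters in latin-1, and
    # control bytes 0x00-0x1F (except tab/newline/carriage return) should
    # not appear in a normal CSV file, so their presence hints at a wrong
    # encoding or a binary file.
    allowed_controls = {0x09, 0x0A, 0x0D}
    return any(
        (b < 0x20 and b not in allowed_controls) or 0x7F <= b <= 0x9F
        for b in raw_bytes)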

chris-allan commented 2 years ago

So the problem of importing subtly corrupt data into OMERO already exists, and guessing the encoding might at least be better than the current situation.

Definitely not disputing that. Guessing encodings and having any assumed default is fraught with all sorts of problems. My thinking has always been that defaulting to UTF-8 is going to fail spectacularly in the most incorrect scenarios. Whether those scenarios overlap with your use cases and make sense to you is a dice roll.

We have seen so many crazy use cases dealing with delimited text input containing no authoritative statement of encoding, and they don't just differ because of locale. Excel on Windows and Excel on macOS can behave very, very differently in how they handle export to CSV/TSV, and don't get me started on byte order marks ^1, line endings or quotation.

What I very much do not want to do here is create a scenario where we normalize encoding autodetection for users, especially if it has a high likelihood of "fallback" to latin-1 / ISO-8859-1 or CP-1252 because the first n rows of the CSV happen not to contain a μ but a following row does. That would be a disaster for us. I'm not experienced enough with chardet to comment on the likelihood of such a scenario. It also cannot create performance problems for analytical use cases: we regularly deal with CSV input containing millions of rows, as well as rows that contain several kilobytes of text.

chris-allan commented 2 years ago

A 60-second attempt at testing chardet on an example of the type of scientific input we all regularly see:

>>> import chardet
>>> a = 'nucleus is 54μm in diameter'.encode('utf-8')
>>> chardet.detect(a)
{'encoding': 'ISO-8859-1', 'language': '', 'confidence': 0.73}
>>> a.decode('iso-8859-1')
'nucleus is 54Î¼m in diameter'
>>> a.decode('utf-8')
'nucleus is 54μm in diameter'

That's IMHO just simply an unacceptable outcome. Could certainly be my naïve use or maybe chardet is just not the right solution for scientific data.

JulianHn commented 2 years ago

That's IMHO just simply an unacceptable outcome. Could certainly be my naïve use or maybe chardet is just not the right solution for scientific data.

Ugh ... I agree. That is really unacceptable. Usually chardet gets better the more data it gets (since it's based on some language heuristics), but especially for more "technical" metadata that does not contain significant "natural language" parts, it probably is not guaranteed to find a lot to work with.

Okay, so to summarize my understanding of your points and the testing results: supporting multiple encodings is good, but it is probably better to force the user to specify the encoding (while keeping the default at utf-8), since we would either run into performance problems by having to scan millions of rows (which indeed takes quite a while with chardet in my experience) while probably still not guessing correctly in >95% of cases, or risk even more problems with "corrupted" data.

If we can agree on this, I would just add some error handling for UnicodeDecodeError, as well as at least a check of latin-1 input for bytes that should not appear, to let the script fail with some nicer errors.
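Roughly along these lines (a minimal sketch; the function name and message text are just examples):

def decode_csv(raw_bytes, encoding="utf-8"):
    # Sketch: turn the low-level decode failure into a message that points
    # the user at the likely cause instead of a bare stack trace.
    try:
        return raw_bytes.decode(encoding)
    except UnicodeDecodeError as e:
        raise ValueError(
            "Could not read the CSV file using encoding '%s' (%s). "
            "Please re-export the file as utf-8 or select the correct "
            "encoding in the script dialog." % (encoding, e))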

chris-allan commented 2 years ago

Supporting multiple encodings is good, but it is probably better to force the user to specify the encoding (while keeping the default at utf-8), since we would either run into performance problems by having to scan millions of rows (which indeed takes quite a while with chardet in my experience) while probably still not guessing correctly in >95% of cases, or risk even more problems with "corrupted" data.

:+1:

JulianHn commented 2 years ago

So ... as this got kind of buried in my to-do list and we needed to figure out the details of open-source contribution licensing at the institute, here is, with a long delay, finally the script with support for different CSV encodings.

As discussed, we require the user to specify the encoding and do not rely on any detection mechanisms. The script automatically checks whether the omero.util.populate_roi.DownloadingOriginalFileProvider.get_original_file_data() function supports an encoding argument. If there is no support, the script looks as it did before, with a message added as already done for the omero-metadata plugin check. If support is available, an additional field is added to the script that allows the user to select an encoding, still defaulting to utf-8. The list of encodings is automatically queried from the encodings installed on the server.

I've also added a try/except around the respective part of the code, which produces a hopefully clearer error message that may help the user find the problem more quickly. This will still not catch all problems (e.g. latin-1 is designed in a way that it will always decode without an error, even if there are symbols not in the codepage), but this is a design choice and AFAICS not solvable from our side.
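One way such a capability check can be written (a sketch of the idea, not necessarily the exact code in the PR):

import inspect
from omero.util.populate_roi import DownloadingOriginalFileProvider

def provider_supports_encoding():
    # Sketch: if the installed omero-py already has the new "encoding"
    # argument from ome/omero-py#325, the script can offer the extra
    # encoding field; otherwise it falls back to the old behaviour.
    params = inspect.signature(
        DownloadingOriginalFileProvider.get_original_file_data).parameters
    return "encoding" in params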

Furthermore, the tests now check all encodings and expect either a successful import or at least a clean exit. I cannot figure out how to set up a correct testing environment for the integration tests, so this code has only been tested manually, not with pytest.
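Just to illustrate the shape of such a check, a self-contained stand-in (not the actual integration test; the sample text and encoding list are only examples):

import pytest

# \u00b5 (micro sign) is representable in all three encodings below.
CSV_TEXT = u"image,description\nimg_1,54 \u00b5m nucleus\n"

@pytest.mark.parametrize("encoding", ["utf-8", "cp1252", "iso-8859-1"])
def test_roundtrip_encoding(encoding, tmp_path):
    # Sketch: write a small CSV in the given encoding and check that it
    # decodes back correctly when the matching encoding is specified.
    path = tmp_path / "test.csv"
    path.write_bytes(CSV_TEXT.encode(encoding))
    assert path.read_bytes().decode(encoding) == CSV_TEXT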

Accompanying changes have also been made in https://github.com/ome/omero-py/pull/325.

// Julian

JulianHn commented 2 years ago

I'm afraid I cannot figure out from the GitHub Actions log why the checks are failing. Any ideas what needs fixing?

will-moore commented 2 years ago

@JulianHn I'm afraid that the flake8 checks somehow got missed when https://github.com/ome/omero-scripts/pull/193 was merged (adding a bunch of new scripts) and they are now failing in other PRs.

I think they are all fixed in https://github.com/ome/omero-scripts/pull/195 which is awaiting testing/merge. I could fix them again in a separate PR: it would just mean subsequently fixing all the merge conflicts with that branch!

JulianHn commented 2 years ago

@will-moore Ah ... That explains it, thanks for the heads-up. No worries about fixing them separately, I just did not assume that the action would outright fail because of flake8 warnings and was confused about what caused the failure.

will-moore commented 2 years ago

Now that #195 is merged, if you can merge origin/develop branch into this, that should help get the build green.