Add unit test covering Unicode

sbesson commented 4 years ago

b45e95c exposes a Python 3.6 regression when adding a StringColumn containing Unicode. The same scenario passes without issue on Python 2.

sbesson commented 4 years ago

The numpy.dtype note about using strings in Python 3 is probably relevant to the root of this problem. Unfortunately, local attempts to migrate StringColumn.dtypes() from S to U have been unsuccessful.
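For context, a rough stdlib-only sketch of the bytes/str distinction that sits behind the S (byte string) and U (Unicode string) dtype codes on Python 3. It does not touch numpy or the omero-py code, and the helper name `fits_ascii_bytes` is hypothetical:

```python
# Stdlib-only illustration; not the StringColumn implementation.
# On Python 3, text (str) and raw bytes are distinct types, so non-ASCII
# text needs an explicit encoding before it can be stored as bytes.


def fits_ascii_bytes(value: str) -> bool:
    """Hypothetical helper: True if value can be stored as ASCII bytes."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


ascii_value = "foo"
unicode_value = "r\u00e9sum\u00e9"  # contains two non-ASCII characters

print(fits_ascii_bytes(ascii_value))
print(fits_ascii_bytes(unicode_value))

# Encoding to UTF-8 works, but the byte length can exceed the number of
# characters, which matters for any fixed-width byte-string storage.
encoded = unicode_value.encode("utf-8")
print(len(unicode_value), len(encoded))
```

If something along these lines is what happens on write, a column width computed from the character count may be too small once values are encoded, which could be one of the things a migration from S to U (or an explicit encoding step) has to account for.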

An earlier demo on an IDR instance upgraded to an experimental Python 3 environment seems to suggest that reading a StringColumn created on Python 2 with Unicode characters is unaffected:

[Image: Screen Shot 2019-12-02 at 16 19 22]

I do not expect to have the capacity to provide a fix for this regression for OMERO 5.6.0. There is a question of whether this should be marked as a blocker for GA. It is certainly one for the upgrade of IDR to Python 3, as it breaks the annotation workflows if CSV files contain Unicode characters.

As immediate next steps, proposing to:

potentially extract the Unicode part of this test and mark it as xfail
apply the same approach to the integration test in ome/openmicroscopy#6189
get this merged and capture the regression as an issue to be reviewed for OMERO 5.6.0
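A minimal sketch of what splitting out the Unicode case and marking it xfail could look like with pytest. The function `to_ascii_bytes` is a hypothetical stand-in for the failing write path, and the test names are illustrative rather than taken from the omero-py test suite:

```python
import pytest


def to_ascii_bytes(value: str) -> bytes:
    """Hypothetical stand-in for the failing write path (not omero-py code)."""
    return value.encode("ascii")


def test_ascii_string_roundtrip():
    # The ASCII case keeps running as a normal test.
    assert to_ascii_bytes("foo").decode("ascii") == "foo"


@pytest.mark.xfail(
    reason="Unicode in StringColumn fails on Python 3",
    strict=False,
)
def test_unicode_string_roundtrip():
    # Expected to fail until the regression is fixed; pytest reports it as
    # xfailed instead of failing the whole run.
    value = "r\u00e9sum\u00e9"
    assert to_ascii_bytes(value).decode("ascii") == value
```

With `strict=False`, the test would show up as XPASS rather than as a failure once a fix lands, which is a prompt to remove the marker.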

Alternate thoughts or suggestions welcome /cc @joshmoore @jburel @manics

joshmoore commented 4 years ago

Something I haven't really considered yet: would a UnicodeColumn be of use?

manics commented 4 years ago

What would be the difference between a StringColumn with unicode and a UnicodeColumn?

joshmoore commented 4 years ago

It would be a location that could hold different read/write logic, if that would help.

sbesson commented 4 years ago

Superseded by #143

ome / omero-py

Add unit test covering Unicode #133