ome / omero-py

Python project containing Ice remoting code for OMERO
https://www.openmicroscopy.org/omero
GNU General Public License v2.0

Populate metadata should support encodings other than utf-8 #323

Open dominikl opened 2 years ago

dominikl commented 2 years ago

populate_metadata.py assumes that the csv files are encoded with utf-8 and fails if that's not the case. Maybe there should be an option to specify the encoding.

See https://forum.image.sc/t/populate-metadata-py-and-non-utf-8-csvs/64595

dominikl commented 2 years ago

After a quick glance, I can't really see how populate_metadata.py is affected, as the code suspected of causing the issue is in populate_roi.py.

imagesc-bot commented 2 years ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/populate-metadata-py-and-non-utf-8-csvs/64595/2

will-moore commented 2 years ago

@dominikl It's maybe not the most natural place for the code to live, but Populate_Metadata.py does import it:

https://github.com/ome/omero-scripts/blob/14c830099efe1f0d6b32a2b3914febd8ddcd89ea/omero/import_scripts/Populate_Metadata.py#L33

JulianHn commented 2 years ago

@dominikl: Thanks for opening the issue again. Here is the diff of my modified populate_metadata.py, which defines its own FileProvider class to override the behaviour imported from omero.util.populate_roi. At the moment it falls back to a fixed latin-1 encoding when it hits a UnicodeDecodeError, but it could of course easily be modified to use an arbitrary encoding provided by the user. Furthermore, the logic for truncating the temp file had to be adjusted, since the re-encoded file no longer has the same size as the original when the source encoding is not utf-8.

I'm not sure whether it makes sense to redefine this within the metadata script, or whether the option should be made available directly within omero.util.populate_roi.DownloadingOriginalFileProvider.

5,6d4
< 
< 
15d12
< 
19d15
< 
21d16
< 
30c25,26
< 
---
> import tempfile
> from past.utils import old_div
55d50
< 
60a56,84
> class OwnFileProvider(DownloadingOriginalFileProvider):
>     
>     def get_original_file_data(self, original_file):
>         """
>         Downloads an original file to a temporary file and returns an open
>         file handle to that temporary file seeked to zero. The caller is
>         responsible for closing the temporary file.
>         """
> 
>         
>         self.raw_file_store.setFileId(original_file.id.val)
>         temporary_file = tempfile.NamedTemporaryFile(mode='rt+',
>                                                      dir=str(self.dir),
>                                                      encoding="utf-8-sig")
>         size = original_file.size.val
>         size_new = 0
>         for i in range((old_div(size, self.BUFFER_SIZE)) + 1):
>             index = i * self.BUFFER_SIZE
>             data = self.raw_file_store.read(index, self.BUFFER_SIZE)
>             try:
>                 data_write = data.decode("utf-8").rstrip('\0')
>                 size_new += len(data_write.encode("utf-8-sig"))
>             except UnicodeDecodeError:
>                 data_write = data.decode("latin-1").rstrip('\0')
>                 size_new += len(data_write.encode("utf-8-sig"))
>             temporary_file.write(data_write)            
>         temporary_file.seek(0)
>         temporary_file.truncate(size_new)
>         return temporary_file
122c146
<     provider = DownloadingOriginalFileProvider(conn)
---
>     provider = OwnFileProvider(conn)
198a223,224
> 
>
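
One caveat with decoding each buffer separately, as the diff above does: a multi-byte UTF-8 character that straddles a BUFFER_SIZE boundary raises UnicodeDecodeError, so that one chunk would be decoded as latin-1 even though the file is valid UTF-8. Also, each `encode("utf-8-sig")` call prepends a BOM, so the per-chunk size sum can exceed what is actually written. A minimal sketch of one way around both, using the standard library's incremental decoders and a user-supplied list of candidate encodings, could look like this. The names `download_as_text`, `download_with_fallback` and `read_chunk` are hypothetical; `read_chunk(offset, length)` stands in for `raw_file_store.read` and is assumed to return bytes. This is not the omero-py implementation.

```python
import codecs
import tempfile


def download_as_text(read_chunk, size, encoding, buffer_size=1024 * 1024):
    """Copy `size` bytes into a UTF-8 temporary text file, decoding as `encoding`.

    `read_chunk(offset, length)` is a hypothetical stand-in for
    raw_file_store.read and is assumed to return a bytes object.
    """
    # An incremental decoder buffers incomplete multi-byte sequences between
    # calls, so a chunk boundary cannot split a character.
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    temporary_file = tempfile.TemporaryFile(
        mode="w+t", encoding="utf-8", newline="")
    try:
        for offset in range(0, size, buffer_size):
            length = min(buffer_size, size - offset)
            # Slice defensively in case more bytes come back than requested.
            data = read_chunk(offset, length)[:length]
            temporary_file.write(decoder.decode(data))
        # final=True raises if the input ends in the middle of a character.
        temporary_file.write(decoder.decode(b"", final=True))
    except BaseException:
        temporary_file.close()
        raise
    temporary_file.seek(0)
    return temporary_file


def download_with_fallback(read_chunk, size,
                           encodings=("utf-8-sig", "latin-1"),
                           buffer_size=1024 * 1024):
    """Try each candidate encoding in order; return the first clean decode."""
    for encoding in encodings:
        try:
            return download_as_text(read_chunk, size, encoding, buffer_size)
        except UnicodeDecodeError:
            continue  # start again from offset 0 with the next candidate
    raise ValueError("none of %r could decode the file" % (encodings,))
```

Because only `size` bytes are decoded and the whole file is re-read when a candidate fails, no `truncate` call or re-encoded size bookkeeping is needed, assuming the store can be read again from offset 0. Note that latin-1 maps every byte to a character, so it never fails and only makes sense as the last entry in `encodings`.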