Closed: CristhianPerdomo closed this 1 year ago
Glad you like it!
Seems like the problem is coming from:

self._read(
    index["metadata_offset"], index["metadata_offset"] + index["metadata_length"]
)
Can you print it and see what it is, like so:
print(str(self._read(
    index["metadata_offset"], index["metadata_offset"] + index["metadata_length"]
)))
Hi @henrypinkard and @CristhianPerdomo,
I opened an issue about the same error; see issue #432.
My error was exactly as you describe: Python calls '_read_channel_names()', which tries to read_metadata, which then calls json to decode some variable 's'.
I opened this up in an IDE with a debugger and had it stop just before the last call to decode. Using the debugger I checked the variable s and found that the byte causing the error is a plus-minus symbol (±). However, my IDE had no problem decoding the string, even with utf-8, so I was very confused by this.
I took inspiration from a similar read_metadata() function in a different file and was able to fix my issue, as described in issue #432, by decoding with .decode("iso-8859-1") before the bytes get passed to json.loads(). I do not understand why this worked, but it did.
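Here is a minimal sketch of why that decode can never fail (using ± as the offending character, and assuming the metadata was written in a Latin-1-style encoding):

import json

# 0xB1 is the single Latin-1 byte for '±'; valid UTF-8 would use b'\xc2\xb1'
raw = b'{"device": "colibri \xb1"}'

try:
    raw.decode("utf-8")  # fails: a lone 0xB1 is not valid UTF-8
except UnicodeDecodeError as e:
    print(e)

# ISO-8859-1 maps every byte value 0-255 to a character, so it always succeeds
print(json.loads(raw.decode("iso-8859-1")))  # {'device': 'colibri ±'}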
Henry responded very quickly, with some explanation, a suggested test to see whether encodings work as expected on my PC, and some suggested reading to investigate this problem.
The code @henrypinkard sent to test the encoding of ± worked perfectly fine in a Jupyter cell.
Regrettably, I got nowhere digging into how encodings work and why this strange thing was happening.
Thanks, Svilen
Thanks for your answer @henrypinkard. The following was part of the printed output:
"COM4-Description":"Serial port driver (boost:asio)","Cobolt-Vendor":"H\xdcBNER Photonics".
And here is one of the problems: HÜBNER Photonics is the company from which we bought our Cobolt laser, and the Ü is the character causing the error.
Later in the text, I have something similar:
"Internal","pco_camera-Signal 4 (Status Expos) Timing":"Show time of \'First Line\'",
And:
"Cobolt-Description":"Cobolt Controller by Karl Bellv\xe9 with contribution from Alexis Maizel".
So it may be that changing the decoding in some of the functions would solve the problem; however, I don't know exactly where to make the change, or what the optimal solution might be.
Hi @svikolev, thanks! I saw your comment a little bit late, but I will try your trick ;)
@CristhianPerdomo After getting the latest nightly build of Micro-Manager and updating pycromanager, I got the exact same error as you, triggered by the ± character from my Colibri. I fixed it by decoding with "iso-8859-1" before json.loads, as described before. Reiterating that I don't know why this works; I took the idea from bridge.py, acquisitions.py, and data.py, which all do the same kind of thing when calling json.loads().
I applied .decode("iso-8859-1") to the end of line 91 in nd_tiff_current.py, so now the read_metadata function is:
def read_metadata(self, index):
    return json.loads(
        self._read(
            index["metadata_offset"], index["metadata_offset"] + index["metadata_length"]
        ).decode("iso-8859-1")
    )
Thanks for all the testing @svikolev and @CristhianPerdomo!
I think I finally figured it out. Encodings (ISO-8859-1, UTF-8, UTF-16, etc.) are maps from numbers to characters, used to convert byte values to text.
Metadata is saved to disk as a string (of JSON), and when saving, some encoding has to be applied to the string in order to convert its characters to bytes. The NDTiff specification says that all metadata should use the UTF-8 encoding, but I noticed in the Java code that writes NDTiffs that the encoding used was just the system default, not explicitly UTF-8. I'm guessing it defaulted to UTF-8 most of the time, except in your cases.
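As an illustration (a sketch, assuming the acquisition PC uses a Western-European Windows locale), Python reports the same "system default" encoding a non-UTF-8-aware writer would pick up, and that default agrees with ISO-8859-1 for these characters:

import locale

# The locale-dependent "system default", e.g. 'cp1252' on many Windows PCs
print(locale.getpreferredencoding())

# cp1252 and iso-8859-1 agree on Ü and é, so either decode recovers the text
print(b"\xdc\xe9".decode("cp1252"))      # Üé
print(b"\xdc\xe9".decode("iso-8859-1"))  # Üé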
I made UTF-8 explicit so that future datasets won't have this problem:
https://github.com/micro-manager/NDTiffStorage/pull/66
This change will be available in the new nightly builds.
However, if this is right, it means that the data you've already collected has its metadata encoded with an encoding other than UTF-8. @svikolev's solution of using decode("iso-8859-1") (or maybe decode("utf-16")) should work for this, if the same encoding is present on all of your datasets.
I just updated the format version to 3.1 with this fix. And with #68 it is now possible to call dataset.minor_version and dataset.major_version to query this version.
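For example (a sketch, assuming a dataset opened with ndtiff's Dataset class and the version properties from #68; the path is hypothetical):

from ndtiff import Dataset

dataset = Dataset(r"C:\data\acqStack")  # hypothetical saved dataset
print(dataset.major_version, dataset.minor_version)
# Datasets written as version 3.1 or later have metadata explicitly UTF-8 encoded
if (dataset.major_version, dataset.minor_version) >= (3, 1):
    print("metadata is UTF-8")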
This could probably be fixed more generally in the current read_metadata function by adding a try/except block and switching to alternative encodings if UTF-8 fails. I opened an issue for it: https://github.com/micro-manager/NDTiffStorage/issues/67. It would be a great addition if either of you is interested in making it.
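A minimal sketch of what that fallback could look like (hypothetical; the exact encoding order and error handling are what issue 67 is meant to settle):

import json

def read_metadata(self, index):
    raw = self._read(
        index["metadata_offset"], index["metadata_offset"] + index["metadata_length"]
    )
    # Try the encoding the spec mandates first, then encodings that older
    # writers may have used as the system default
    for encoding in ("utf-8", "iso-8859-1", "utf-16"):
        try:
            return json.loads(raw.decode(encoding))
        except (UnicodeDecodeError, json.JSONDecodeError):
            continue
    raise ValueError("Could not decode metadata with any attempted encoding")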
@svikolev, I tried this, and it worked perfectly well. Thanks for your suggestion!
@henrypinkard, thanks for your explanation and help. Everything you have said makes a lot of sense and fully explains the bug. Also, thanks for solving this issue; it is nice that newer versions will make the UTF-8 encoding explicit to avoid these kinds of events. As I said above, @svikolev's solution worked well, but, as you mentioned, it would be a good idea to implement a more general solution in the read_metadata function that covers a broader range of cases; so your invitation to collaborate on the function is joyously received!
Best
Great! Happy to help if you need guidance
Bug report
Bug summary
First of all, thanks a lot for everything you have done with pycromanager, it has been a powerful tool that has automated some processes in the lab!
I have been running pycromanager on Windows 7 machines for a while because of their compatibility with some of our apparatus. Everything works well in Win7. However, we have recently updated one of our PCs to Windows 10, and when I try to run pycromanager there with a simple acquisition function, it gives me the following error:
Expected outcome
Acquisition of stacks
Actual outcome
PS C:\Users\SPIM3\Documents\cod> & C:/Users/SPIM3/AppData/Local/Programs/Python/Python311/python.exe c:/Users/SPIM3/Documents/cod/test.py
utf-8
Traceback (most recent call last):
File "c:\Users\SPIM3\Documents\cod\test.py", line 43, in
with Acquisition(directory=save_dir, name=r"acqStack", show_display=False) as acq:
File "C:\Users\SPIM3\AppData\Local\Programs\Python\Python311\Lib\site-packages\pycromanager\acquisitions.py", line 440, in exit
self.await_completion()
File "C:\Users\SPIM3\AppData\Local\Programs\Python\Python311\Lib\site-packages\pycromanager\acquisitions.py", line 380, in await_completion
self._check_for_exceptions()
File "C:\Users\SPIM3\AppData\Local\Programs\Python\Python311\Lib\site-packages\pycromanager\acquisitions.py", line 452, in _check_for_exceptions
raise self._exception
File "C:\Users\SPIM3\AppData\Local\Programs\Python\Python311\Lib\site-packages\pycromanager\acquisitions.py", line 194, in _storage_monitor_fn
axes = dataset._add_index_entry(index_entry)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\SPIM3\AppData\Local\Programs\Python\Python311\Lib\site-packages\ndtiff\nd_tiff_current.py", line 401, in _add_index_entry
self._read_channel_names()
File "C:\Users\SPIM3\AppData\Local\Programs\Python\Python311\Lib\site-packages\ndtiff\nd_tiff_current.py", line 422, in _read_channel_names
channel_name = self.read_metadata(**axes)["Channel"]
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\SPIM3\AppData\Local\Programs\Python\Python311\Lib\site-packages\ndtiff\nd_tiff_current.py", line 370, in read_metadata
return self._do_read_metadata(axes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\SPIM3\AppData\Local\Programs\Python\Python311\Lib\site-packages\ndtiff\nd_tiff_current.py", line 573, in _do_read_metadata
return reader.read_metadata(index)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\SPIM3\AppData\Local\Programs\Python\Python311\Lib\site-packages\ndtiff\nd_tiff_current.py", line 88, in read_metadata
return json.loads(
^^^^^^^^^^^
File "C:\Users\SPIM3\AppData\Local\Programs\Python\Python311\Lib\json__init__.py", line 341, in loads
s = s.decode(detect_encoding(s), 'surrogatepass')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 588: invalid continuation byte
I have tried changing the encoding in VS Code, as well as using the r prefix before the strings to treat them as raw strings, and other little tricks to encode/decode strings, but nothing works. I cannot understand what could be wrong :/
Version Info
Thanks!