thenineteen / Semiology-Visualisation-Tool

Data driven 3D brain visualisation of semiology. Semiology to anatomy translator based on over 4600 patients from 309 peer-reviewed articles.
MIT License

UnicodeDecodeError: in position 2803: ordinal not in range(128) #185

Closed fepegar closed 4 years ago

fepegar commented 4 years ago

The characters with accents in déjà vu and déjà vécu are causing this error in the Slicer module. Can we remove the accents?

thenineteen commented 4 years ago

Yes. I seem to remember removing all these accents from the beta version of the database, but as I was not the only one entering data, the accents reappeared in the data and were then added to the SemioDict YAML file, which causes this error.

Q) What's the best way to resolve this?

  1. Remove all accents from the database AND the SemioDict YAML
  2. Somehow add support for these characters?

fepegar commented 4 years ago

I think removing the accents everywhere is easiest, especially because people are unlikely to type them when they search for custom terms.

thenineteen commented 4 years ago

@neurokleos @thenineteen

thenineteen commented 4 years ago

The characters with accents in déjà vu and déjà vécu are causing this error in the Slicer module. Can we remove the accents?

I assumed this happens with Psychic semiology, but at the moment I can't get either the error output or the logs in 3D Slicer to show this message. Please tell me how to reproduce this error so I can fix it and confirm that it is fixed. @fepegar

fepegar commented 4 years ago

I get this as soon as I open the module on 3D Slicer. Maybe it's because I'm using the latest Slicer version.

Python 3.6.7 (default, Aug 18 2020, 23:07:01) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
>>> 
Loading Slicer RC file [/Users/fernando/.slicerrc.py]
Slicer RC file loaded [27/08/2020 09:07:36]
Traceback (most recent call last):
  File "/Users/fernando/git/Semiology-Visualisation-Tool/slicer/SemiologyVisualisation.py", line 89, in setup
    self.logic.installRepository()
  File "/Users/fernando/git/Semiology-Visualisation-Tool/slicer/SemiologyVisualisation.py", line 961, in installRepository
    import mega_analysis
  File "/Users/fernando/git/Semiology-Visualisation-Tool/mega_analysis/__init__.py", line 3, in <module>
    from .crosstab.mega_analysis.custom_semiology_SemioDict_lookup import (
  File "/Users/fernando/git/Semiology-Visualisation-Tool/mega_analysis/crosstab/mega_analysis/custom_semiology_SemioDict_lookup.py", line 11, in <module>
    SemioDict = yaml.load(f, Loader=yaml.FullLoader)
  File "/Applications/Slicer.app/Contents/lib/Python/lib/python3.6/site-packages/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "/Applications/Slicer.app/Contents/lib/Python/lib/python3.6/site-packages/yaml/loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "/Applications/Slicer.app/Contents/lib/Python/lib/python3.6/site-packages/yaml/reader.py", line 85, in __init__
    self.determine_encoding()
  File "/Applications/Slicer.app/Contents/lib/Python/lib/python3.6/site-packages/yaml/reader.py", line 124, in determine_encoding
    self.update_raw()
  File "/Applications/Slicer.app/Contents/lib/Python/lib/python3.6/site-packages/yaml/reader.py", line 178, in update_raw
    data = self.stream.read(size)
  File "/Applications/Slicer.app/Contents/bin/../lib/Python/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2803: ordinal not in range(128)

fepegar commented 4 years ago

Maybe the YAML file can be opened with utf-8 encoding before it is read in SemioDict = yaml.load(f, Loader=yaml.FullLoader). But, as I said, it might be more user friendly to remove the accents, as most users won't even know how to type accents on their keyboards anyway.
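A minimal sketch of the utf-8 idea, using only the standard library. The byte 0xc3 in the traceback is the first byte of the UTF-8 encoding of both é and à, which suggests the file is UTF-8 but is being decoded as ASCII. The file name and its contents below are hypothetical stand-ins, since the real SemioDict path and entries are not shown in this thread; the point is only that passing encoding="utf-8" to open() makes the read independent of the interpreter's default encoding.

```python
import os
import tempfile

# Hypothetical stand-in for the SemioDict YAML content (not the real file).
yaml_text = "Psychic:\n  - déjà vu\n  - déjà vécu\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this illustration.
    path = os.path.join(tmp_dir, "semiodict_example.yaml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(yaml_text)

    # Decoding the same bytes as ASCII reproduces the error from the traceback.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # An explicit utf-8 encoding reads the accented characters correctly.
    # In the module, a handle opened this way would be the one passed to
    # yaml.load(f, Loader=yaml.FullLoader).
    with open(path, encoding="utf-8") as f:
        text = f.read()

print(ascii_failed)
print(text)
```

If this is right, the change in the module would be limited to the open() call that produces f, with the yaml.load line left as it is.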

thenineteen commented 4 years ago

The reason the accents are there is to match the use of these same terms in publications in the database. Not expecting users to use the accents.

If there is a way to accurately encode these characters and match the same characters in the database without altering both the data and the dictionary, then this would be ideal.

My knowhow of UTF8 or others isn't enough to figure this out.

Are you saying if we changed it to full loader utf8, this should work?

fepegar commented 4 years ago

If there is a way to accurately encode these characters and match the same characters in the database without altering both the data and the dictionary, then this would be ideal. My knowhow of UTF8 or others isn't enough to figure this out. Are you saying if we changed it to full loader utf8, this should work?

I think so. But if the user searches for deja vu, it won't match déjà vu. Although I guess you could modify the corresponding function so that it does.
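One way such a modification could look, as a sketch only: strip the accents from both the search term and the stored term before comparing them, so the data and the dictionary keep their original spelling. The function names strip_accents and terms_match are made up for this example and are not taken from the repository.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed (é -> e, à -> a)."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def terms_match(query, term):
    """Compare two terms while ignoring accents and letter case."""
    return strip_accents(query).casefold() == strip_accents(term).casefold()


print(strip_accents("déjà vécu"))
print(terms_match("deja vu", "déjà vu"))
```

With this approach only the comparison is normalised, so the accented terms could stay in the database to match the publications.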

thenineteen commented 4 years ago

The idea is that both the accented and unaccented versions of deja vu will be included in the SemioDict YAML for the predefined semiology list, as they are now, but without causing the errors you've reported. I'll try utf-8.

thenineteen commented 4 years ago

  1. YAML: https://stackoverflow.com/questions/58340498/reading-yaml-file-in-python-with-accents-and-special-charactets
  2. Excel: doesn't support accented characters. Will need to save as CSV and then reconfigure the settings of all the localisation columns. http://www.accompa.com/kb/answer.html?answer_id=262

thenineteen commented 4 years ago

@fepegar I still can't reproduce this error in Slicer or in VS Code.

Please can you send a screenshot of the error and tell me which semiology you used (presumably Psychic?)

It would also be helpful if I could have your versions of YAML and pandas.

thenineteen commented 4 years ago

Replace all é with e: 65 replacements in Semio2Brain Database.
Replace all à with a: 41 replacements in Semio2Brain Database.
Also did the same for the two entries in SemioDict.
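For reference, a rough sketch of how the same kind of replacement could be scripted and counted. The rows below are made-up examples rather than real database entries, replace_and_count is a hypothetical helper, and the counts are per character occurrence, which may differ from how a spreadsheet's find-and-replace reports them.

```python
# Hypothetical cell values standing in for database entries.
rows = ["déjà vu", "déjà vécu", "aura"]

# Accented character -> plain replacement, as described above.
replacements = {"é": "e", "à": "a"}


def replace_and_count(values, mapping):
    """Apply each replacement to every value and count the occurrences replaced."""
    counts = {old: 0 for old in mapping}
    cleaned = []
    for value in values:
        for old, new in mapping.items():
            counts[old] += value.count(old)
            value = value.replace(old, new)
        cleaned.append(value)
    return cleaned, counts


cleaned_rows, counts = replace_and_count(rows, replacements)
print(cleaned_rows)
print(counts)
```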

see commit below