underchemist / nanonispy

A small library written in Python 3 to parse Nanonis binary and ASCII files
MIT License

UnicodeDecodeError in read_byte #3

Closed: jhellerstedt closed this issue 6 years ago

jhellerstedt commented 6 years ago

//anaconda/lib/python3.5/site-packages/nanonispy/read.py in start_byte(self)
    112         for line in f:
    113             # Convert from bytes to str
--> 114             print(line)
    115             entry = line.strip().decode()
    116             if tag in entry:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 10: invalid continuation byte

Hi,

I encounter this semi-regularly but can't reliably reproduce it, unfortunately. Sometimes adding errors='ignore' to the decode() call works, sometimes not.
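
For context, a minimal illustration of the behaviour (only the 0xec byte comes from the traceback above; the surrounding header text is made up): strict UTF-8 decoding raises, while errors='ignore' suppresses the exception but silently drops the byte.

raw = b"Experiment\tZ spectroscopy \xec"  # hypothetical header line ending in a stray 0xec byte

try:
    raw.decode()  # strict UTF-8 decoding raises on the stray byte
except UnicodeDecodeError as err:
    print(err)

# 'ignore' avoids the exception but drops the undecodable byte entirely.
print(raw.decode(errors='ignore'))  # 'Experiment\tZ spectroscopy '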

jhellerstedt commented 6 years ago

OK, this is happening because we're inadvertently putting Czech-language text into the comment box, and Nanonis will spit out ISO-8859 characters.
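
If that's the cause, the offending byte is just a single-byte-encoded character that UTF-8 rejects. A rough check, assuming the ISO-8859 family (this is just an illustration, not something nanonispy does):

# 0xec is a valid character in the single-byte ISO-8859 encodings
# (e.g. 'ě' in ISO-8859-2, which covers Czech, or 'ì' in ISO-8859-1),
# but as a UTF-8 lead byte it would have to be followed by continuation bytes.
print(b"\xec".decode('iso-8859-2'))
print(b"\xec".decode('latin-1'))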

jhellerstedt commented 6 years ago
import os

# Re-encode every .dat file in the working directory as UTF-8 before parsing.
for ii in os.listdir(os.getcwd()):
    if ii.endswith(".dat"):
        with open(ii, 'rb') as f:
            raw = f.read()
        try:
            # Most files are already valid UTF-8 and pass through unchanged.
            fixed = raw.decode('utf-8').encode('utf-8')
        except UnicodeDecodeError:
            # Fall back to Latin-1 for files with ISO-8859 bytes in the header.
            fixed = raw.decode('latin-1').encode('utf-8')
        with open(ii, 'wb') as f:
            f.write(fixed)

Doing this before calling nanonispy.read.Spec fixes my problem; I'm not sure if there's a smarter way to incorporate it into your read function.
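
One possibility, sketched under the assumption that the header is read line by line as in the traceback above (decode_entry is a made-up helper, not part of nanonispy's API):

def decode_entry(line, encodings=('utf-8', 'latin-1')):
    """Decode a header line, falling back through a list of encodings."""
    for enc in encodings:
        try:
            return line.strip().decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep parsing but drop the undecodable bytes.
    return line.strip().decode('utf-8', errors='ignore')

Since latin-1 maps every byte value to some character, the fallback never raises; at worst a few accented characters in the comment field come out wrong.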

underchemist commented 6 years ago

Hi,

Sorry for the lack of a reply; I've been away for quite a while. I'll take a look at fixing it more generally. Do you have a sample file that reproduces it reliably? I can try to recreate it myself using what you described, but if there's something you know triggers it, that would be easier.

Cheers

jhellerstedt commented 6 years ago

No worries. Here's a Dropbox link to a spectroscopy file that throws the error: https://www.dropbox.com/s/va14wi26gplsfam/Z-Spectroscopy001.dat?dl=0

You could probably punt on this problem on the grounds that it's a Specs/Nanonis issue with their software not being UTF-8 compatible, but maybe there's a way to integrate the fix I mentioned above into your scheme.

underchemist commented 6 years ago

Just letting you know I pushed and merged #7, which I think is a good fix for reading in and handling non-UTF-8 characters. Let me know what you think.