mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License

BUG: Error in handling fiff tags with "bad" text encoding #1026

Closed jkauramaki closed 10 years ago

jkauramaki commented 10 years ago

It seems that there is a small problem decoding the experimenter name tag from raw fiff files, at least with the latest mne-python; I remember succeeding some months ago with files from the same dataset (with the latest mne-python back then). It seems that my full name, with the Scandinavian letter "ä", was used when the user account was created on the MEG acquisition computer (personally I would have simply used "a", but I guess the admin had a Finnish keyboard). This letter, however, seems to be stored in a non-UTF-8 format (ISO-8859-1/Latin-1 is my best guess, or simply broken encoding). The end result is that loading the raw fiff file now fails.

A small code change in fiff/tag.py (line 347) to simply

tag.data = str(tag.data.tostring().decode(encoding='UTF-8'))

is not enough, as it results in the error "UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 13: invalid continuation byte". A similar error comes up when forcing 'ISO-8859-1' encoding, so I guess the encoding is simply broken. However, as an initial workaround, changing the line to e.g.

tag.data = str(tag.data.tostring().decode(encoding='UTF-8',errors='ignore'))

seems to work fine.
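For illustration, here is a minimal Python 3 sketch of what the two decode calls do with a lone 0xe4 byte, which is "ä" in ISO-8859-1/Latin-1. The sample bytes and the helper name `decode_tag_text` are hypothetical and are not part of mne-python. Note that `errors='ignore'` silently drops the offending byte, while a Latin-1 fallback keeps the character. Decoding as Latin-1 accepts every byte value, so the failure seen when forcing 'ISO-8859-1' presumably came from the surrounding `str()` call under Python 2 rather than from the decode itself.

```python
def decode_tag_text(raw_bytes):
    """Decode text bytes as UTF-8, falling back to Latin-1 (hypothetical helper)."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte 0x00-0xff to a code point, so this cannot fail.
        return raw_bytes.decode("latin-1")


# Hypothetical tag content: 0xe4 is "ä" in Latin-1 but invalid as UTF-8 here.
sample = b"n\xe4me"

# The workaround from the report: the undecodable byte is dropped without warning.
ignored = sample.decode("utf-8", errors="ignore")

# The fallback keeps the character instead of discarding it.
recovered = decode_tag_text(sample)

# Properly encoded UTF-8 input is unaffected by the fallback.
unchanged = decode_tag_text("näme".encode("utf-8"))
```

The trade-off is that a Latin-1 fallback never raises, so bytes in some other legacy encoding would decode to the wrong characters instead of failing loudly.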

And yes, I know I could attempt to change the relevant code for good (if that minor change shows no side effects), but unfortunately I'm still learning the basics of Python (i.e. only running slightly modified MNE-python code examples) AND GitHub (i.e., I created an account just for this :) and I have no idea what kind of encoding the raw fiff file should use in string tags.

Full traceback in case of problematic file

In [132]: raw = fiff.Raw(raw_fname)
Opening raw data file /datadir/megraw_mc/fs1a_raw_mc_tsss.fif...
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/datadir/results/tmp/<ipython-input-132-233f57df4bce> in <module>()
----> 1 raw = fiff.Raw(raw_fname)

/virtualenv/mne_env1/src/mne/mne/fiff/raw.py in __init__(self, fnames, allow_maxshield, preload, proj, compensation, add_eeg_ref, verbose) 
/virtualenv/mne_env1/src/mne/mne/utils.py in verbose(function, *args, **kwargs)
    385         return ret
    386     else:
--> 387         ret = function(*args, **kwargs)
    388         return ret
    389

/virtualenv/mne_env1/src/mne/mne/fiff/raw.py in __init__(self, fnames, allow_maxshield, preload, proj, compensation, add_eeg_ref, verbose)  
     94
     95         raws = [self._read_raw_file(fname, allow_maxshield, preload,
---> 96                                     compensation) for fname in fnames]
     97         _check_raw_compatibility(raws)
     98

/virtualenv/mne_env1/src/mne/mne/fiff/raw.py in _read_raw_file(self, fname, allow_maxshield, preload, compensation, verbose)

/virtualenv/mne_env1/src/mne/mne/utils.py in verbose(function, *args, **kwargs)
    385         return ret
    386     else:
--> 387         ret = function(*args, **kwargs)
    388         return ret
    389

/virtualenv/mne_env1/src/mne/mne/fiff/raw.py in _read_raw_file(self, fname, allow_maxshield, preload, compensation, verbose)
    180
    181         #   Read the measurement info
--> 182         info, meas = read_meas_info(fid, tree)
    183
    184         #   Locate the data of interest

/virtualenv/mne_env1/src/mne/mne/fiff/meas_info.py in read_meas_info(fid, tree, verbose)

/virtualenv/mne_env1/src/mne/mne/utils.py in verbose(function, *args, **kwargs)
    385         return ret
    386     else:
--> 387         ret = function(*args, **kwargs)
    388         return ret
    389

/virtualenv/mne_env1/src/mne/mne/fiff/meas_info.py in read_meas_info(fid, tree, verbose)
    246                 ctf_head_t = cand
    247         elif kind == FIFF.FIFF_EXPERIMENTER:
--> 248             tag = read_tag(fid, pos)
    249             experimenter = tag.data
    250         elif kind == FIFF.FIFF_DESCRIPTION:

/virtualenv/mne_env1/src/mne/mne/fiff/tag.py in read_tag(fid, pos, shape, rlims)
    345                                             shape=shape, rlims=rlims)
    346                 # Use unicode or bytes depending on Py2/3
--> 347                 tag.data = str(tag.data.tostring().decode())
    348             elif tag.type == FIFF.FIFFT_DAU_PACK16:
    349                 tag.data = _fromstring_rows(fid, tag.size, dtype=">i2",

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 13: ordinal not in range(128)

dengemann commented 10 years ago

Thanks for reporting, @jkauramaki. I think it should be straightforward to add a test that adds UTF-8 names at runtime and saves a temp file to reproduce and ultimately tackle this issue.

larsoner commented 10 years ago

I can tackle this next week sometime if nobody else wants to.

dengemann commented 10 years ago

I wrote a test locally which basically reproduces reading and writing errors related to unicode.

https://github.com/dengemann/mne-python/commit/7a48e7422518edbd61b11eaf116c92c38d4dbb56

produces:

Overwriting existing file.
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-3-662bcbb403de> in <module>()
----> 1 test_raw.test_io_raw()

/Users/denisaengemann/anaconda/lib/python2.7/site-packages/mne/fiff/tests/test_raw.py in test_io_raw()
    360     raw.info['description'] = text_type('äöé')
    361     temp_file = op.join(tempdir, 'raw.fif')
--> 362     raw.save(temp_file, overwrite=True)
    363     raw = Raw(tmp_file)
    364

/Users/denisaengemann/anaconda/lib/python2.7/site-packages/mne/fiff/raw.pyc in save(self, fname, picks, tmin, tmax, buffer_size_sec, drop_small_buffer, proj, format, overwrite, verbose)

/Users/denisaengemann/anaconda/lib/python2.7/site-packages/mne/utils.pyc in verbose(function, *args, **kwargs)
    385         return ret
    386     else:
--> 387         ret = function(*args, **kwargs)
    388         return ret
    389

/Users/denisaengemann/anaconda/lib/python2.7/site-packages/mne/fiff/raw.pyc in save(self, fname, picks, tmin, tmax, buffer_size_sec, drop_small_buffer, proj, format, overwrite, verbose)
    986
    987         outfid, cals = start_writing_raw(fname, info, picks, type_dict[format],
--> 988                                          reset_range=reset_dict[format])
    989         #
    990         #   Set up the reading parameters

/Users/denisaengemann/anaconda/lib/python2.7/site-packages/mne/fiff/raw.pyc in start_writing_raw(name, info, sel, data_type, reset_range)
   1908         cals.append(info['chs'][k]['cal'] * info['chs'][k]['range'])
   1909
-> 1910     write_meas_info(fid, info, data_type=data_type, reset_range=reset_range)
   1911
   1912     #

/Users/denisaengemann/anaconda/lib/python2.7/site-packages/mne/fiff/meas_info.pyc in write_meas_info(fid, info, data_type, reset_range)
    503         write_string(fid, FIFF.FIFF_EXPERIMENTER, info['experimenter'])
    504     if info.get('description') is not None:
--> 505         write_string(fid, FIFF.FIFF_DESCRIPTION, info['description'])
    506     if info.get('proj_id') is not None:
    507         write_int(fid, FIFF.FIFF_PROJ_ID, info['proj_id'])

/Users/denisaengemann/anaconda/lib/python2.7/site-packages/mne/fiff/write.pyc in write_string(fid, kind, data)
     74     """Writes a string tag"""
     75     data_size = 1
---> 76     _write(fid, str(data), kind, data_size, FIFF.FIFFT_STRING, '>c')
     77
     78

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

I know this is writing, not reading, but the culprit should be the same.
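Both tracebacks name the 'ascii' codec: under Python 2, `str()` on a unicode object and `.decode()` with no argument both fall back to ASCII. A minimal Python 3 sketch that reproduces the same two failures by requesting ASCII explicitly (the helper name `ascii_failures` is hypothetical):

```python
def ascii_failures(text):
    """Return which conversion directions fail when ASCII is the codec."""
    failures = []
    try:
        # Writing side: text -> bytes through the ASCII codec.
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        failures.append(("encode", exc.reason))
    try:
        # Reading side: Latin-1 bytes -> text through the ASCII codec.
        text.encode("latin-1").decode("ascii")
    except UnicodeDecodeError as exc:
        failures.append(("decode", exc.reason))
    return failures


failures = ascii_failures("äöé")
plain = ascii_failures("abc")  # pure ASCII text passes in both directions
```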

@Eric89GXL we should think about a central fix which tackles this issue for all io related functions.

I guess we will see things like this wherever we write to fiff files...

larsoner commented 10 years ago

Yeah. We had to tweak the writing a bit for Python3 support, which would explain why this issue didn't exist before but does now. Python3 requires explicit encoding/decoding for bytes<->string conversions, and I assume the problem is with how we did that.
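A rough sketch of what explicit bytes<->string conversion at the I/O boundary could look like, with one encoding chosen in one place. The function names, the UTF-8 default, and the length-prefixed layout are assumptions for illustration only. This is not the actual FIFF tag layout and not mne-python's implementation.

```python
import io
import struct


def write_text(fid, text, encoding="utf-8"):
    """Write text as a big-endian int32 byte count followed by the encoded bytes."""
    data = text.encode(encoding)  # explicit str -> bytes
    fid.write(struct.pack(">i", len(data)))
    fid.write(data)


def read_text(fid, encoding="utf-8"):
    """Read back a string written by write_text."""
    (size,) = struct.unpack(">i", fid.read(4))
    return fid.read(size).decode(encoding)  # explicit bytes -> str


# Round-trip the same non-ASCII text used in the failing test above.
buf = io.BytesIO()
write_text(buf, "äöé")
buf.seek(0)
round_tripped = read_text(buf)
```

Keeping the encode and decode calls in one pair of helpers means every string tag goes through the same codec, which is the kind of central fix suggested above.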