tbeu / matio

MATLAB MAT File I/O Library
https://matio.sourceforge.io
BSD 2-Clause "Simplified" License

UTF-16 support #34

Closed bavay closed 8 years ago

bavay commented 8 years ago

Currently, matio does not support reading UTF-16 strings. Unfortunately, reasonably recent Windows computers use UTF-16, and when MATLAB writes a .mat file it defaults to UTF-16 for strings containing non-ASCII characters.

This means that, with the current version of matio, reading a .mat file created by MATLAB on Windows that contains non-ASCII characters (like the '°' for temperatures or some accented German letters) prints lots of error messages ("Character data not supported type") and fails to read the offending strings, which is even more problematic when these strings hold the units. I am stuck with files like this: they come from an operational weather forecast toolchain and must be forwarded into another operational toolchain, so I absolutely cannot ask for anything to be changed. My only hope is to have some support implemented in matio.

As I see it, there are a few options:

Of course, you might have other ideas or prefer some other options!

tbeu commented 8 years ago

Can you please attach a MAT-file, created by MATLAB, with a UTF-16 char array? Thanks.

bavay commented 8 years ago

I had to zip it in order to attach it... test-UTF16.zip

tbeu commented 8 years ago

Indeed, both functions, ReadCharData and ReadCompressedCharData, are missing the case for MAT_T_UTF16.

Can you check whether 12b7d40 already solves the problem?
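To illustrate the kind of gap described above, here is a minimal sketch of a type dispatch for character data. It is not matio's actual ReadCharData code: the function name count_char_elements and the TAG_* names are made up for this example, and only the numeric tag values follow the MAT-file format document. A tag without a case label falls through to the default branch, which is where an "unsupported type" error would be raised.

```c
#include <assert.h>
#include <stddef.h>

/* Data type tags (subset), values as listed in the MAT-file format document.
 * The TAG_* names are illustrative, not matio identifiers. */
enum { TAG_INT8 = 1, TAG_UINT8 = 2, TAG_UINT16 = 4, TAG_UTF8 = 16, TAG_UTF16 = 17 };

/* Hypothetical helper: compute how many code units a character payload of
 * nbytes holds. Returns 0 on success, -1 if the tag is not handled or the
 * byte count is not a multiple of the code unit size. */
static int count_char_elements(int tag, size_t nbytes, size_t *count)
{
    size_t elem_size;

    assert(count != NULL);
    switch (tag) {
    case TAG_INT8:
    case TAG_UINT8:
    case TAG_UTF8:
        elem_size = 1;
        break;
    case TAG_UINT16:
    case TAG_UTF16: /* without this label, UTF-16 data hits the default branch */
        elem_size = 2;
        break;
    default:
        return -1; /* where an "unsupported type" error would be reported */
    }
    if (nbytes % elem_size != 0)
        return -1;
    *count = nbytes / elem_size;
    return 0;
}
```

With the TAG_UTF16 label in place, a 4-byte payload is reported as two 16-bit code units instead of being rejected.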

bavay commented 8 years ago

This is halfway better... The string is now read without an error message, but the non-ASCII characters remain messed up (and I have no clue what they are encoded as). For example, for the units: '�C'

tbeu commented 8 years ago

It should be little-endian Unicode on Windows (which is little endian). Thus, it should be exactly what you want.

From matfile_format.pdf:

The UTF-16 and UTF-32 encodings are in the byte order specified by the Endian Indicator. UTF-8 is byte order neutral.
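As a rough sketch of what the quoted rule means for a reader, the function below assembles 16-bit code units from raw bytes according to the file's byte order, independent of the host's byte order. The name utf16_units_from_bytes and the big_endian flag are assumptions for illustration; a real reader would derive the flag from the Endian Indicator in the file header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build UTF-16 code units from raw file bytes.
 * big_endian is nonzero when the file stores the high byte first.
 * out must have room for nbytes / 2 code units.
 * Returns the number of code units written. */
static size_t utf16_units_from_bytes(const unsigned char *bytes, size_t nbytes,
                                     int big_endian, uint16_t *out)
{
    size_t i;
    size_t n = nbytes / 2;

    assert(bytes != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        unsigned lo = bytes[2 * i + (big_endian ? 1 : 0)];
        unsigned hi = bytes[2 * i + (big_endian ? 0 : 1)];
        out[i] = (uint16_t)((hi << 8) | lo);
    }
    return n;
}
```

For a little-endian file, the bytes B0 00 43 00 yield the code units 0x00B0 ('°') and 0x0043 ('C'); a big-endian file would store the same string as 00 B0 00 43.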

tbeu commented 8 years ago

If you apply e0d8f44 to matdump, then matdump -d test-UTF16.mat stat.dunit > dunit.txt correctly shows °C in Latin1 encoding in dunit.txt. It's really strange.

Or try test_mat readvar test-UTF16.mat stat > stat.txt, which gives the same result for the dunit field.

bavay commented 8 years ago

OK, now this is clear... I'm sorry, my terminal was configured for UTF-8, so it was garbling the Latin1 output. So your commit does work: it properly reads UTF-16 and outputs ISO-8859-1. Thanks a lot for your very quick reply and commit!
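For anyone hitting the same display problem: a lone Latin1 byte such as 0xB0 ('°') is not a valid UTF-8 sequence, so a UTF-8 terminal shows a replacement character. A minimal sketch of converting ISO-8859-1 output to UTF-8 before printing could look like this. The function name latin1_to_utf8 is made up for this example and is not part of matio.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert n ISO-8859-1 bytes to UTF-8.
 * dst must have room for 2 * n + 1 bytes (worst case plus terminator).
 * Returns the number of bytes written, excluding the terminating NUL. */
static size_t latin1_to_utf8(const unsigned char *src, size_t n,
                             unsigned char *dst)
{
    size_t i;
    size_t j = 0;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[j++] = c; /* ASCII is identical in both encodings */
        } else {
            /* Code points 0x80..0xFF become a two-byte UTF-8 sequence. */
            dst[j++] = (unsigned char)(0xC0 | (c >> 6));
            dst[j++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    dst[j] = '\0';
    return j;
}
```

Applied to the two Latin1 bytes B0 43, this produces the three UTF-8 bytes C2 B0 43, which a UTF-8 terminal renders as °C.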

tbeu commented 8 years ago

But what is still strange is that the UTF-16 string is not Unicode encoded but ISO-8859-1. MATLAB simply is not consistent on this, see e.g. http://blog.omega-prime.co.uk/?p=150.

tbeu commented 8 years ago

I'd like to see some MAT-file where the UTF-16 character array does not reduce to 8-bit encoding.
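One way to look for such a file is to test whether every 16-bit code unit of a char array fits in 8 bits. The sketch below assumes the code units have already been read into a uint16_t buffer; the function name utf16_fits_8bit is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return 1 if all n code units are <= 0xFF, i.e. the
 * array can be narrowed to one byte per character without loss, else 0. */
static int utf16_fits_8bit(const uint16_t *units, size_t n)
{
    size_t i;

    assert(units != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (units[i] > 0xFF)
            return 0;
    }
    return 1;
}
```

The string °C (0x00B0, 0x0043) passes this check, while a string containing the euro sign (0x20AC) would not, so a file with such a character would be a useful test case.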