tbeu / matio

MATLAB MAT File I/O Library
https://matio.sourceforge.io
BSD 2-Clause "Simplified" License

UTF8 char array wrong size #79

Closed Joker2k closed 6 years ago

Joker2k commented 6 years ago

The problem is an incorrect matvar->nbytes calculation for UTF-8 data: the number of bytes per character depends on its code point, so matvar->nbytes = nmemb*matvar->data_size is wrong whenever the string contains multi-byte characters. It seems the library would have to be substantially rewritten to fully support UTF-8. That's sad.
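To illustrate the mismatch, here is a minimal C sketch. The helper names utf8_codepoints and naive_nbytes are hypothetical and are not part of the matio API; the sketch only assumes valid, NUL-terminated UTF-8 input and compares the character count with the real byte length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not matio API): count Unicode code points in a
 * NUL-terminated UTF-8 string by skipping continuation bytes (10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n; /* lead byte or ASCII byte starts a new code point */
    }
    return n;
}

/* Hypothetical helper: the fixed-width size formula from the report,
 * i.e. number of elements times a constant element size. */
size_t naive_nbytes(size_t nmemb, size_t data_size)
{
    return nmemb * data_size;
}
```

For a string made of four two-byte characters followed by a newline, utf8_codepoints returns 5 while strlen returns 9, so naive_nbytes(5, 1) underestimates the buffer by four bytes.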

tbeu commented 6 years ago

Unicode support was planned for v1.6.0 some 5 years ago, before Chris Hulbert resigned from development. Sorry for that.

Can you post the MATLAB code that creates such a MAT-file (including the character encoding configuration)?

See also #34.

tbeu commented 6 years ago

Example file generation with a UTF-16 encoded string:

% https://stackoverflow.com/a/12417553
slCharacterEncoding('UTF-8');
str = char([1576, 1580,  1604, 1740, 10]);
save str_v7.mat str -v7
% caution: v6 does not support Unicode
% save str_v6.mat str -v6

tbeu commented 6 years ago

You might want to check https://github.com/scipy/scipy/issues/4431 and blog.omega-prime.co.uk (with its counterexample files UTF8-Counterexample.zip and UTF32-Counterexample.zip) to see that MATLAB (tested again in R2017a) properly supports neither UTF-8 nor UTF-32. Nevertheless, I discovered and fixed some issues in reading and writing UTF-16/UINT16 encoded character data (commits fa95add9eeef8d3215f1de210cd4f0e490b677d4, 6321cda851e4153be6da08e5d0f0d4f92ef20db4 and d111a5a867c847ed3e9008ca69f62bd5ea79a90e).