MaartenBent closed 2 years ago
Thanks for reporting. I need to check thoroughly since it is also different for `MAT_FT_MAT5` files. Also, some unit tests are failing now.
I noticed the test failure as well. It assumes `8-bit` in the output, but with this change it is now `16-bit`.
I also noticed something else related to writing characters to mat73. Does the following code mean that it writes both 8-bit and 16-bit characters using the same (16-bit) class? https://github.com/tbeu/matio/blob/1d200ac6e36790b093590388a59c607e1e27e51a/src/mat73.c#L1610-L1618
Yes, that's what it means. See also here: https://github.com/tbeu/matio/blob/1d200ac6e36790b093590388a59c607e1e27e51a/src/mat5.c#L748
I didn't know about that. The comment is in mat5, is this the same for mat73/hdf5? I don't have Matlab myself so I can not verify if created files are read correctly by Matlab.
I was playing around with creating a mat73/hdf5 file that contains utf8, utf16 and utf32 text, and patched `mat73.c` so it reads and writes this correctly. It seems to work with matio/matdump, but I don't know about Matlab. I'll add it in a commit, some of the changes might be useful anyway.
After the above commit my test code seems to work fine for mat5.
The special case is only for `MAT_T_INT8`, not for `MAT_T_UTF8` and `MAT_T_UINT8`.
When changing the data types to `MAT_T_INT8` or `MAT_T_INT16` in my test, either matio fails writing, or matdump fails reading the file.
When enabling compression and using data type `MAT_T_UINT8`, there is also an error (`fields[5], Uncompressed type not MAT_T_MATRIX`). Using data type `MAT_T_UTF8` works fine.
Thanks, I will check in MATLAB R2021b later. Please have a read of this blog post (as linked from the wiki) where you can read that MATLAB does not support its very own MAT5 file format specification for UTF-8 and UTF-32. That was the reason for the conversion of UTF-8 encoded strings to UTF-16 in `Mat_VarWriteChar73` (since it also holds for MAT73).
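To make the UTF-8 to UTF-16 conversion mentioned above concrete, here is a minimal, self-contained sketch in C. It is not the code in `Mat_VarWriteChar73`: the function name `utf8_to_utf16` and its signature are hypothetical, and for brevity it does not reject overlong encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of matio): decode `len` bytes of UTF-8 from
 * `src` and write UTF-16 code units to `dst`, which holds `cap` units.
 * Returns the number of code units written, or (size_t)-1 if the input is
 * malformed or `dst` is too small. Overlong encodings are not rejected. */
size_t utf8_to_utf16(const uint8_t *src, size_t len, uint16_t *dst, size_t cap)
{
    size_t i = 0, n = 0;

    while (i < len) {
        uint32_t cp;
        size_t extra, k;
        uint8_t b = src[i++];

        /* The lead byte determines how many continuation bytes follow. */
        if (b < 0x80) {
            cp = b;
            extra = 0;
        } else if ((b & 0xE0) == 0xC0) {
            cp = b & 0x1F;
            extra = 1;
        } else if ((b & 0xF0) == 0xE0) {
            cp = b & 0x0F;
            extra = 2;
        } else if ((b & 0xF8) == 0xF0) {
            cp = b & 0x07;
            extra = 3;
        } else {
            return (size_t)-1; /* stray continuation byte or invalid lead */
        }
        if (extra > len - i)
            return (size_t)-1; /* truncated sequence */

        for (k = 0; k < extra; k++) {
            uint8_t c = src[i++];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }

        /* Surrogate code points and values above U+10FFFF are not valid. */
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return (size_t)-1;

        if (cp < 0x10000) {
            if (n + 1 > cap)
                return (size_t)-1;
            dst[n++] = (uint16_t)cp;
        } else {
            /* Code points outside the BMP become a surrogate pair. */
            if (n + 2 > cap)
                return (size_t)-1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return n;
}
```

Note that the number of UTF-16 code units generally differs from the number of UTF-8 bytes, so a writer that converts this way would also have to adjust the stored dimensions accordingly.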
One option might be to introduce a new compile flag `MATIO_EXTENDED_UNICODE` (similar to `MATIO_EXTENDED_SPARSE`) which distinguishes between the specification and MATLAB capabilities.
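As a rough illustration of how such a flag could gate the behaviour, here is a small C sketch. Everything in it is an assumption made for illustration: `MATIO_EXTENDED_UNICODE` is only proposed in this thread, the `char_type` enum is a stand-in for the real matio type tags, and `stored_char_type` is a hypothetical helper. It assumes that with the flag off all character data is stored as UTF-16 (the MATLAB-compatible choice), and with the flag on the requested encoding is written as the specification allows.

```c
/* Proposed flag from this discussion; assumed here to default to off. */
#ifndef MATIO_EXTENDED_UNICODE
#define MATIO_EXTENDED_UNICODE 0
#endif

/* Illustrative stand-ins for the matio data type tags. */
enum char_type { CHAR_UTF8, CHAR_UTF16, CHAR_UTF32 };

/* Hypothetical helper: return the encoding that would be stored on disk
 * for character data requested in the given encoding. */
enum char_type stored_char_type(enum char_type requested)
{
#if MATIO_EXTENDED_UNICODE
    /* Follow the file format specification: keep the requested encoding. */
    return requested;
#else
    /* Stay within what MATLAB reads: always store UTF-16. */
    (void)requested;
    return CHAR_UTF16;
#endif
}
```

Building with `-DMATIO_EXTENDED_UNICODE=1` would switch this sketch to the specification-following branch.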
Thanks for the links. If this indeed does not work in Matlab I see two options:
1) The current solution where you convert (some) data to UTF-16. But in this case my first commit is needed so matdump also reads it as UTF-16. But so far, only `MAT_T_UTF8` in mat73 is converted. In mat5 nothing is converted, only `MAT_T_INT8` is changed to `MAT_T_UINT16`.
2) Follow the spec and have some files possibly not work in Matlab. Inform the user to use `MAT_T_UTF16` for the best results. This is basically my second commit, removing the mat73 `MAT_T_UTF8` conversion.
Adding UTF-32 support to mat5 and fixing it in mat73 seems simple enough (second and third commit). But the question is if Matlab supports it. Though I still think supporting it in matio is better than the current behaviour of failing to create a mat5 file, or creating a mat73 file with the wrong `matlab_int_decode` value.
Also, any idea why this PR doesn't trigger TravisCI or Coverage?
> Also, any idea why this PR doesn't trigger TravisCI or Coverage?
Quick reply on CI: See https://stackoverflow.com/questions/61495766/pull-requests-from-forks-does-not-trigger-travis-ci. Since you are a contributor, you can push to a branch in the origin repo instead of your own fork.
> I see two options:
Quick reply on that one, too: Yes, `MATIO_EXTENDED_UNICODE`, as edited above, might be a solution.
Thanks for merging the first commit. I'll close this PR.
I'll clean the commits up a bit too, in case this becomes useful in the future.
I noticed an issue when reading utf16 character data from a struct in a mat73 file. I created a mat file using the following code (based on `test_write_char_unicode` and `test_write_struct_char`):
The data type of `matdump test_mat73_struct_utf16.mat` is correct:
But the data type of `matdump -d test_mat73_struct_utf16.mat` is wrong:
This is because `data_type` is forced to `MAT_T_UINT8`. I changed this so it now uses the actual `data_type`. The output is now: