tbeu / matio

MATLAB MAT File I/O Library
https://matio.sourceforge.io
BSD 2-Clause "Simplified" License
330 stars 97 forks source link

Fix reading 16-bit character data from MAT73 files. #181

Closed MaartenBent closed 2 years ago

MaartenBent commented 2 years ago

I noticed an issue when reading utf16 character data from a struct in a mat73 file. I created a mat file using the following code (based on test_write_char_unicode and test_write_struct_char);

int main(int argc, char *argv[])
{
    mat_t *mat = Mat_CreateVer("test_mat73_struct_utf16.mat", NULL, MAT_FT_MAT73);
    if ( mat == NULL)
        return 1;

    size_t num_fields = 1;
    const char *fieldnames[1] = {"field1"};
    size_t dims[2];

    dims[0] = 2;
    dims[1] = 1;
    matvar_t *struct_matvar = Mat_VarCreateStruct("struct1", 2, dims, fieldnames, num_fields);

    const mat_uint16_t str[] = {1576, 273, 1580, 105, 1604, 7879, 1740, 110};
    dims[0] = 2;
    dims[1] = 4;
    matvar_t *matvar = Mat_VarCreate(fieldnames[0], MAT_C_CHAR, MAT_T_UTF16, 2, dims, (void *)str, MAT_F_DONT_COPY_DATA);
    Mat_VarSetStructFieldByName(struct_matvar, fieldnames[0], 1, matvar);
    Mat_VarWrite(mat, struct_matvar, MAT_COMPRESSION_NONE);
    Mat_VarFree(struct_matvar);
    Mat_Close(mat);

    return 0;
}

The data type of matdump test_mat73_struct_utf16.mat is correct:

      Name: struct1
      Rank: 2
Class Type: Structure
Fields[1] {
      Name: field1
      Rank: 2
Dimensions: 2 x 4
Class Type: Character Array
 Data Type: 16-bit, unsigned integer
}

But the data type of matdump -d test_mat73_struct_utf16.mat is wrong:

      Name: struct1
      Rank: 2
Class Type: Structure
Fields[1] {
      Name: field1
      Rank: 2
Dimensions: 2 x 4
Class Type: Character Array
 Data Type: 8-bit, unsigned integer
{
ÿÿÿÿ
ÿiÿn
}
}

This is because data_type is forced to MAT_T_UINT8. I changed this so it now uses the actual data_type. The output is now:

      Name: struct1
      Rank: 2
Class Type: Structure
Fields[1] {
      Name: field1
      Rank: 2
Dimensions: 2 x 4
Class Type: Character Array
 Data Type: 16-bit, unsigned integer
{
بجلی
điện
}
}
tbeu commented 2 years ago

Thanks for reporting. I need to check thoroughly since it also is different for MAT_FT_MAT5 files. Also some unit tests are failing now.

MaartenBent commented 2 years ago

I noticed the test failure as well. It assumes 8-bit in the output, but with this change it is now 16-bit.

I also noticed something else related to writing characters to mat73. Does the following code mean that it writes both 8-bit and 16-bit characters using the same (16-bit) class? https://github.com/tbeu/matio/blob/1d200ac6e36790b093590388a59c607e1e27e51a/src/mat73.c#L1610-L1618

tbeu commented 2 years ago

Yes, that's what it means. See also here: https://github.com/tbeu/matio/blob/1d200ac6e36790b093590388a59c607e1e27e51a/src/mat5.c#L748

MaartenBent commented 2 years ago

I didn't know about that. The comment is in mat5, is this the same for mat73/hdf5? I don't have Matlab myself so I can not verify if created files are read correctly by Matlab.

I was playing around with creating a mat73/hdf5 file that contains utf8, utf16 and utf32 text, and patched mat73.c so it reads and writes this correctly. It seems to work with matio/matdump, but I don't know about Matlab. I'll add it in a commit, some of the changes might be useful anyway.

test code ```c++ int main(int argc, char *argv[]) { mat_t *mat = Mat_CreateVer("test_mat73_struct_utf16.mat", NULL, MAT_FT_MAT73); if ( mat == NULL) return 1; size_t num_fields = 3; const char *fieldnames[3] = {"field1", "field2", "field3"}; size_t dims[2]; dims[0] = 2; dims[1] = 1; matvar_t *struct_matvar = Mat_VarCreateStruct("struct1", 2, dims, fieldnames, num_fields); const mat_uint16_t str1[] = {1576, 273, 1580, 105, 1604, 7879, 1740, 110}; dims[0] = 1; dims[1] = 8; matvar_t *matvar1 = Mat_VarCreate(fieldnames[0], MAT_C_CHAR, MAT_T_UTF16, 2, dims, (void *)str1, MAT_F_DONT_COPY_DATA); Mat_VarSetStructFieldByName(struct_matvar, fieldnames[0], 1, matvar1); const mat_uint8_t str2[] = {117, 116, 102, 56, 32, 116, 101, 120, 116}; dims[0] = 1; dims[1] = 9; matvar_t *matvar2 = Mat_VarCreate(fieldnames[1], MAT_C_CHAR, MAT_T_UTF8, 2, dims, (void *)str2, MAT_F_DONT_COPY_DATA); Mat_VarSetStructFieldByName(struct_matvar, fieldnames[1], 1, matvar2); const mat_uint32_t str3[] = {117, 116, 102, 51, 50, 33}; dims[0] = 1; dims[1] = 6; matvar_t *matvar3 = Mat_VarCreate(fieldnames[2], MAT_C_CHAR, MAT_T_UTF32, 2, dims, (void *)str3, MAT_F_DONT_COPY_DATA); Mat_VarSetStructFieldByName(struct_matvar, fieldnames[2], 1, matvar3); Mat_VarWrite(mat, struct_matvar, MAT_COMPRESSION_NONE); Mat_VarFree(struct_matvar); Mat_Close(mat); return 0; } ```
MaartenBent commented 2 years ago

After above commit my test code seems to work fine for mat5.

The special case is only for MAT_T_INT8, not for MAT_T_UTF8 and MAT_T_UINT8. When changing the data types to MAT_T_INT8 or MAT_T_INT16 in my test, either matio fails writing, or matdump fails reading the file. When enabling compression and using data type MAT_T_UINT8, there is also an error (fields[5], Uncompressed type not MAT_T_MATRIX). Using data type MAT_T_UTF8 works fine.

tbeu commented 2 years ago

Thank, I will check in MATLAB R2021b later. Please have a read of this blog post (as linked from the wiki) where you can read that MATLAB does not support its very own MAT5 file format specification for UTF-8 and UTF32. That was the reason for the conversion of UTF-8 encoded strings to UTF-16 in Mat_VarWriteChar73 (since it also holds for MAT73).

The option might be to introduce a new compile flag MATIO_EXTENDED_UNICODE (similar as MATIO_EXTENDED_SPARSE) which differs between specification and MATLAB capabilities.

MaartenBent commented 2 years ago

Thanks for the links. If this indeed does not work in Matlab I see two options:

1) The current solution where you convert (some) data to UTF-16. But in this case my first commit is needed so matdump also reads it as UTF-16. But so far, only MAT_T_UTF8 in mat73 is converted. In mat5 nothing is converted, only MAT_T_INT8 is changed to MAT_T_UINT16. 2) Follow the spec and have some files possible not work in Matlab. Inform the user to use MAT_T_UTF16 for the best results. This is basically my second commit, removing the mat73 MAT_T_UTF8 conversion.

Adding UTF-32 support to mat5 and fixing it in mat73 seems simple enough (second and third commit). But the question is if Matlab supports it. Though I still think supporting it in matio is better than the current behaviour of failing to create a mat5 file, or creating a mat73 file with the wrong matlab_int_decode value.

Also, any idea why this PR doesn't trigger TravisCI or Coverage?

tbeu commented 2 years ago

Also, any idea why this PR doesn't trigger TravisCI or Coverage?

Quick reply on CI: See https://stackoverflow.com/questions/61495766/pull-requests-from-forks-does-not-trigger-travis-ci. Since you are contributor you can push to repo origin branch instead of your own fork.

tbeu commented 2 years ago

I see two options:

Quick reply on that one, too: Yes, MATIO_EXTENDED_UNICODE as edited above, might be a solution.

MaartenBent commented 2 years ago

Thanks for merging the first commit. I'll close this PR.

I'll clean the commits up a bit too, in case this becomes useful in the future.