tbeu / matio

MATLAB MAT File I/O Library
https://matio.sourceforge.io
BSD 2-Clause "Simplified" License
330 stars 97 forks source link

UTF-8 string in level 5 mat file fails to load or is corrupted #164

Closed russelburgess closed 3 years ago

russelburgess commented 3 years ago

I am having issues with UTF-8 strings causing mat files to fail to load. Below is a test case that writes a character array named "x" containing the string "test°test" to test.mat. Until R2020a this loaded in matlab, though with some corruption (a couple of random bytes on the end of the string), in R2020b this fails with "Error using load / Cannot read file D:...\test.mat."

#include <stdio.h>
#include <string.h>
#include "matio.h"

int main()
{
  mat_t *matfp;
  matvar_t *matvar;

  char data[] = "test\u00B0test"; // "test" + degree symbol + "test"
  size_t dims[2];

  matfp = Mat_CreateVer("test.mat", "Level 5 mat file UTF8 test", MAT_FT_MAT5);
  if (matfp == NULL) {
    printf("MAT - Error creating file.\n");
    return 1;
  }

  dims[0] = 1;
  dims[1] = strlen(data); // Returns the byte length of data (10) not the number of UTF-8 codes (9)
  matvar = Mat_VarCreate("x", MAT_C_CHAR, MAT_T_UTF8, 2, dims, (char *)data, MAT_F_DONT_COPY_DATA);
  if (matvar == NULL) {
    printf("MAT - DATA variable creation failed.\n");
    return 1;
  }

  Mat_VarWrite(matfp, matvar, MAT_COMPRESSION_NONE);
  Mat_VarFree(matvar);

  Mat_Close(matfp);

  return 0;
}

It would appear that the issue is caused by the dims[] being used for both the dimensions of the character array and for the number of bytes of data. That is, matlab expects the dimensions of the string "test°test" to be 1 x 9 (counting "°" as one), but the string occupies 10 bytes in the mat file, and matio writes the same number for both.

This can be seen by inspecting the hex dump of the file produced ("test.mat'):

```
4C 65 76 65 6C 20 35 20 6D 61 74 20 66 69 6C 65
20 55 54 46 38 20 74 65 73 74 00 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 00 01 49 4D
0E 00 00 00 40 00 00 00 06 00 00 00 08 00 00 00
04 00 00 00 00 00 00 00 05 00 00 00 08 00 00 00
01 00 00 00 0A 00 00 00 01 00 01 00 78 00 00 00
10 00 00 00 0A 00 00 00 74 65 73 74 C2 B0 74 65
73 74 00 00 00 00 00 00
```

The first bolded byte is specifying the second dimension of the array; the second bolded byte is specifying the number of bytes of data in the element. Both of these originate in dims[1].

By changing that first byte from 0A to 09, R2020a and R2020b both load the file correctly.

tbeu commented 3 years ago

Thanks for reporting. Especially the changed behavior in MATLAB is interesting. But, isn't this an application issue actually? If you replace the call of strlen by something like

// See https://stackoverflow.com/a/32936928/8520615
size_t count_utf8_code_points(const char *s) {
    size_t count = 0;
    while (*s) {
        count += (*s++ & 0xC0) != 0x80;
    }
    return count;
}

you get the expected length (on Linux). By the way, running strlen with MSVC (on Win) also gives the expected length. Thus, there definetely is a platform dependency here, which I have not yet figured out.

See also https://godbolt.org/z/T6GMb5 for some experimenting.

tbeu commented 3 years ago

Ah, I understand better now. Array length and number of bytes must be set differently. Let me see what I can do.

tbeu commented 3 years ago

By changing that first byte from 0A to 09, R2020a and R2020b both load the file correctly.

It would not in R2019b where there is no proper UTF-8 support for MAT files. See also #79. It seems The MathWorks recently has improved UTF-8 support in MATLAB. Can you check if the files UTF8-Counterexample.zip and UTF32-Counterexample.zip (taken from blog.omega-prime.co.uk) can be successfully loaded in MATLAB R2020a and R2020b?

By the way, running strlen with MSVC (on Win) also gives the expected length. Thus, there definetely is a platform dependency here, which I have not yet figured out.

For MSVC I need to set char data[] = "test\xC2\xB0test"; // "test" + degree symbol + "test" to get the expected behaviour.

I also tried to address the issue by 836f8dd91a288ad009e33856c22eb1298e62a726 (still WIP). Nevertheless, you need to adapt the dimension in the application code, for example dims[1] = count_utf8_code_points(data);.

russelburgess commented 3 years ago

Ah, can't believe I missed that previous issue #! Neither of UTF8-Counterexample.zip and UTF32-Counterexample.zip loaded in R2020a or R2020b, both fail with the same error as before "Error using load / Cannot read file D:...\UTF8-Counterexample.mat." Presumably MATLAB still can't read code points > 2 bytes then.

However using commit 836f8dd and count_utf8_code_points() does work (for the example string I gave, loads in both R2020a and R2020b). I should have mentioned before that I'm working in msys2, gcc 9.3.0, under Windows 10 64-bit.

I wasn't aware strlen() was platform dependent so that's good to know. FWIW for me strlen("test\u00B0test") returns:

But... strlen("test\xC2\xB0test") returns:

Which is...confusing. As best as I can tell MSVC is turning "\u00B0" into literally just the byte B0, whereas gcc is turning it into C2B0.

It seems like that commit will fix the issue as much as can be fixed on matio's end, so I can close the issue unless there are any other tests in R2020a/b you'd like me to run?

tbeu commented 3 years ago

Thanks for confirmation and further analysis. Please leave this issue open for now as I need to run the testsuite through latest MATLAB. I will close it once it is properly addressed and commited to master. Afterwards I will release a new version.

russelburgess commented 3 years ago

Awesome, thanks for the prompt response to all this!

tbeu commented 3 years ago

It's now fixed on master branch.