tbeu / matio

MATLAB MAT File I/O Library
https://matio.sourceforge.io
BSD 2-Clause "Simplified" License

Not able to properly decode char encoded using MAT_T_UTF8 #189

Closed AndrePatri closed 1 year ago

AndrePatri commented 2 years ago

Hi, I am experiencing problems trying to read a previously written MAT_C_CHAR variable, with UTF8 encoding.

Please note that I am using the latest version of MatIO available at the time of writing this issue.

I have a method which takes as input a std::string& text. Suppose the value assigned to text is "test".

The method creates the associated variable in the following way using MatIO functions:

const int mat_rank = 2;
std::size_t dims[mat_rank];
dims[0] = 1;
dims[1] = text.size();

matvar_t* mat_var = Mat_VarCreate(_name.c_str(),
                               MAT_C_CHAR,
                               MAT_T_UTF8,
                               mat_rank,
                               dims,
                               (void *)text.data(),
                               MAT_F_DONT_COPY_DATA);

Somewhere else in the code the variable is written to file. After loading the file in MATLAB, the variable is displayed correctly. If I load the written file with MatIO and print the variable using MatIO's internal print function, I get the following output:

-------- Printing from MatIO internal function -------- 
Rank: 2
Dimensions: 1 x 4
Class Type: Character Array
 Data Type: 16-bit, unsigned integer
{
test
}

As you see the variable is printed without problems, but the data type somehow changed to MAT_T_UINT16.

However, when I try the following on the loaded variable:

std::string strng;
strng.assign( (char*) mat_var->data, mat_var->dims[1]);

and print it to cout, I get only te. If I print the raw data elements using the following code

for (int i = 0; i < mat_var->dims[1]; i++)
{
    std::cout << "element n. "<< i << ": "<< int(((uint8_t*) mat_var->data)[i]) << std::endl; 
}

I get:

element n. 0: 116
element n. 1: 0
element n. 2: 101
element n. 3: 0

Apparently, zeros are inserted for some reason between characters and, as a consequence, I am not able to reach the end of the word.

If I try the same procedure with the variable before writing, I am able to get the full word without problems. This suggests that something goes wrong when the variable is read (I used the Mat_VarRead method).

Can someone please help me debug this? Is there something I am missing to properly assign the raw data to a std::string? Thanks in advance

AndrePatri commented 2 years ago

Apparently, a conversion of the read data from utf16 to utf8 fixes the problem. I do not fully understand, however, why upon writing the variable to file, its data type is internally changed from MAT_T_UTF8 to MAT_T_UINT16.
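The UTF-16 to UTF-8 conversion mentioned above can be sketched without any library support. This is a minimal sketch, not matio code: `append_utf8` and `utf16_units_to_utf8` are hypothetical helper names, and the sketch assumes the data read back are 16-bit code units in native byte order, which is consistent with the 116, 0, 101, 0 byte dump above on a little-endian machine.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append one Unicode code point to out as UTF-8.
static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: decode n UTF-16 code units into a UTF-8 std::string.
// Surrogate pairs are combined; an unpaired surrogate becomes U+FFFD.
std::string utf16_units_to_utf8(const std::uint16_t* units, std::size_t n)
{
    std::string out;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t cp = units[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            // High surrogate followed by low surrogate: one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired surrogate, replaced rather than copied
        }
        append_utf8(out, cp);
    }
    return out;
}
```

For a 1 x N character array whose data type is reported as MAT_T_UINT16, a call along the lines of `utf16_units_to_utf8(static_cast<const std::uint16_t*>(mat_var->data), mat_var->dims[1])` would presumably replace the direct `strng.assign(...)` shown earlier; that usage is an assumption and has not been checked against matio here.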

tbeu commented 2 years ago

Just some early remarks:

AndrePatri commented 2 years ago

Thanks for the clarification @tbeu. My file was saved using the MAT_FT_MAT73 file version.

tbeu commented 2 years ago

Here's how MATLAB treats the same case (with MAT files written by MATLAB R2022a and dumped by the latest matio):

$ matdump -d test-utf8-m-v4.mat a
      Name: a
      Rank: 2
Dimensions: 1 x 4
Class Type: Character Array
 Data Type: 8-bit, unsigned integer
{
test
}

$ matdump -d test-utf8-m-v6.mat a
      Name: a
      Rank: 2
Dimensions: 1 x 4
Class Type: Character Array
 Data Type: 16-bit, unsigned integer
{
test
}

$ matdump -d test-utf8-m-v7.mat a
      Name: a
      Rank: 2
Dimensions: 1 x 4
Class Type: Character Array
 Data Type: Unicode UTF-8 Encoded Character Data
{
test
}

$ matdump -d test-utf8-m-v7.3.mat a
      Name: a
      Rank: 2
Dimensions: 1 x 4
Class Type: Character Array
 Data Type: 16-bit, unsigned integer
{
test
}

Thus, there is no UTF-8 support for character arrays saved by the -v7.3 option in MATLAB. If you use the -v7 option of the MATLAB save command or MAT_FT_MAT5 in matio the character array is not converted to UINT16 but kept as UTF-8 (if it fits in the Basic Multilingual Plane).

AndrePatri commented 2 years ago

This clarifies a lot. Thanks @tbeu. So, currently my code checks whether the character array is UINT16 or UINT8/UTF8. If it is UINT16, I then apply a conversion method which returns a std::string. My question is: do I run the risk of losing information during the conversion from UTF8 to UTF16? The method I am using for the conversion is the following:

std::string utf16_utf8(std::u16string source)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert; 

    std::string dest = convert.to_bytes(source);

    return dest;
}
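
On the question of information loss: UTF-8 and UTF-16 both encode every Unicode code point, so converting well-formed text between them is lossless. The sketch below checks this with a round trip, using the same `std::wstring_convert` facility as the helper above. The function names are hypothetical. Note that `std::wstring_convert` and `<codecvt>` are deprecated since C++17, so compilers may warn, and that `from_bytes` throws `std::range_error` on malformed input.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Same direction as the helper in the thread: UTF-16 code units to UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& source)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(source);
}

// Opposite direction: UTF-8 bytes to UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& source)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(source);
}

// True when the text survives UTF-8 -> UTF-16 -> UTF-8 unchanged.
bool round_trips(const std::string& utf8)
{
    return utf16_to_utf8(utf8_to_utf16(utf8)) == utf8;
}
```

A check such as `round_trips("Gen Imp Drop (\xCE\xA9)")` exercises a non-ASCII character. Characters outside the Basic Multilingual Plane occupy two UTF-16 code units (a surrogate pair) but still convert back to the same UTF-8 bytes.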

divyesh19399 commented 1 year ago

I assume something is still going wrong with Unicode characters when writing them to .mat files using MAT_FT_MAT73.

For example, when I try to write the Omega symbol to a .mat file, the symbol does not show up in the file.

Creating a file:

mat_t *pmat2 = Mat_CreateVer("test_73.mat", NULL, MAT_FT_MAT73);

Creating a matvar_t variable:

static const QString qstr = QString(QChar(0x03A9));
std::string dataString = ("Gen Imp Drop (" + qstr + ")").toStdString();
size_t sz[2] = { 1, dataString.size() };
matvar_t *matString = Mat_VarCreate("charArray", MAT_C_CHAR, MAT_T_UTF8, 2, sz, (void *)dataString.c_str(), 0);

And the output when I print this variable using Mat_VarPrint() is as follows:

      Name: charArray
      Rank: 2
Dimensions: 1 x 17
Class Type: Character Array
 Data Type: 16-bit, unsigned integer
{
Gen Imp Drop ()
}

When I write the same string with the file version set to MAT_FT_MAT5, I get the correct output. Can someone please help me debug this? Is there something I am missing here? Thanks in advance
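
One thing the printed output makes visible: the dimensions 1 x 17 match the UTF-8 byte length of the string, not its character count, because Ω (U+03A9) takes two bytes in UTF-8. Whether this mismatch is what drops the symbol in MAT_FT_MAT73 files is not established in this thread; the sketch below only shows the two counts side by side. `utf8_code_points` is a hypothetical helper, not a matio function, and it assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in well-formed UTF-8
// by skipping continuation bytes, which have the bit pattern 10xxxxxx.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n; // lead byte or ASCII byte starts a new code point
        }
    }
    return n;
}
```

For the string "Gen Imp Drop (Ω)" encoded as UTF-8, `size()` reports 17 bytes while `utf8_code_points` reports 16 characters.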