Closed AndrePatri closed 1 year ago
Apparently, a conversion of the read data from utf16 to utf8 fixes the problem. I do not fully understand, however, why upon writing the variable to file, its data type is internally changed from MAT_T_UTF8 to MAT_T_UINT16.
Just some early remarks:
Thanks for the clarification @tbeu. My file was saved using the MAT_FT_MAT73 file version.
Here's how MATLAB treats the same case (with MAT files written by MATLAB R2022a and dumped by the latest matio):
$ matdump -d test-utf8-m-v4.mat a
Name: a
Rank: 2
Dimensions: 1 x 4
Class Type: Character Array
Data Type: 8-bit, unsigned integer
{
test
}
$ matdump -d test-utf8-m-v6.mat a
Name: a
Rank: 2
Dimensions: 1 x 4
Class Type: Character Array
Data Type: 16-bit, unsigned integer
{
test
}
$ matdump -d test-utf8-m-v7.mat a
Name: a
Rank: 2
Dimensions: 1 x 4
Class Type: Character Array
Data Type: Unicode UTF-8 Encoded Character Data
{
test
}
$ matdump -d test-utf8-m-v7.3.mat a
Name: a
Rank: 2
Dimensions: 1 x 4
Class Type: Character Array
Data Type: 16-bit, unsigned integer
{
test
}
Thus, there is no UTF-8 support for character arrays saved by the -v7.3 option in MATLAB. If you use the -v7 option of the MATLAB save command or MAT_FT_MAT5 in matio the character array is not converted to UINT16 but kept as UTF-8 (if it fits in the Basic Multilingual Plane).
This clarifies a lot. Thanks @tbeu. So, currently my code checks whether the character array is UINT16 or UINT8/UTF8. If it is UINT16, I apply a conversion method that returns a std::string. My question is: do I run the risk of losing information during the conversion from UTF8 to UTF16? The method I am using for the conversion is the following:
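#include <codecvt>
#include <locale>
#include <string>

// Converts UTF-16 to UTF-8. Note: std::wstring_convert is deprecated since C++17.
std::string utf16_utf8(const std::u16string& source)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(source);
}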
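On the question of information loss: well-formed UTF-16 and UTF-8 both encode every Unicode code point, so converting between them is lossless; only unpaired surrogates in the UTF-16 input cannot be represented. As a minimal sketch that avoids the deprecated std::wstring_convert, the conversion can also be written by hand. The helper name `utf16_to_utf8` is hypothetical (it is not part of matio), and replacing unpaired surrogates with U+FFFD is an assumption about the desired error handling.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a matio function): converts `count` UTF-16 code
// units to a UTF-8 std::string. Unpaired surrogates are replaced by U+FFFD.
std::string utf16_to_utf8(const std::uint16_t* data, std::size_t count)
{
    std::string out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t cp = data[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 < count && data[i + 1] >= 0xDC00 && data[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (data[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // stray low surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Taking a pointer and a count (rather than a std::u16string) is meant to match a raw buffer of 16-bit elements plus an element count, which is presumably what a variable read back as 16-bit unsigned integers provides.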
I assume there is still something going wrong with Unicode characters when writing them to .mat files using MAT_FT_MAT73.
For example, when I try to write the Omega symbol to a .mat file, the symbol does not show up in the file.
------------------------Creating a file----------------------------------------------
mat_t *pmat2 = Mat_CreateVer("test_73.mat", NULL, MAT_FT_MAT73);
--------------------------Creating a matvar_t variable-------------------------
static const QString qstr = QString(QChar(0x03A9));
std::string dataString = ("Gen Imp Drop (" + qstr + ")").toStdString();
size_t sz[2] = { 1, dataString.size() };
matvar_t *matString = Mat_VarCreate("charArray", MAT_C_CHAR, MAT_T_UTF8, 2, sz, (void *)dataString.c_str(), 0);
And the output when I print this variable using Mat_VarPrint() is as follows:
Name: charArray
Rank: 2
Dimensions: 1 x 17
Class Type: Character Array
Data Type: 16-bit, unsigned integer
{
Gen Imp Drop ()
}
When I write the same string with the file version set to MAT_FT_MAT5, I get the correct output. Can someone please help me debug this? Is there something I am missing here? Thanks in advance
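One detail in the printed output is worth checking: the dimensions are 1 x 17, which is the UTF-8 byte count of the string, whereas the string has only 16 characters because Omega (U+03A9) takes two bytes in UTF-8 but a single 16-bit code unit. Whether this mismatch is what causes the missing symbol is not confirmed anywhere in this thread; the sketch below only makes the two counts visible. The function names are hypothetical, and the counting assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a well-formed UTF-8 string by skipping
// continuation bytes, which all have the bit pattern 10xxxxxx.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// The label from the example, with Omega (U+03A9) spelled as UTF-8 bytes CE A9.
std::string example_label()
{
    return "Gen Imp Drop (\xCE\xA9)";
}
```

Here `example_label().size()` is the byte count that ends up in `sz[1]` in the snippet above, while `utf8_code_points(example_label())` is the number of characters a reader of the file would expect.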
Hi, I am experiencing problems trying to read a previously written MAT_C_CHAR variable, with UTF8 encoding.
Please note that I am using the latest version of MatIO available at the time of writing this issue.
I have a method which takes as input a std::string& text. Suppose the value assigned to text is "test".
The method creates the associated variable in the following way using MatIO functions:
Somewhere else in the code the variable is written to file. After loading the file in MATLAB, the variable is displayed correctly. If I load the written file using MatIO and print the variable using MatIO's internal print function, I get the following output:
As you can see, the variable is printed without problems, but the data type has somehow changed to MAT_T_UINT16.
However, when I try the following on the loaded variable:
and I try to print it to cout, I get only "te".
If I try to print the raw data elements using the following code, I get:
Apparently, zeros are inserted for some reason between characters and, as a consequence, I am not able to reach the end of the word.
If I try the same procedure with the variable before writing, I am able to get the full word without problems. This suggests that something nasty might be happening upon variable reading (I used the Mat_VarRead method).
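The zeros between characters are what 16-bit data looks like when it is read one byte at a time. The sketch below reproduces the symptom without MatIO, assuming the string was built from the raw data pointer using the element count (4) as its length; the function names are hypothetical, and the bytes are laid out explicitly as little-endian so the result does not depend on the host.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Lays out UTF-16 code units as little-endian bytes (low byte first), which
// is how a buffer of 16-bit values appears in memory on x86 and most ARM systems.
std::vector<char> utf16le_bytes(const std::u16string& s)
{
    std::vector<char> bytes;
    for (char16_t u : s) {
        bytes.push_back(static_cast<char>(u & 0xFF));
        bytes.push_back(static_cast<char>(u >> 8));
    }
    return bytes;
}

// Mimics the mistake: treat the first `n` bytes as if they were 8-bit chars.
// `n` must not exceed bytes.size().
std::string misread_as_chars(const std::vector<char>& bytes, std::size_t n)
{
    return std::string(bytes.data(), n);
}
```

For `u"test"`, the byte buffer is `t 0 e 0 s 0 t 0`, so taking 4 bytes yields `t`, NUL, `e`, NUL, which shows up on a terminal as "te".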
Can someone please help me debug this? Is there something I am missing to properly assign the raw data to a std::string? Thanks in advance