tbeu / matio

MATLAB MAT File I/O Library
https://matio.sourceforge.io
BSD 2-Clause "Simplified" License
330 stars 97 forks source link

Using strlen to detect the number of bytes breaks the usage of char arrays #165

Closed S-Dafarra closed 3 years ago

S-Dafarra commented 3 years ago

In https://github.com/tbeu/matio/commit/67a922f83467d694fa6e9759ac9a30b6ab82aec4 the following two lines have been added https://github.com/tbeu/matio/blob/67a922f83467d694fa6e9759ac9a30b6ab82aec4/src/mat.c#L1116-L1118

The use of strlen needs the data to have a \0 char at the end. This is ok when the input data is a C string, but it causes segfaults when the input type is a char array.

S-Dafarra commented 3 years ago

It also causes segfaults in case the input data pointer is nullptr. This is acceptable for every other data type, as long there is a zero dimension. With MAT_C_CHAR and MAT_T_UTF8 instead, it will try to access data anyway.

tbeu commented 3 years ago

The use of strlen needs the data to have a \0 char at the end. This is ok when the input data is a C string, but it causes segfaults when the input type is a char array.

That's why I set the trailing zero also for char array here: https://github.com/tbeu/matio/blob/84c8f3d67ef7493b3541ffe16fdfe007a45f3165/test/test_mat.c#L1302-L1303

Do you see any other option how to get the string length (which can be different from actual dimension)?

It also causes segfaults in case the input data pointer is nullptr.

Sure, this needs to be avoided.

tbeu commented 3 years ago

It also causes segfaults in case the input data pointer is nullptr.

Addressed by fa770b2.

S-Dafarra commented 3 years ago

Addressed by fa770b2.

Thank you for the quick fix!

Do you see any other option how to get the string length (which can be different from actual dimension)?

I have been thinking about it a bit. If I understood correctly the issue reported in https://github.com/tbeu/matio/issues/164, the problem is related to strings (and also arrays) containing UTF8 chars. Since the UTF8 chars require additional bytes to be stored, it is not true that the number of bytes is equal to the size of the string. In this case, the use of strlen fixed the issue since it counts the number of bytes correctly.

Ultimately, the problem seems to be on the correct counting of bytes then, both in the case of strings and arrays, but strlen requires the \0 to be present.

From https://github.com/tbeu/matio/issues/164 it also seems that the dimensions need to be set to the actual number of chars, not to the number of bytes. So, both in the case of strings and arrays, the user is supposed to provide the number of elements he/she wants to store.

I wonder if information this can be exploited in this case, i.e. mimicking how strlen works. In practice, it would be necessary to iterate over the number of elements, detect if the element is a plain char or a special UTF character, and count the bytes accordingly.

I was thinking maybe at something similar to https://stackoverflow.com/a/1031773, but with a different while loop condition (since while(*bytes) would require again the terminal \0) :thinking:

tbeu commented 3 years ago

Fixed by 4e6cc2e. I guess I need to a release a new version soon.

S-Dafarra commented 3 years ago

Awesome, thanks!