swig / swig

SWIG is a software development tool that connects programs written in C and C++ with a variety of high-level programming languages.
http://www.swig.org

SWIG_AsWCharPtrAndSize does not work correctly on Windows with code point > 2 byte #2909

Open Daniel-da6a opened 1 month ago

Daniel-da6a commented 1 month ago

Current state on the master branch / SWIG v4.2.1:

For wide strings the fragment SWIG_AsWCharPtrAndSize (Lib/python/pywstrings.swg) is used. This function does not return the correct wchar_t array on Windows if the original string contains code points that need more than two bytes in UTF-16, i.e. code points above U+FFFF, which are stored as a surrogate pair of two wchar_t elements.

For example, the Python string "🤠ABC" is returned as "🤠AB".

This is caused by the use of PyUnicode_GetSize in combination with PyUnicode_AsWideChar and the fact that wchar_t is only two bytes on Windows.

PyUnicode_GetSize is used to obtain the size of the string; for the example above this is 4, the number of code points. The function PyUnicode_AsWideChar copies at most size wchar_t elements. This is where the mismatch happens: since wchar_t is only 2 bytes on Windows, 🤠 occupies two wchar_t elements, so the number of wchar_t elements (5) is not the same as the number of code points (4). As a result, not all of the characters are read.
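The mismatch can be reproduced without SWIG. The following is a minimal Python sketch, not the SWIG code itself: the helper names `utf16_code_units` and `copy_at_most` are hypothetical, and the copy is only a simulation of reading a limited number of 16-bit wchar_t elements, as would happen on Windows.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def copy_at_most(text: str, size: int) -> str:
    """Simulate copying at most `size` 16-bit wchar_t elements of text.

    This sketch assumes the cut does not fall inside a surrogate pair.
    """
    raw = text.encode("utf-16-le")[: 2 * size]
    return raw.decode("utf-16-le")


s = "\U0001F920ABC"                        # "🤠ABC"
code_points = len(s)                       # 4
code_units = utf16_code_units(s)           # 5: 🤠 is a surrogate pair
truncated = copy_at_most(s, code_points)   # the trailing "C" is lost

print(code_points, code_units, truncated)
```

Using the code point count as the element limit drops one wchar_t element for every character above U+FFFF in the string.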

https://github.com/swig/swig/blob/7c2b245ceafb49552e559f8056c2618e84aad0b7/Lib/python/pywstrings.swg#L31C1-L44C74

The use of PyUnicode_AsWideCharString might be a solution. Alternatively, PyUnicode_AsWideChar(SWIGPY_UNICODE_ARG(obj), NULL, 0) could be used to obtain the correct number of wchar_t elements on Windows (with a NULL buffer, the returned size includes the terminating null).
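The second suggestion can be tried from Python through ctypes. This is only a sketch under assumptions: it relies on `ctypes.pythonapi`, which is CPython-specific, and the helper names `required_wchar_count` and `to_wide_buffer` are hypothetical. It is not the proposed patch for pywstrings.swg, just an illustration of the query-then-copy pattern.

```python
import ctypes

# Bind the CPython C API function (CPython only).
as_wide_char = ctypes.pythonapi.PyUnicode_AsWideChar
as_wide_char.argtypes = [ctypes.py_object, ctypes.c_wchar_p, ctypes.c_ssize_t]
as_wide_char.restype = ctypes.c_ssize_t


def required_wchar_count(text: str) -> int:
    """wchar_t elements needed for text, excluding the terminating null."""
    # With a NULL buffer the call returns the required size,
    # including the terminating null, so subtract one.
    return as_wide_char(text, None, 0) - 1


def to_wide_buffer(text: str):
    """Copy text into a wchar_t buffer sized from the query above."""
    n = required_wchar_count(text)
    buf = ctypes.create_unicode_buffer(n + 1)  # room for the null
    copied = as_wide_char(text, buf, n)
    return buf, copied


s = "\U0001F920ABC"  # "🤠ABC"
n = required_wchar_count(s)
buf, copied = to_wide_buffer(s)

# n depends on the platform's wchar_t width: 2-byte wchar_t (Windows)
# needs one element more than the code point count for this string.
print(ctypes.sizeof(ctypes.c_wchar), len(s), n, copied)
```

Sizing the buffer from the function's own answer avoids assuming that one code point always fits in one wchar_t.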