Closed DavisVaughan closed 11 months ago
I'm aware of the docs, but since these functions are at the core of the library, I was just over precautious of that extra null room... and that's because Win32 functions don't follow a standard: some of them include null in the computation, while others don't. So I chose not to take any chances.
Also: what happens if the input buffer doesn't have a terminating null?
Win32 functions don't follow a standard: some of them include null in the computation, while others don't
IIUC, the null behavior here is very well documented in the docs I copied in above. I definitely understand that the Windows API is the wild west, but I do think the behavior is stable here if I am reading it correctly!
FWIW, we've had to use these functions quite a bit when working on Windows with R and we've never had issues with needing the extra +1 https://github.com/search?q=org%3Ar-lib%20MultiByteToWideChar&type=code
Our main IDE, RStudio, also uses them this way: https://github.com/rstudio/rstudio/blob/63be0dd74313574df1fc69e763c9824f232b521e/src/cpp/core/StringUtils.cpp#L349-L384
The C++ library, <date>
, which was accepted into C++20 also uses it:
https://github.com/HowardHinnant/date/blob/cc4685a21e4a4fdae707ad1233c61bbaff241f93/src/tz.cpp#L201-L228
what happens if the input buffer doesn't have a terminating null
That is also well documented in what i copied in from above
If the provided size does not include a terminating null character, the resulting Unicode string is not null-terminated, and the returned length does not include this character
And in fact that is often the case for the RStudio IDE. On Windows we can get a crazy string like <system><UTF8marker><UTF8><UTF8markerend><system>
where:
<system>
is a chunk of the string that is system coded<UTF8marker><UTF8><UTF8markerend>
defines boundaries where <UTF8>
is a chunk of the string that can be assumed to already be UTF8 encoded.(Yes, it is crazy)
We end up having to take a string slice to extract out the <system>
bits and use a helper like the one above to convert them to UTF8, and in that case it won't be nul terminated, but that's totally fine. The len()
method from Rust tells the windows functions how much of the string to look at, and the result won't be nul terminated either (again, a good thing for us since this is just a piece of the whole string)
Very interesting stuff, thank you. I removed that +1
and I couldn't find any problem, indeed.
I really appreciate that! Thanks!
I believe these
+1
additions tonum_bytes
are not required, and are actually resulting in strings that are too long: https://rodrigocfd.github.io/winsafe/src/winsafe/kernel/funcs.rs.html#1483 https://rodrigocfd.github.io/winsafe/src/winsafe/kernel/funcs.rs.html#1725MultiByteToWideChar()
takes amulti_byte_str: &[u8]
and gets its size withmulti_byte_str.len()
WideCharToMultiByte()
takes awide_char_str: &[u16]
and gets its size withwide_char_str.len()
So for the inputs to those functions, the nul terminator should already be included in the length if there is one.
And then for the output, I'm looking at: https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-multibytetowidechar#parameters https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte#parameters
For MultiByteToWideChar:
For cbMultiByte (only included the "positive integer" case):
For cchWideChar:
For WideCharToMultiByte:
For cchWideChar (only included the "positive integer" case):
For cbMultiByte:
This documentation makes me think that the size returned from these functions when passing a null buffer already includes the nul terminator if one is present in the string, so adding 1 ends up adding extra space.
Indeed that seems to be the case in my testing as well. i.e. roundtripping an
x
containing something like"hi there"
through this helper will result in a string with 2 extra\0
on the end. I manually removed the+1
and things start looking as I'd expect.