I don't think `MultiByteToWideChar` and `WideCharToMultiByte` need to add 1 to `num_bytes`

DavisVaughan commented 11 months ago

I believe these +1 additions to num_bytes are not required, and are actually resulting in strings that are too long: https://rodrigocfd.github.io/winsafe/src/winsafe/kernel/funcs.rs.html#1483 https://rodrigocfd.github.io/winsafe/src/winsafe/kernel/funcs.rs.html#1725

MultiByteToWideChar() takes a multi_byte_str: &[u8] and gets its size with multi_byte_str.len()
WideCharToMultiByte() takes a wide_char_str: &[u16] and gets its size with wide_char_str.len()

So for the inputs to those functions, the nul terminator should already be included in the length if there is one.

And then for the output, I'm looking at: https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-multibytetowidechar#parameters https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte#parameters

For MultiByteToWideChar:

For cbMultiByte (only included the "positive integer" case):

Size, in bytes, of the string indicated by the lpMultiByteStr parameter. ... If this parameter is set to a positive integer, the function processes exactly the specified number of bytes. If the provided size does not include a terminating null character, the resulting Unicode string is not null-terminated, and the returned length does not include this character.

For cchWideChar:

Size, in characters, of the buffer indicated by lpWideCharStr. If this value is 0, the function returns the required buffer size, in characters, including any terminating null character, and makes no use of the lpWideCharStr buffer.

For WideCharToMultiByte:

For cchWideChar (only included the "positive integer" case):

Size, in characters, of the string indicated by lpWideCharStr. ... If this parameter is set to a positive integer, the function processes exactly the specified number of characters. If the provided size does not include a terminating null character, the resulting character string is not null-terminated, and the returned length does not include this character.

For cbMultiByte:

Size, in bytes, of the buffer indicated by lpMultiByteStr. If this value is 0, the function returns the required buffer size, in bytes, including any terminating null character, and makes no use of the lpMultiByteStr buffer.

This documentation makes me think that the size returned from these functions when passing a null buffer already includes the nul terminator if one is present in the string, so adding 1 ends up adding extra space.

Indeed that seems to be the case in my testing as well. i.e. roundtripping an x containing something like "hi there" through this helper will result in a string with 2 extra \0 on the end. I manually removed the +1 and things start looking as I'd expect.

pub fn system_to_utf8(x: &[u8], code_page: CP) -> anyhow::Result<String> {
    let flags = MBC::NoValue;

    let x = MultiByteToWideChar(code_page, flags, x)?;

    let flags = unsafe { WC::from_raw(0) };
    let default_char = None;
    let used_default_char = None;

    let x = WideCharToMultiByte(CP::UTF8, flags, &x, default_char, used_default_char)?;

    let x = String::from_utf8(x)?;

    Ok(x)
}

rodrigocfd commented 11 months ago

I'm aware of the docs, but since these functions are at the core of the library, I was just over precautious of that extra null room... and that's because Win32 functions don't follow a standard: some of them include null in the computation, while others don't. So I chose not to take any chances.

Also: what happens if the input buffer doesn't have a terminating null?

DavisVaughan commented 11 months ago

Win32 functions don't follow a standard: some of them include null in the computation, while others don't

IIUC, the null behavior here is very well documented in the docs I copied in above. I definitely understand that the Windows API is the wild west, but I do think the behavior is stable here if I am reading it correctly!

FWIW, we've had to use these functions quite a bit when working on Windows with R and we've never had issues with needing the extra +1 https://github.com/search?q=org%3Ar-lib%20MultiByteToWideChar&type=code

Our main IDE, RStudio, also uses them this way: https://github.com/rstudio/rstudio/blob/63be0dd74313574df1fc69e763c9824f232b521e/src/cpp/core/StringUtils.cpp#L349-L384

The C++ library, <date>, which was accepted into C++20 also uses it: https://github.com/HowardHinnant/date/blob/cc4685a21e4a4fdae707ad1233c61bbaff241f93/src/tz.cpp#L201-L228

what happens if the input buffer doesn't have a terminating null

That is also well documented in what i copied in from above

If the provided size does not include a terminating null character, the resulting Unicode string is not null-terminated, and the returned length does not include this character

And in fact that is often the case for the RStudio IDE. On Windows we can get a crazy string like <system><UTF8marker><UTF8><UTF8markerend><system> where:

<system> is a chunk of the string that is system coded
<UTF8marker><UTF8><UTF8markerend> defines boundaries where <UTF8> is a chunk of the string that can be assumed to already be UTF8 encoded.

(Yes, it is crazy)

We end up having to take a string slice to extract out the <system> bits and use a helper like the one above to convert them to UTF8, and in that case it won't be nul terminated, but that's totally fine. The len() method from Rust tells the windows functions how much of the string to look at, and the result won't be nul terminated either (again, a good thing for us since this is just a piece of the whole string)

rodrigocfd commented 11 months ago

Very interesting stuff, thank you. I removed that +1 and I couldn't find any problem, indeed.

DavisVaughan commented 11 months ago

I really appreciate that! Thanks!

rodrigocfd / winsafe

I don't think `MultiByteToWideChar` and `WideCharToMultiByte` need to add 1 to `num_bytes` #111