odin-lang / Odin

Odin Programming Language
https://odin-lang.org
BSD 3-Clause "New" or "Revised" License
6.95k stars 615 forks source link

`wstring_to_utf8` cuts off string at interior null character #1952

Open IanLilleyT opened 2 years ago

IanLilleyT commented 2 years ago

Odin's wstring_to_utf8 searches the string for the first occurrence of 0 and chops off everything after it, which prevents certain kinds of utf8 strings from being formed. In utf8, null aka 0x00 is allowed. There's also some waste from doing another loop over the entire string.

https://github.com/odin-lang/Odin/blob/5a9422b6bcda8ed7fe3f0e91db916764662397e5/core/sys/windows/util.odin#L87-L92

From WideCharToMultiByte's perspective there's no special behavior when it sees a null character as long as you pass an explicit size

WideCharToMultiByte takes a LPCWSTR

An LPCWSTR is a 32-bit pointer to a constant string of 16-bit Unicode characters, which MAY be null-terminated

LPCWSTR (maybe null-terminated) and LPCTSTR (definitely null-terminated) are both wstring so wstring ought to have the semantics of the more permissive of the two, i.e. allowing interior null characters.

So far I've run into at least one problem with removing the null check in the code snippet above: windows get_current_directory allocates a buffer that includes space for a null terminator (mimicking the behavior of GetCurrentDirectoryW), so the slice returned by wstring_to_utf8 is going to end with a null. If you do something like filepath.join({os.get_current_directory, "my_file.txt"}) it will give back a path with a nul between the cwd and the file name. The fix is to do a second allocation in get_current_directory that excludes the null terminator, or do one allocation and return a slice with the null chopped off. I'm not sure how many other places will run into this problem.

Test code:

package main

import "core:sys/windows"
import "core:fmt"

main :: proc() {
    str1 := []windows.WCHAR{'0', 0, '1', '2', '3', '4', '5'}
    str2, _ := windows.wstring_to_utf8(cast(windows.wstring)raw_data(str1), len(str1))
    fmt.println(str2)
    // expected: 012345
    // actual: 0
}
gingerBill commented 2 years ago

So this is not necessarily a bug because the second parameter is the capacity of the buffer, rather than the true length. Your required behaviour is probably better suited with using the following:

buf := make([]byte, 2*len(wstr))
n := utf16.decode_to_utf8(buf, wstr)
str := string(buf[:n])
IanLilleyT commented 2 years ago

I see. One could argue it's a bug that wstring_to_utf8 returns a different utf8 string than expected for a valid wstring that has interior null characters.

I guess I'll rephrase it as a feature request😄

Make wstring_to_utf8 take a slice instead of a pointer + backing buffer length

wstring_to_utf8 is only called from two places: utf16_to_utf8 and add_user_profile. utf16_to_utf8 is effectively passing a slice already. I haven't looked too deeply into add_user_profile but it could count the characters up until null by itself instead of having wstring_to_utf8 do it.

The reason why this and https://github.com/odin-lang/Odin/pull/1951 are useful to me is I'm using the log tracking allocator and want to get the alloc and free numbers to match up exactly where possible.

IanLilleyT commented 2 years ago

Oh wait, wstring is a multipointer, so the function would have to be called something else... I don't know how much confusion would be caused by keeping the name and taking a []WCHAR instead of a wstring

Or keep the function name the same but force N to be the true length

flysand7 commented 4 months ago

works-as-intended?