golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.98k stars 17.67k forks source link

proposal: x/sys/windows/mkwinsyscall: support UTF16 functions not ending with W #51786

Open qmuntal opened 2 years ago

qmuntal commented 2 years ago

Background

Microsoft normally provides two parallel sets of APIs, one for ANSI strings, whose function names end with A, and the other for Unicode strings, whose function names end with W (source).

This pattern is well supported in x/sys/windows/mkwinsyscall: function parameters of type string are transformed using syscall.UTF16PtrFromString if the function names ends with W, and using syscall.BytePtrFromString otherwise:

https://github.com/golang/sys/blob/2edf467146b5fc89e484991587e3032c8421ae8c/windows/mkwinsyscall/mkwinsyscall.go#L616-L622

Problem

Newer libraries that were created after Windows introduced Unicode tend to provide just the UTF16 API set without the W-ending convention. In my particular case, I'm creating bindings for the BCrypt library, in which all functions accepting strings expect them to be UTF16 encoded even when the function name does not end with W.

This forces me to define the //sys functions using *uint16 instead of string and to do the conversion outside the autogenerated code.

Request

Add a knob to x/sys/windows/mkwinsyscall to decide if a string parameter should be transformed to ASCII or Unicode.

Two complementary solutions come to mind:

The first solution will probably be enough to cover all cases, as new Windows APIs are Unicode by default and only legacy APIs expect ASCII strings, in which case they already provide the both API sets.

cc @zx2c4

ianlancetaylor commented 2 years ago

CC @alexbrainman

zx2c4 commented 2 years ago

What if we just unconditionally change it to default to utf16? Or perhaps always require either -utf16 or -ascii, but not none. I don't think Go promises any stability w.r.t. autogenerators, right?

The reason I suggest this is that I'd hate for the behavior of mkwinsyscall to become, "it does the old bad thing unless you remember to pass this special switch." I'd rather it be "it does the new good thing, unless you tell it to do the old bad thing" or "you have to tell it which thing you want to do, so that it won't do the old bad thing without you telling it to."

I suspect changing it to default to utf16 would make most sense, considering most if not all functions are already wchar.

alexbrainman commented 2 years ago

@qmuntal

This forces me to define the //sys functions using *uint16 instead of string and to do the conversion outside the autogenerated code.

I don't see using *uint16 instead of string as a bad thing. Users of your package would have to explicitly call windows.UTF16PtrFromString, but they could save some memory allocations because of that. I don't know if memory saving is important when using BCrypt library, but I always try not to hide memory allocations when creating low level libraries.

I suggest you copy x/sys/windows/mkwinsyscall into your package and adjust it as required (probably 1 or 2 lines change), and use it for your case. If you discover more use cases for your proposal, I am happy to reconsider.

@zx2c4

What if we just unconditionally change it to default to utf16?

That will break some users code. I suggested a reasonable workaround for @qmuntal problem. Why breaking users code is better than my workaround?

Or perhaps always require either -utf16 or -ascii, but not none

That will definitely break users code. And your proposed flags will force current users split their code into 2 different source files: ascii version and utf16 version.

I am not convinced this is a problem that we need to solve. Yet.

But feel free to ignore my opinion.

Alex

qmuntal commented 2 years ago

@alexbrainman

I don't see using *uint16 instead of string as a bad thing. Users of your package would have to explicitly call windows.UTF16PtrFromString, but they could save some memory allocations because of that.

windows.UTF16PtrFromString returns a pointer to the UTF16 string and an error, so we are forcing users to check that error before calling the autogenerated function, whereas if the autogenerated function would do the conversion this boilerplate code could be eliminated.

I don't know if memory saving is important when using BCrypt library, but I always try not to hide memory allocations when creating low level libraries.

I agree with your statement about not hiding allocations, but this proposal don't preclude //sys definitions to use *uint16 instead of string, it's just about making string params to be handled correctly regardless of the function naming convention. IMO we are not adding a new way to hide allocations, as functions accepting string params whose name end with W or A are already hiding allocations.

Having low memory allocation is primordial when using BCrypt library, but for my use case functions accepting UTF16 strings are not supposed to be called multiple times with the same string, so it makes no sense to cache the windows.UTF16PtrFromString result in order to reduce allocations. Passing the string directly to the autogenerated function would not have any performance impact, but would be more ergonomic.

I suggest you copy x/sys/windows/mkwinsyscall into your package and adjust it as required (probably 1 or 2 lines change), and use it for your case. If you discover more use cases for your proposal, I am happy to reconsider.

If you think this proposal covers a niche use case, I'm fine patching mkwinsyscall. But, as @zx2c4 has also mentioned, UTF16 strings are the standard for new Windows APIs, and it would be a pity to lose the capacity of passing string arguments instead of *uint16 to new Windows function.

alexbrainman commented 2 years ago

whereas if the autogenerated function would do the conversion this boilerplate code could be eliminated.

Yes. If you don't care about allocations, then you can save your users from writing boilerplate code.

this proposal don't preclude //sys definitions to use *uint16 instead of string, it's just about making string params to be handled correctly regardless of the function naming convention. IMO we are not adding a new way to hide allocations, as functions accepting string params whose name end with W or A are already hiding allocations.

Agree.

Having low memory allocation is primordial when using BCrypt library, but for my use case functions accepting UTF16 strings are not supposed to be called multiple times with the same string, so it makes no sense to cache the windows.UTF16PtrFromString result in order to reduce allocations. Passing the string directly to the autogenerated function would not have any performance impact, but would be more ergonomic.

Thanks for explaining.

If you think this proposal covers a niche use case, ...

I do think that. Maybe my opinion will change in the future.

... I'm fine patching mkwinsyscall. But, as @zx2c4 has also mentioned, UTF16 strings are the standard for new Windows APIs, and it would be a pity to lose the capacity of passing string arguments instead of *uint16 to new Windows function.

You would have to copy mkwinsyscall program and make small changes to it. If you find this troublesome, I am happy to reconsider. I did exactly that in many of my personal projects. I do not regret making that decision after many years.

I also happy to reconsider if other projects find this functionality useful.

I am just trying to defer breaking users code till later.

Alex

gopherbot commented 2 years ago

Change https://go.dev/cl/425054 mentions this issue: syscall: rely on utf16.AppendRune