Open rsc opened 13 years ago
On Windows, wchar_t is ubiquitous. Windows Unicode-enabled API functions use UTF-16 (wide character) encoding, the native Unicode encoding on Windows. See Windows Data Types for Strings: http://msdn.microsoft.com/en-us/library/windows/desktop/dd374131.aspx
Comment 9 by Edward.Casey.Adams:
Perhaps Cgo users should link to libiconv (http://www.gnu.org/software/libiconv/) instead? The problem is that neither the width nor the Unicode encoding of wchar_t is well defined. (See http://en.wikipedia.org/wiki/Wide_character#C.2FC.2B.2B) For example, on Windows/Visual Studio platforms, wchar_t is 16 bits wide and encoded as UTF-16LE, whereas on most Linux distros wchar_t is 32 bits wide, most Unicode text is UTF-8 stored in regular chars, and most anything else won't be little-endian. Thus adding C.WcharString() adds ambiguity.
I once made this package: https://github.com/GeertJohan/cgo.wchar It works well, but requires libiconv. I have never tested it on anything except Linux.
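To make the ambiguity concrete, here is a sketch of decoding a wchar_t buffer in pure Go, which only works if you already know the width and endianness up front. It assumes 32-bit little-endian code points (the typical glibc/Linux layout); decodeWchar32LE is a hypothetical helper for illustration, not part of cgo.wchar:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeWchar32LE interprets raw bytes copied from a C wchar_t[]
// as 32-bit little-endian code points and returns the equivalent
// Go string. On a platform with 16-bit wchar_t (Windows) or a
// big-endian layout this would produce garbage -- which is exactly
// the ambiguity being discussed.
func decodeWchar32LE(b []byte) string {
	runes := make([]rune, 0, len(b)/4)
	for i := 0; i+4 <= len(b); i += 4 {
		runes = append(runes, rune(binary.LittleEndian.Uint32(b[i:i+4])))
	}
	return string(runes)
}

func main() {
	// L"世" as a 32-bit LE wchar_t: code point U+4E16.
	raw := []byte{0x16, 0x4E, 0x00, 0x00}
	fmt.Println(decodeWchar32LE(raw)) // 世
}
```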
The problem with comment #10 is that you would either:

a) need to know what the definition of wchar_t is on the target platform, or
b) use the mbtowc() family of functions, which requires you to know what the multibyte encoding is.

If we can guarantee that all systems supported by Go have a multibyte encoding of UTF-8, then we can implement this portably. Alas:

```
$ uname -a
Linux pietro-laptop 3.13.0-29-generic #52-Ubuntu SMP Wed May 28 12:42:47 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ cat multibyte.c
```

```c
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <string.h>
#include <errno.h>
#include <locale.h>

int main(void) {
	wchar_t wide = L'世';
	char multibyte[MB_LEN_MAX];
	int i, n;

	setlocale(LC_ALL, "");
	errno = 0;
	n = wctomb(multibyte, wide);
	if (n == -1) {
		fprintf(stderr, "error %s\n", strerror(errno));
		return 1;
	}
	if (n == 0) {
		fprintf(stderr, "weird: wctomb() returned 0 (no bytes in output)\n");
		return 2;
	}
	for (i = 0; i < n; i++)
		printf("%02X ", multibyte[i]);
	printf("\n");
	return 0;
}
```

```
$ LC_CTYPE= ./a.out
FFFFFFE4 FFFFFFB8 FFFFFF96
$ LC_CTYPE=en_US.UTF8 ./a.out
FFFFFFE4 FFFFFFB8 FFFFFF96
$ LC_CTYPE=ja_JP.SJIS ./a.out
FFFFFF90 FFFFFFA2
```

So as far as I can gather, a C.CWString() would need to be platform-specific. For Windows, we can either:

- do the work on the Go side: have unicode/utf16 do the conversion (this is what package syscall does), or
- do the work on the C side: use MultiByteToWideChar() in kernel32.dll, passing CP_UTF8 as the first argument (which should work regardless of locale).

For the Unixes, though, I'm not sure... other than linking to libiconv, which I imagine isn't optimal, or flat out not providing it, since it isn't used much to begin with, in which case for Windows we could just say: use the routines in package syscall. (I have wanted to prune through cgo myself sometime.)
C99 and later specify that if __STDC_ISO_10646__ is defined, then wchar_t characters have values equal to their Unicode code points. We could conditionally provide/expose C.WcharString() (or C.CWString() or whatever) only if the C compiler defines that macro, and then I don't think we need to rely on any external libraries like libiconv. I think the only nit would be how to handle code points greater than WCHAR_MAX. ISO C doesn't specify how to handle that case, but in practice it seems like encoding characters as UTF-{8*sizeof(wchar_t)} should work. Varying the implementation depending on sizeof(wchar_t) might be a tad involved, but nothing really out of the ordinary from what cgo already has to do, I think.
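The size-dependent branching could be as small as this sketch (encodeWide is a hypothetical helper, not a proposed API; a real implementation would live in cgo's generated code and write into C-allocated memory):

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// encodeWide sketches what a C.WcharString-style helper could do
// under __STDC_ISO_10646__: with a 4-byte wchar_t each element
// holds a code point directly (UTF-32), while with a 2-byte
// wchar_t code points above 0xFFFF must be split into UTF-16
// surrogate pairs. uint32 is used here only as a common carrier.
func encodeWide(s string, wcharSize int) []uint32 {
	var out []uint32
	switch wcharSize {
	case 4:
		for _, r := range s {
			out = append(out, uint32(r))
		}
	case 2:
		for _, u := range utf16.Encode([]rune(s)) {
			out = append(out, uint32(u))
		}
	default:
		panic("unsupported wchar_t size")
	}
	return out
}

func main() {
	fmt.Printf("%X\n", encodeWide("𝄞", 4)) // one code point: [1D11E]
	fmt.Printf("%X\n", encodeWide("𝄞", 2)) // surrogate pair: [D834 DD1E]
}
```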
Hm, at least GCC (4.8.2) on Ubuntu 14.04 defines it:

```
$ echo | gcc -E -dD - | grep STDC_ISO_10646
#define __STDC_ISO_10646__ 201103L
```

(Seems to come from /usr/include/stdc-predef.h, provided by glibc.) But indeed GCC 4.6.3 on Ubuntu 12.04, or even just Clang 3.5 on Ubuntu 14.04, do not, so that's unfortunate.
Oh, older glibc versions define __STDC_ISO_10646__ in <features.h>, which then gets pulled in by other glibc headers like <wchar.h>, but it won't be defined by default or by GCC-provided headers like <stddef.h>. But I suppose it's still not a very worthwhile signal unless Windows and OS X also define it.