Open SwigAtSF opened 11 years ago
Patch (I tested passing wchar_t, but not returning it)
Logged In: YES user_id=14972 Originator: NO
On Linux at least, sizeof(wchar_t) is 4, so U2 will truncate characters outside the BMP. That's better than the current situation, but should this actually be U4?
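To make the truncation concern concrete, here is a minimal sketch. It assumes the 16-bit slot simply keeps the low 16 bits of the value. The helper names `truncate_to_16_bits` and `fits_in_bmp` are hypothetical and are not part of SWIG or the patch. `char32_t` stands in for a 4-byte `wchar_t` so the behaviour is the same on every platform.

```cpp
#include <cstdint>

// Hypothetical helper: mimics what a 16-bit (U2-style) slot would keep
// of a code point stored in a 32-bit character type.
constexpr std::uint16_t truncate_to_16_bits(char32_t code_point) {
    return static_cast<std::uint16_t>(code_point & 0xFFFFu);
}

// True when the code point lies in the Basic Multilingual Plane
// (U+0000..U+FFFF), i.e. when no information is lost by truncation.
constexpr bool fits_in_bmp(char32_t code_point) {
    return code_point <= 0xFFFFu;
}
```

Under that assumption, U+20AC (the euro sign) passes through unchanged, while U+1F600 comes out as 0xF600, which is a different character.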
Logged In: YES user_id=171344 Originator: YES
sizeof(wchar_t) is 4??? That's amazing to me. What a waste of memory. I'll be sure not to call my wide characters "wchar_t" if I get around to coding on Linux.
Unfortunately, U4 doesn't work on Win32; the .NET framework throws an exception with a message saying 'char' can only marshal as U1, U2, I1 or I2.
Logged In: YES user_id=171344 Originator: YES
Here's an idea: perhaps wchar_t should be marshalled as a plain int in the PINVOKE class, and the two wrappers can convert between char and wchar_t on each end.
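A rough sketch of what the C++ end of that idea might look like, assuming the value crosses the boundary as a 32-bit signed integer. The names `marshalled_char`, `to_marshalled` and `from_marshalled` are made up for illustration and are not what SWIG generates. The C# wrapper would need a matching conversion between `char` and `int`, which is not shown here.

```cpp
#include <cstdint>

// Hypothetical boundary type: a plain 32-bit int, as it might cross
// the P/Invoke layer instead of a 'char' marshalled as U2.
using marshalled_char = std::int32_t;

// C++ wrapper side: widen wchar_t to the boundary type before returning it.
inline marshalled_char to_marshalled(wchar_t c) {
    return static_cast<marshalled_char>(c);
}

// C++ wrapper side: narrow the boundary type back to wchar_t on input.
// Where wchar_t is 16 bits, values above 0xFFFF would still be truncated here.
inline wchar_t from_marshalled(marshalled_char value) {
    return static_cast<wchar_t>(value);
}
```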
Logged In: YES user_id=14972 Originator: NO
A 32-bit type is the narrowest available integer type which can hold the full Unicode range, so I guess that's why it was chosen. Unicode as wide characters inevitably is wasteful if you don't actually use that range. Restricting to the BMP and using a 16-bit type is wasteful if you only have English text. If you only want upper case letters and 6 other characters, you only need 5 bits, so 8 bits per character is wasteful!
Anyway, using a plain int sounds reasonable to me, but I don't really know the innards of C# - William's your man for that.
Logged In: YES user_id=171344 Originator: YES
Unicode as 32-bit ints is inevitably wasteful even if you use the full Unicode range, which needs only 21 bits (the highest code point is U+10FFFF). Further, AFAIK all "living language" characters fit in 16 bits. If wchar_t were only used to represent single characters it would be fine, but for strings (wstring) it's very wasteful. I guess it's better to use std::basic_string rather than std::wstring, although on Windows wchar_t may be better because a debugger understands that it represents characters. As for whether to use 16-bit or 8-bit strings, it's a no-win situation. UTF-8 is inefficient for representing languages like Chinese, while UTF-16 is inefficient for European languages. And the minimum addressing boundary of our computers is 8 bits, so I'm afraid 5-bit character strings are out :P
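The size trade-off between the encodings can be sketched with the standard UTF-8 and UTF-16 length rules. The function names below are hypothetical, and the sketch assumes its input is a valid Unicode scalar value (no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>

// Bytes needed to encode one code point in UTF-8.
constexpr std::size_t utf8_bytes(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed in UTF-16: one 16-bit unit inside the BMP,
// a surrogate pair (two units) outside it.
constexpr std::size_t utf16_bytes(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

// Bytes needed in UTF-32 (what a 4-byte wchar_t amounts to): always four.
constexpr std::size_t utf32_bytes(char32_t) {
    return 4;
}
```

For an ASCII letter this gives 1, 2 and 4 bytes respectively. For a common CJK ideograph such as U+4E2D it gives 3, 2 and 4.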
Logged In: YES user_id=242951 Originator: NO
A test case demonstrating the fix would be appreciated before submitting the patch.