kkaempf / swig-issues

Issues from SWIG (testing)

C#: wchar_t should be marshalled as UnmanagedType.U2 #91

Open SwigAtSF opened 11 years ago

SwigAtSF commented 11 years ago

wchar_t should be marshalled as UnmanagedType.U2, not the default of 1 byte.

SwigAtSF commented 11 years ago

Patch (I tested passing wchar_t, but not returning it)

SwigAtSF commented 11 years ago

Logged In: YES user_id=14972 Originator: NO

On Linux at least, sizeof(wchar_t) is 4, so U2 will truncate characters outside the BMP. That's better than the current situation, but should this actually be U4?
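
To illustrate the truncation concern, here is a minimal sketch. It assumes that squeezing a 32-bit wchar_t value through a 16-bit (U2-style) slot keeps only the low 16 bits. The helper names `narrow_to_u2` and `survives_u2` are hypothetical and are not part of SWIG or .NET.

```cpp
#include <cstdint>

// Hypothetical helper: models a 32-bit code point being narrowed to a
// 16-bit slot, assuming only the low 16 bits are kept.
constexpr std::uint16_t narrow_to_u2(std::uint32_t code_point) {
    return static_cast<std::uint16_t>(code_point & 0xFFFFu);
}

// True when the value is unchanged by the narrowing, i.e. it lies in the
// Basic Multilingual Plane (U+0000..U+FFFF).
constexpr bool survives_u2(std::uint32_t code_point) {
    return narrow_to_u2(code_point) == code_point;
}
```

Under that assumption, U+0041 passes through unchanged, while U+1F600 comes out as 0xF600, which is a different character.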

SwigAtSF commented 11 years ago

Logged In: YES user_id=171344 Originator: YES

sizeof(wchar_t) is 4??? That's amazing to me. What a waste of memory. I'll be sure not to call my wide characters "wchar_t" if I get around to coding on Linux.

Unfortunately, U4 doesn't work on Win32; the .NET framework throws an exception with a message saying 'char' can only marshal as U1, U2, I1 or I2.

SwigAtSF commented 11 years ago

Logged In: YES user_id=171344 Originator: YES

Here's an idea: perhaps wchar_t should be marshalled as a plain int in the PINVOKE class, and the two wrappers can convert between char and wchar_t on each end.
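
A rough sketch of what the native half of that idea might look like, assuming the value crosses the P/Invoke boundary as a 32-bit integer. `next_char` stands in for an arbitrary wrapped function and `wrap_next_char` for a generated wrapper. Both names are hypothetical and this is not what SWIG currently emits.

```cpp
#include <cstdint>

// Hypothetical native function that a user would want to wrap.
wchar_t next_char(wchar_t c) {
    return static_cast<wchar_t>(c + 1);
}

// Hypothetical C wrapper: the character travels as a plain 32-bit int, so
// the boundary does not depend on sizeof(wchar_t) on either platform.
extern "C" std::int32_t wrap_next_char(std::int32_t c) {
    wchar_t native_arg = static_cast<wchar_t>(c);     // int -> wchar_t
    wchar_t native_result = next_char(native_arg);
    return static_cast<std::int32_t>(native_result);  // wchar_t -> int
}
```

The C# side would then have to convert between char and int itself, and where wchar_t is 16 bits the first cast still drops anything outside the BMP.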

SwigAtSF commented 11 years ago

Logged In: YES user_id=14972 Originator: NO

A 32-bit type is the narrowest commonly available integer type that can hold the full Unicode range, so I guess that's why it was chosen. Storing Unicode as wide characters is inevitably wasteful if you don't actually use that range. Restricting to the BMP and using a 16-bit type is wasteful if you only have English text. If you only want upper-case letters and 6 other characters, you only need 5 bits, so 8 bits per character is wasteful!

Anyway, using a plain int sounds reasonable to me, but I don't really know the innards of C# - William's your man for that.
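
The bit-width arithmetic in this comment can be checked with a small sketch. `bits_needed` is a hypothetical helper that returns the smallest number of bits able to distinguish a given number of values.

```cpp
#include <cstdint>

// Hypothetical helper: smallest b such that 2^b >= count.
// Intended for small counts only (count must not exceed 2^63).
unsigned bits_needed(std::uint64_t count) {
    unsigned bits = 0;
    while ((std::uint64_t{1} << bits) < count) {
        ++bits;
    }
    return bits;
}
```

With 26 letters plus 6 other characters there are 32 values, which need 5 bits. The Unicode code space U+0000..U+10FFFF has 0x110000 values, which need 21 bits, so a 16-bit type cannot hold it and the next standard integer width up is 32 bits.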

SwigAtSF commented 11 years ago

Logged In: YES user_id=171344 Originator: YES

Unicode as 32-bit ints is inevitably wasteful even if you use the full Unicode range, which needs only 21 bits. Further, AFAIK all "living language" characters fit in 16 bits. If wchar_t were only used to represent single characters it would be fine, but for strings (wstring) it's very wasteful. I guess it's better to use std::basic_string rather than std::wstring, although on Windows wchar_t may be better because a debugger understands that it represents characters.

As for whether to use 16-bit or 8-bit strings, it's a no-win situation. UTF-8 is inefficient for representing languages like Chinese, while UTF-16 is inefficient for European languages. And the minimum addressing boundary of our computers is 8 bits, so I'm afraid 5-bit character strings are out :P
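
The size trade-off described above can be made concrete with a short sketch that computes how many bytes one code point occupies in each encoding. The function names are hypothetical. The thresholds follow the standard UTF-8 and UTF-16 encoding rules, and the input is assumed to be a valid code point (no surrogates, at most U+10FFFF).

```cpp
#include <cstddef>
#include <cstdint>

// Bytes used by one code point in UTF-8 (1 to 4 bytes).
constexpr std::size_t utf8_bytes(std::uint32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes used by one code point in UTF-16 (one or two 16-bit units).
constexpr std::size_t utf16_bytes(std::uint32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

// Bytes used by one code point in UTF-32 (always one 32-bit unit).
constexpr std::size_t utf32_bytes(std::uint32_t) {
    return 4;
}
```

For example, 'A' (U+0041) takes 1 byte in UTF-8 and 2 in UTF-16, while a common Chinese character such as U+4E2D takes 3 bytes in UTF-8 and 2 in UTF-16. Both take 4 bytes in UTF-32, which is what a 4-byte wchar_t amounts to.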

SwigAtSF commented 11 years ago

Logged In: YES user_id=242951 Originator: NO

A test case demonstrating the fix would be appreciated before submitting the patch.