C#: wchar_t should be marshalled as UnmanagedType.U2

SwigAtSF commented 11 years ago

Artifact_id: 1797418
Opened: 2007-09-18 22:47:42 +0200
Submitter: David Piepgrass <qwertie@users.sf.net>
Assignee: William Fulton <wsfulton@users.sf.net>

wchar_t should be marshalled as UnmanagedType.U2, not the default of 1 byte.

SwigAtSF commented 11 years ago

Date: 2007-09-18 22:47:43 +0200
From: David Piepgrass <qwertie@users.sf.net>
Added: wchar.i.diff

Patch (I tested passing wchar_t, but not returning it)

SwigAtSF commented 11 years ago

Date: 2007-09-19 16:25:38 +0200
From: Olly Betts <olly@users.sf.net>

Logged In: YES user_id=14972 Originator: NO

On Linux at least, sizeof(wchar_t) is 4, so U2 will truncate characters outside the BMP. That's better than the current situation, but should this actually be U4?
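
To make the truncation concern concrete, here is a minimal C++11 sketch of what happens when a 32-bit code point is squeezed into a 2-byte slot. The helper name `truncate_to_u2` is made up for illustration. It models only the narrowing itself, not what the .NET marshaller actually does internally.

```cpp
#include <cstdint>

// Hypothetical helper: keep only the low 16 bits of a code point, which is
// what storing a 4-byte wchar_t value in a 2-byte field amounts to.
constexpr std::uint16_t truncate_to_u2(std::uint32_t code_point) {
    return static_cast<std::uint16_t>(code_point & 0xFFFFu);
}

// A BMP character (U+00E9) survives unchanged.
static_assert(truncate_to_u2(0x00E9u) == 0x00E9u, "BMP value is preserved");

// A character outside the BMP (U+1F600) loses its high bits and turns into
// an unrelated 16-bit value.
static_assert(truncate_to_u2(0x1F600u) == 0xF600u, "non-BMP value is truncated");
```

Compiling this with any C++11 compiler checks both assertions.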

SwigAtSF commented 11 years ago

Date: 2007-09-19 17:13:50 +0200
From: David Piepgrass <qwertie@users.sf.net>

Logged In: YES user_id=171344 Originator: YES

sizeof(wchar_t) is 4??? That's amazing to me. What a waste of memory. I'll be sure not to call my wide characters "wchar_t" if I get around to coding on Linux.

Unfortunately, U4 doesn't work on Win32; the .NET framework throws an exception with a message saying 'char' can only marshal as U1, U2, I1 or I2.

SwigAtSF commented 11 years ago

Date: 2007-09-19 17:15:30 +0200
From: David Piepgrass <qwertie@users.sf.net>

Logged In: YES user_id=171344 Originator: YES

Here's an idea: perhaps wchar_t should be marshalled as a plain int in the PINVOKE class, and the two wrappers can convert between char and wchar_t on each end.
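
A rough sketch of the native half of that idea, in C++11. Both function names are hypothetical and this is not what SWIG generates. The point is only that the boundary carries a plain `int` and the conversion to `wchar_t` happens inside the wrapper.

```cpp
// Hypothetical native function that works in terms of wchar_t.
constexpr wchar_t next_wide_char(wchar_t c) {
    return static_cast<wchar_t>(c + 1);
}

// Hypothetical wrapper: takes and returns a plain int, so the marshalled
// size does not depend on sizeof(wchar_t). A real wrapper would also be
// exported with C linkage; that is left out here to keep the sketch small.
constexpr int wrap_next_wide_char(int c) {
    return static_cast<int>(next_wide_char(static_cast<wchar_t>(c)));
}

// Assumption: int is at least as wide as wchar_t on the target platform.
static_assert(sizeof(int) >= sizeof(wchar_t),
              "int must be able to carry a wchar_t value");
```

The managed side would do the mirror-image conversion between `char` and `int`. Where `wchar_t` is only 16 bits wide, values above 0xFFFF still cannot survive the native conversion.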

SwigAtSF commented 11 years ago

Date: 2007-09-20 01:15:04 +0200
From: Olly Betts <olly@users.sf.net>

Logged In: YES user_id=14972 Originator: NO

A 32 bit type is the narrowest available integer type which can hold the full Unicode range, so I guess that's why it was chosen. Unicode as wide characters inevitably is wasteful if you don't actually use that range. Restricting to the BMP and using a 16 bit type is wasteful if you only have English text. If you only want upper case letters and 6 other characters, you only need 5 bits, so 8 bits per character is wasteful!
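
The bit counts behind this argument can be checked with a small C++11 sketch. `bits_needed` and the constant names are made up for illustration.

```cpp
#include <cstdint>

// Number of bits needed to represent `value` (0 needs 0 bits).
constexpr unsigned bits_needed(std::uint32_t value) {
    return value == 0 ? 0u : 1u + bits_needed(value >> 1);
}

constexpr std::uint32_t kMaxCodePoint = 0x10FFFF; // highest Unicode code point
constexpr std::uint32_t kMaxBmp = 0xFFFF;         // highest BMP code point

// The full range needs 21 bits, so among the usual 8/16/32-bit integer
// types only the 32-bit one is wide enough.
static_assert(bits_needed(kMaxCodePoint) == 21, "full Unicode range");
static_assert(bits_needed(kMaxBmp) == 16, "BMP only");

// 26 upper case letters plus 6 other characters give 32 values, indexed
// 0..31, which fit in 5 bits.
static_assert(bits_needed(26 + 6 - 1) == 5, "32 symbols");
```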

Anyway, using a plain int sounds reasonable to me, but I don't really know the innards of C# - William's your man for that.

SwigAtSF commented 11 years ago

Date: 2007-09-20 21:38:23 +0200
From: David Piepgrass <qwertie@users.sf.net>

Logged In: YES user_id=171344 Originator: YES

Unicode as 32-bit ints is inevitably wasteful even if you use the full Unicode range, which needs only 21 bits (code points stop at U+10FFFF). Further, AFAIK all "living language" characters fit in 16 bits. If wchar_t were only used to represent single characters it would be fine, but for strings (wstring) it's very wasteful. I guess it's better to use std::basic_string with a 16-bit character type rather than std::wstring, although on Windows wchar_t may be better because a debugger understands that it represents characters.

As for whether to use 16-bit or 8-bit strings, it's a no-win situation. UTF-8 is inefficient for representing languages like Chinese, while UTF-16 is inefficient for European languages. And the smallest addressable unit on our computers is 8 bits, so I'm afraid 5-bit character strings are out :P
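
The size trade-off can be put in numbers with a short C++11 sketch. The function names are made up for illustration. The thresholds follow the standard UTF-8 and UTF-16 encoding rules, and the input is assumed to be a valid code point.

```cpp
#include <cstdint>

// Bytes needed to encode one code point in UTF-8.
constexpr unsigned utf8_bytes(std::uint32_t cp) {
    return cp < 0x80 ? 1u : cp < 0x800 ? 2u : cp < 0x10000 ? 3u : 4u;
}

// Bytes needed in UTF-16: one 16-bit unit inside the BMP, a surrogate pair
// (two units) outside it.
constexpr unsigned utf16_bytes(std::uint32_t cp) {
    return cp < 0x10000 ? 2u : 4u;
}

// 'A' (U+0041): UTF-8 is half the size of UTF-16.
static_assert(utf8_bytes(0x41) == 1 && utf16_bytes(0x41) == 2, "ASCII");

// U+4E2D (a common Chinese character): UTF-8 takes 3 bytes, UTF-16 takes 2.
static_assert(utf8_bytes(0x4E2D) == 3 && utf16_bytes(0x4E2D) == 2, "CJK");

// U+1F600 (outside the BMP): both take 4 bytes, the same as a 32-bit wchar_t.
static_assert(utf8_bytes(0x1F600) == 4 && utf16_bytes(0x1F600) == 4, "non-BMP");
```

Accented Latin letters such as U+00E9 take 2 bytes in both encodings, so the UTF-8 advantage for European text comes mostly from its ASCII portion.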

SwigAtSF commented 11 years ago

Date: 2008-03-12 22:51:40 +0100
From: William Fulton <wsfulton@users.sf.net>

Logged In: YES user_id=242951 Originator: NO

A test case demonstrating the fix would be appreciated before submitting the patch.
