mettli / guichan

Automatically exported from code.google.com/p/guichan

Lack of Unicode support because of custom ASCII conversion table. #15

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Use the library in The Mana World project with sdlinput and SDL_EnableUNICODE(1);
2. Typing doesn't register all Unicode characters
3. TextField doesn't display all Unicode characters

Don't limit input to ASCII characters, it's bad.
http://www.libsdl.org/cgi/docwiki.cgi/SDLKey

GNU/Linux Debian (sid) - guichan 0.7.1

Original issue reported on code.google.com by mateusz....@gmail.com on 15 Sep 2007 at 9:06

GoogleCodeExporter commented 9 years ago
So, as posted on the mailing list, the patch to SDL input to allow unicode is pretty small:

int SDLInput::convertKeyCharacter(SDL_Event event):

-        int value = 0;
-
-        if (keysym.unicode < 255)
-        {
-            value = (int)keysym.unicode;
-        }
+        int value = keysym.unicode;

However, gcn::TextBox and gcn::TextField would also require changes to make sure the cursor position isn't in the middle of a multi-byte unicode character (changes available in TMW source). But maybe this should be some kind of global option somewhere, in order not to bother people who are not using unicode.
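
A minimal sketch of the cursor check described above, assuming the text is valid UTF-8 stored in a std::string. The helper names isContinuationByte and snapToCharacterStart are hypothetical and are not taken from the TMW source or from Guichan.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool isContinuationByte(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: moves a byte offset backwards until it no longer
// points into the middle of a multi-byte sequence.
// Assumes text holds valid UTF-8.
inline std::size_t snapToCharacterStart(const std::string& text,
                                        std::size_t offset)
{
    assert(offset <= text.size());

    while (offset > 0
           && offset < text.size()
           && isContinuationByte(static_cast<unsigned char>(text[offset])))
    {
        --offset;
    }

    return offset;
}
```

A widget could call such a helper whenever the caret is set from a raw byte position, so that the caret always lands on a character boundary.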

Rendering unicode strings correctly is up to the font class.

Original comment by b.lindeijer on 18 Feb 2008 at 10:54

GoogleCodeExporter commented 9 years ago
For this to work correctly, the enum in key.hpp will have to be changed, since special keys like shift, alt and ctrl have ids that start at 1000, which overlap some unicode characters.

Original comment by final...@gmail.com on 18 Feb 2008 at 11:02

GoogleCodeExporter commented 9 years ago
Ah true, our SDLInput class uses a copy of that enum with LEFT_ALT starting at -1000 instead of 1000, to avoid conflicts with higher character codes.
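
A minimal sketch of why the negative start value avoids the overlap. The enum below is an illustrative subset with assumed member order, not the actual contents of key.hpp, and the helper functions are hypothetical.

```cpp
#include <cassert>

namespace sketch
{
    // Illustrative subset only; the real enum has many more members.
    enum SpecialKey
    {
        LEFT_ALT = -1000,
        RIGHT_ALT,      // -999
        LEFT_SHIFT,     // -998
        RIGHT_SHIFT     // -997
    };

    // Unicode code points are never negative, so a negative value can only
    // be one of the special keys.
    inline bool isSpecialKey(int value)
    {
        return value < 0;
    }

    // Stores a 16-bit unicode value (as delivered by SDL 1.2 key events)
    // without clamping it to the 0..254 range.
    inline int characterKeyValue(unsigned short unicode)
    {
        const int value = unicode;
        assert(!isSpecialKey(value)); // cannot collide with the negative range
        return value;
    }
}
```

With the special keys at 1000 and above, a character such as U+03E8 (decimal 1000) would be indistinguishable from LEFT_ALT; with negative ids the two ranges cannot meet.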

Original comment by b.lindeijer on 18 Feb 2008 at 11:13

GoogleCodeExporter commented 9 years ago
Hi,

I just finished my UTF-8 solution for guichan.

All widgets that display text without any manipulation (window, button, ...) can simply use a UTF-8 aware font. TextBox and TextField are not among those widgets, so I created this package.

This package contains:
- UTF-8 version of TextField (UTF8TextField)
- UTF-8 version of TextBox (UTF8TextBox)
- UTF8StringEditor - helper class for manipulating UTF-8 strings
- SDLUTF8TrueTypeFont - Extended SDLTrueTypeFont (from guichan addons dir)
- key.diff - solves the issue reported by finalman by applying b.lindeijer's solution :)
- utf8 template library from http://utfcpp.sourceforge.net/

I didn't test the SDLUTF8TrueTypeFont class. I use another UTF-8 SDL_ttf solution in my project, but it depended too much on other stuff, so I simply modified SDLTrueTypeFont (getStringIndexAt is from my original class, so it should work).

I made UTF8StringEditor an external class so that more widgets can use it, and so that later you could write more obscure string editors based on std::string (even for encodings like UCS-4, which use fixed 32-bit integers to store a single character).

The attached screenshot shows some international characters displayed with DejaVuSans. I have no idea what the texts mean, I just copied some random characters from http://dejavu.sourceforge.net/wiki/index.php/Testing, so please don't blame me if it is something insulting :)

I hope this code will find its way into guichan in a future release, but until then you can simply use this package (remember to apply the patch to guichan if you intend to use unicode characters >= 1000).

I think you may find it useful. If so, I would appreciate some comments.

Original comment by nexat...@gmail.com on 24 Mar 2008 at 2:55

Attachments:

GoogleCodeExporter commented 9 years ago
It seems that you are on to something here. I like the fact that none of the original widgets need to change, and that you use a specific UTF8 aware widget instead when it's needed.

I also like the fact that a UTF8 package can be isolated from the rest of Guichan, perhaps in its own namespace under the gcn namespace.

However, the use of a plain std::string seems a bit risky to me. I mean, the string itself can't be used as a regular std::string. Perhaps a better approach would be to let the UTF8, or unicode, aware widgets work on an abstract string implementation that looks like an ordinary string but works with different encodings. If you want to change the encoding you don't change the string editor (it's not needed), you simply pass another instance of the abstract gcn::string to the widgets. The abstract string could use a fixed bit length (say 16 bits per character, even though some characters in fact are made of 8 bits) so random access works properly.

The core widgets that need to edit or display text could also be changed so that they better abstract the way strings are used, making it easier to implement unicode aware widgets that use a string that's not a std::string.

Original comment by olof.nae...@gmail.com on 24 Mar 2008 at 11:00

GoogleCodeExporter commented 9 years ago
I love UTF-8 because it is ASCII compatible. I used ISO-8859-2 before, but problems with character tables were constant!

UTF-8 was invented to allow older applications (using 8-bit integers as character cells) to use UNICODE characters with minimal to no modification to existing software. Since most of the widgets will transparently work with UTF-8, why should anything be changed?

From the guichan about page:

"Guichan is a small, efficient C++ GUI library designed for games"

Since guichan is made for games, how many text fields may a game have? Only a few, probably in a high score editor and maybe in the options (unless you are making a Space Empires clone), and of course The Mana World login/registration/character setup. If someone is using single byte character sets, they may use ImageFont with all widgets. If someone needs real UNICODE support, they use the UTF-8 versions of the widgets where required. Performance costs are minimal for a few rarely used widgets. Also remember that SDL_ttf uses UNICODE internally (I think UCS-2), so when you render a Latin1 string it is also converted to UNICODE.

There is also a std::wstring class, which is nice, but just like wchar_t it's 16 bits wide on windows and 32 bits wide on linux. MinGW, which is very popular under windows, doesn't have std::wstring. Also, UTF-8 looks way better in C++ source code and doesn't need special handling from the C++ compiler.

Such an abstract string would have to use UCS-4, for example, and then be converted to anything the user wants with iconv or something similar, but how would the source code look?

Oh, and if I remember correctly, wxWidgets uses UTF-8 internally!

Original comment by nexat...@gmail.com on 24 Mar 2008 at 11:38

GoogleCodeExporter commented 9 years ago
The derivations from the abstract string class can use any type of encoding. It would be completely transparent to the widgets. The only thing that needs to know about the encoding is the font.

Personally I don't really care about the number of text fields or text areas, it's totally irrelevant. The big issue here is that it needs to be easy to use, it has to scale well, it should require as little change to Guichan as possible and it should not break the usage of plain old ascii std::strings.

I can't tell you what the source code would look like, but it would be an implementation of a string that's aware of its encoding, much like your StringEditor. I think going with the Java approach and making a UTF-16 implementation is a good start. That way users who need unicode can use a special UTF-16 string and unicode aware widgets, and users who don't need unicode can stick with std::string and the normal non unicode aware widgets.

If the core widgets are changed so that they always use virtual methods for accessing and drawing strings, then much of the code can be reused. Unicode aware widgets could simply inherit protected from a core widget and reuse most of the core widget's code, only substituting the way text is handled.

Original comment by olof.nae...@gmail.com on 24 Mar 2008 at 12:30

GoogleCodeExporter commented 9 years ago
UTF-8 doesn't have any problems with byte endianness since it uses 8-bit code sequences (I don't really know why UTF-8 has support for a BOM).

On the other hand, UTF-16 is also a variable-length character encoding. A single character may take one 16-bit integer or two 16-bit integers, so it doesn't remove the need to iterate through the string to get the character position from the byte position!
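
A minimal sketch of the point about UTF-16 being variable length: code points above U+FFFF are stored as a surrogate pair, so counting characters still means walking the sequence. The function names are hypothetical and do not come from Guichan or any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends one code point to a UTF-16 sequence, using a surrogate pair for
// code points above U+FFFF.
inline void appendUtf16(std::uint32_t codePoint,
                        std::vector<std::uint16_t>& out)
{
    assert(codePoint <= 0x10FFFF);
    assert(codePoint < 0xD800 || codePoint > 0xDFFF); // surrogates are not characters

    if (codePoint < 0x10000)
    {
        out.push_back(static_cast<std::uint16_t>(codePoint));
    }
    else
    {
        const std::uint32_t offset = codePoint - 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 + (offset >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF)));
    }
}

// Counts characters by skipping the second half (low surrogate) of each pair.
inline std::size_t countUtf16Characters(const std::vector<std::uint16_t>& units)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < units.size(); ++i)
    {
        if (units[i] < 0xDC00 || units[i] > 0xDFFF)
        {
            ++count;
        }
    }
    return count;
}
```

For example, 'A' followed by U+1D11E (a musical symbol outside the 16-bit range) occupies three 16-bit units but counts as two characters.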

UTF-8 advantages:
- It's a good standard, all C++ compilers will handle it transparently.
- Most editors can handle UTF-8 source files.
- No special magic is required in source code to create UTF-8 constants.
- SDL_ttf cannot handle UTF-16 strings, only Latin1, UTF-8 and UCS-2.
- Allegro assumes by default that all strings are in UTF-8.

I understand that having a string class like utf8::string would be nice, but I couldn't find one anywhere, and I am not about to write such a freak (I am not an expert on STL, I made a doxygen html help for the GNU STL implementation, it's 36MB in size!!!!). Don't you think it is out of the scope of guichan?

I think making TextField and TextBox templates could handle the situation.

typedef TTextBox<std::string> TextBox; // for backward compatibility

This way, TextBox would still work for everyone using it today. My UTF8TextBox will still work and could be used until a brave soul writes utf8::string, utf16::string, ucs32::string or whatever.
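
A minimal sketch of the template idea, heavily simplified: the real gcn::TextBox is a multi-row widget, and everything here except the TTextBox name and the backward-compatibility typedef is an assumption made for illustration (including the use of std::u32string as a stand-in for a fixed-width string type).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch
{
    // Text storage parameterised on the string type.
    template <typename StringType>
    class TTextBox
    {
    public:
        TTextBox() : mCaret(0) {}

        void setText(const StringType& text) { mText = text; }
        const StringType& getText() const { return mText; }

        // Length in code units of StringType. This equals the number of
        // characters only for fixed-width encodings.
        std::size_t getLength() const { return mText.size(); }

        void setCaretPosition(std::size_t position)
        {
            assert(position <= mText.size());
            mCaret = position;
        }
        std::size_t getCaretPosition() const { return mCaret; }

    private:
        StringType mText;
        std::size_t mCaret;
    };

    typedef TTextBox<std::string> TextBox;          // for backward compatibility
    typedef TTextBox<std::u32string> UTF32TextBox;  // hypothetical fixed-width variant
}
```

Note that the template alone does not make a std::string based box UTF-8 aware: for the two-character text "ąb" the UTF-32 box reports a length of 2, while the std::string box reports 3 bytes.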

However, I still think UTF-8 is the most portable solution for handling internationalization in the least painful way.

Or maybe it is time to start a new project called Portable Unicode C++ string that will implement strings in a way similar to Python or Java strings.

Anyway, what do I have to do with this package so it may go to guichan/addons?

Original comment by nexat...@gmail.com on 24 Mar 2008 at 1:14

GoogleCodeExporter commented 9 years ago
I think the small changes required for UTF-8 support in TMW and these classes by Przeme show that there is really no need to abstract away std::string just to have UTF-8 support. The only thing you need is some helper functions that allow you to modify the string, calculate the length in characters, etc.

I like the implementation by Przeme and look forward to seeing these classes
available in Guichan. I also think it's a good idea to base the UTF-8 support on
UTF8-CPP. I've got two small remarks on his code:

* Style: Please don't use tabs, Guichan uses 4 spaces to indent code.
* Efficiency: Couldn't UTF8StringEditor::insertChar use utf8::append on an empty std::string using the std::back_inserter (noted in the UTF8-CPP docs for utf8::append), and then use the normal std::string insert method to insert the new unicode part at the requested index?
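
A minimal sketch of the suggested approach. To keep the example self-contained, appendUtf8 below is a hand-written stand-in for utf8::append from UTF8-CPP, and insertChar is a hypothetical free function; the actual signature of UTF8StringEditor::insertChar is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Stand-in for utf8::append: writes one code point as UTF-8 to an output
// iterator and returns the advanced iterator.
template <typename OutputIterator>
OutputIterator appendUtf8(unsigned long codePoint, OutputIterator out)
{
    assert(codePoint <= 0x10FFFF);

    if (codePoint < 0x80)
    {
        *out++ = static_cast<char>(codePoint);
    }
    else if (codePoint < 0x800)
    {
        *out++ = static_cast<char>(0xC0 | (codePoint >> 6));
        *out++ = static_cast<char>(0x80 | (codePoint & 0x3F));
    }
    else if (codePoint < 0x10000)
    {
        *out++ = static_cast<char>(0xE0 | (codePoint >> 12));
        *out++ = static_cast<char>(0x80 | ((codePoint >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (codePoint & 0x3F));
    }
    else
    {
        *out++ = static_cast<char>(0xF0 | (codePoint >> 18));
        *out++ = static_cast<char>(0x80 | ((codePoint >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((codePoint >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (codePoint & 0x3F));
    }

    return out;
}

// Encodes the character into a temporary string through a back_inserter,
// then splices it into the text with std::string::insert.
// Returns the byte offset just after the inserted character.
inline std::size_t insertChar(std::string& text,
                              std::size_t byteOffset,
                              unsigned long codePoint)
{
    assert(byteOffset <= text.size());

    std::string encoded;
    appendUtf8(codePoint, std::back_inserter(encoded));
    text.insert(byteOffset, encoded);

    return byteOffset + encoded.size();
}
```

The caller is assumed to pass a byteOffset that sits on a character boundary; the sketch does not validate that.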

Original comment by b.lindeijer on 24 Mar 2008 at 10:58

GoogleCodeExporter commented 9 years ago
All the UTF8 aware widgets inherit publicly from a core widget. Wouldn't it be better to use protected inheritance that reveals only the valid functions, or public inheritance that overloads all functions that deal with text? If you use the UTF8TextBox, I don't really see the point in having setCaretColumn available (or other methods that deal with non UTF8 strings), as only setCaretColumnUTF8 should be used (and other methods that deal with UTF8 strings).
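
A minimal sketch of the protected-inheritance option, using simplified stand-in classes in a sketch namespace rather than the real gcn::TextBox. The member set is an assumption made for illustration; only the method names setCaretColumn and getCaretByteColumn appear in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch
{
    // Stand-in for a core widget whose caret column is a byte offset.
    class TextBox
    {
    public:
        TextBox() : mCaretColumn(0) {}

        void setText(const std::string& text) { mText = text; }
        const std::string& getText() const { return mText; }

        void setCaretColumn(std::size_t byteOffset)
        {
            assert(byteOffset <= mText.size());
            mCaretColumn = byteOffset;
        }
        std::size_t getCaretColumn() const { return mCaretColumn; }

    private:
        std::string mText;
        std::size_t mCaretColumn;
    };

    // Protected inheritance hides the whole TextBox interface; only the
    // members re-exported below remain callable from outside.
    class UTF8TextBox : protected TextBox
    {
    public:
        using TextBox::setText;
        using TextBox::getText;

        // Character-based replacement for the byte-based base method.
        void setCaretColumn(std::size_t characterIndex)
        {
            const std::string& text = getText();
            std::size_t offset = 0;
            while (offset < text.size() && characterIndex > 0)
            {
                ++offset;
                // Skip continuation bytes (10xxxxxx) of the current character.
                while (offset < text.size()
                       && (static_cast<unsigned char>(text[offset]) & 0xC0) == 0x80)
                {
                    ++offset;
                }
                --characterIndex;
            }
            TextBox::setCaretColumn(offset);
        }

        // The byte offset stays reachable under an explicit name.
        std::size_t getCaretByteColumn() const
        {
            return TextBox::getCaretColumn();
        }
    };
}
```

One consequence of this design is that code outside the class can no longer convert a UTF8TextBox pointer to a TextBox pointer, which matters if other parts of a library expect to receive the base type.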

Also, you still have some problems adapting to our code conventions. We always use a new line for all brackets. We don't have shortened names for methods in Guichan, and we try to keep our code consistent so it will be easier to use.

Another thing, isn't the parameter byteOffset in all of the StringEditor functions more like a characterOffset? Perhaps naming the parameter simply offset will make its use clearer.

I think your code could be added as an add-on. I'm planning on incorporating the add-ons into the main source code, but under another namespace so add-ons will be easier to spot in the future.

Original comment by olof.nae...@gmail.com on 25 Mar 2008 at 6:09

GoogleCodeExporter commented 9 years ago
I use tabs with size set to 4. I will change them to spaces.
b.lindeijer, I'm not an STL guru, but I see it's time to start learning. I will try to implement insertChar with back_inserter.

"""
All the UTF8 aware widgets inherits publicly from a core widget. Isn't it better with a protected inheritance revealing only the valid functions or have a public inheritance overloading all functions that deal with text? If you use the UTF8TextBox I don't really see the point in having setCaretColumn available (or other methods that deal with non UTF8 strings) as only setCaretColumnUTF8 should be used (and other methods that deal with UTF8 strings).
"""

The UTF8 versions of those methods could be avoided if the Caret functions were virtual.

I will modify the string editor so that offset will always be a byte offset while index will be a character index.
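
A minimal sketch of the offset/index convention, assuming valid UTF-8 in a std::string. The names getOffset and countCharacters are mentioned later in this thread as StringEditor members, but their real signatures are not shown, so the free functions below (and the extra getIndex helper) are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch
{
    // True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
    inline bool isContinuationByte(char byte)
    {
        return (static_cast<unsigned char>(byte) & 0xC0) == 0x80;
    }

    // Number of characters (code points) in the text.
    inline std::size_t countCharacters(const std::string& text)
    {
        std::size_t count = 0;
        for (std::size_t i = 0; i < text.size(); ++i)
        {
            if (!isContinuationByte(text[i]))
            {
                ++count;
            }
        }
        return count;
    }

    // Byte offset at which the character with the given index starts.
    // An index equal to countCharacters(text) yields text.size().
    inline std::size_t getOffset(const std::string& text, std::size_t index)
    {
        assert(index <= countCharacters(text));

        std::size_t offset = 0;
        while (index > 0)
        {
            ++offset;
            while (offset < text.size() && isContinuationByte(text[offset]))
            {
                ++offset;
            }
            --index;
        }
        return offset;
    }

    // Character index for a byte offset that lies on a character boundary.
    inline std::size_t getIndex(const std::string& text, std::size_t offset)
    {
        assert(offset <= text.size());
        return countCharacters(text.substr(0, offset));
    }
}
```

For the text "ąbź" (5 bytes, 3 characters), index 1 maps to offset 2 and index 2 maps to offset 3.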

I will upload the updated code later today.

And my name is Przemek :)

Original comment by nexat...@gmail.com on 25 Mar 2008 at 10:07

GoogleCodeExporter commented 9 years ago
It shouldn't matter if they are virtual or not. If someone casts a UTF8TextBox to a TextBox, well, then they probably know what they are doing. Of course, making the functions virtual would make it possible to perform such a cast and use the TextBox.

The functions aren't virtual as they might be called from a constructor. But perhaps they could be made virtual. Anyway, I still think you should go with overloading methods rather than adding new ones and leaving the old ones that can't be used lying around.
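
A minimal sketch of the constructor concern mentioned above: in C++, a virtual call made while the base class constructor is running resolves to the base class version, never to a derived override. The classes and the caretUnit method are stand-ins invented for this illustration, not Guichan code.

```cpp
#include <cassert>
#include <string>

namespace sketch
{
    class TextField
    {
    public:
        TextField()
        {
            // During base construction the object is still a TextField, so
            // this runs TextField::caretUnit even for derived objects.
            mUnitSeenByConstructor = caretUnit();
        }
        virtual ~TextField() {}

        virtual std::string caretUnit() const { return "byte"; }

        const std::string& unitSeenByConstructor() const
        {
            return mUnitSeenByConstructor;
        }

    private:
        std::string mUnitSeenByConstructor;
    };

    class UTF8TextField : public TextField
    {
    public:
        virtual std::string caretUnit() const { return "character"; }
    };

    // Shows both behaviours side by side.
    inline void demonstrateDispatch()
    {
        UTF8TextField field;
        const TextField& base = field;

        // The constructor saw the base version...
        assert(field.unitSeenByConstructor() == "byte");
        // ...while a call after construction reaches the override.
        assert(base.caretUnit() == "character");
    }
}
```

So making the caret methods virtual would be safe for calls after construction, but any call from a TextField constructor would still bypass a UTF-8 override.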

Original comment by olof.nae...@gmail.com on 25 Mar 2008 at 10:36

GoogleCodeExporter commented 9 years ago
Now I see that using back_inserter makes more sense and the code is clearer.

I hope everything is ok now.

Since guichan implements neither a clipboard nor selections in textbox and textfield, leaving the original caret manipulation functions in place may be useful, because someone may want to modify the text without UTF-8 knowledge:

// Assumes clipboard is a std::string defined elsewhere.
void ctrlCPressed() {
   TextField* myTextField = getUTF8TextField();
   int caret = myTextField->getCaretPosition();   // byte offset of the caret
   std::string x = myTextField->getText();
   x.insert(caret, clipboard);
   myTextField->setText(x);
   myTextField->setCaretPosition(caret + clipboard.size());
}

Original comment by nexat...@gmail.com on 25 Mar 2008 at 11:27

Attachments:

GoogleCodeExporter commented 9 years ago
> It shouldn't matter if they are virtual or not, if someone casts an 
UTF8TextBox to a
TextBox well then they probably know what they are doing.

Actually you don't need to cast a UTF8TextBox to a TextBox, you could simply assign it. But everybody should of course know what they are doing.

If somebody instantiates a UTF8TextField instead of a normal TextField, I think they should realize that the std::string you get/set _must_ be a valid UTF-8 encoded string. As such, the suggested implementation of ctrlCPressed is completely wrong in my opinion. If somebody wants to write code like that, he shouldn't be using a UTF8TextField.

So anyway, I would prefer the appropriate methods to be made virtual, so that they can be overridden with proper UTF-8 behaviour. I really don't want to have to bother with functions named like setCaretColumnUTF8 (I hadn't noticed before that the methods were named like this).

Original comment by b.lindeijer on 26 Mar 2008 at 8:04

GoogleCodeExporter commented 9 years ago
The ctrlCPressed() method is correct. UTF8TextField and UTF8TextBox always return caret positions in correct places: before a character or after the last character. So inserting another UTF-8 string at the caret position is valid, just like moving the caret by the size of the inserted text.
This is an ugly solution, but still, the behaviour will be as expected.

Original comment by nexat...@gmail.com on 26 Mar 2008 at 8:21

GoogleCodeExporter commented 9 years ago
Ah, you're right. And now I see where you're coming from, since the code that sets the caret position would break if setCaretPosition was a UTF-8 aware method (since it gets a byte index and would treat it as a character index).

Now I'm not so sure anymore, maybe we should have both versions available...

Original comment by b.lindeijer on 26 Mar 2008 at 8:33

GoogleCodeExporter commented 9 years ago
I am sure both versions should be available. The question is which method should return the character index and which the byte index.

I think the Caret position places the caret at the specified character index, so maybe instead of getCaretPositionUTF8 there should be getCaretPositionByte(). This would require virtual methods in TextField and TextBox (for set/get caretPosition and caretColumn).

Original comment by nexat...@gmail.com on 26 Mar 2008 at 9:59

GoogleCodeExporter commented 9 years ago
It might be good to keep both methods, the UTF8 string ones and the original ones, but I still don't like the public inheritance. It has to be changed if this is to be added to Guichan.

I want the UTF8 methods to drop the UTF8 suffix (a user should be safe in assuming that a UTF8 widget works on UTF8 strings), and I want new methods added that work like the original ones, with their purpose clearly stated through good names and good documentation.

Remember, getCaretPosition should only return the position of the caret in the string. A caret's position is determined by the number of characters before the caret. The name implies it should have _nothing_ to do with UTF8. We don't want to confuse people, like Björn :) A method that takes UTF8 into consideration should be added, but with another name, perhaps getCaretPositionByte, which explains that it doesn't return a normal caret position. I think getCaretPositionInUTF8Bytes is even better.

If someone wants to implement a clipboard like your suggestion, then they are using the setCaretPosition method completely wrong, as you have to take UTF8 into account. With your implementation it is, in my opinion, just a coincidence that it works, as the implementation of setCaretPosition is wrong.

Original comment by olof.nae...@gmail.com on 26 Mar 2008 at 4:38

GoogleCodeExporter commented 9 years ago
I propose adding a virtual getCaretBytePosition to TextField and a getCaretByteColumn to TextBox (and the setters). That way, everybody will know that CaretPosition uses a character index and CaretBytePosition uses a byte index. Then my clipboard example will always work the same for Latin, UTF-8, UTF-16 and UTF-32, as long as the clipboard and the edited text are both in the same encoding (not counting base64, UTF-7 and the like).

An alternative solution could be to expose the StringEditor, which already has getOffset and countCharacters. However, it will only work for UTF8TextBox/Field, since the plain TextBox/TextField doesn't have one.

I think you can't decide what I18n policy you want to implement in guichan.

For future releases (like v1.0.0), the truth is that using my StringEditor class (modified to remove ALL raw string manipulations I left in TextBox and TextField because of possible performance losses) is the best solution. Then all you need is to provide 2 string editors in guichan core: ByteStringEditor/LatinStringEditor/ASCIIStringEditor (or anything else you prefer) and my UTF8StringEditor. That way, the user will be able to manipulate any type of string inside text boxes and in code. This makes everything easier. There is only one TextBox class and one TextField class. If you think UTF-32 is better, you write a UTF32StringEditor and every text box and text field may use UTF32 strings (casting from (char*) to (Uint32*) isn't that hard :) )

This is my clipboard example using StringEditor:

void ctrlCPressed() {
   TextField* myTextField = getTextFieldWithSomeStrangeEncoding();
   int caret = myTextField->getCaretPosition();
   StringEditor *editor = myTextField->getStringEditor();

   std::string x = myTextField->getText();
   editor->insert(x, clipboard, caret);   // insert at the caret's character index
   myTextField->setText(x);
   myTextField->setCaretPosition(caret + editor->getLength(clipboard));
}

Something like string editors is also implemented in PHP. You have multibyte string functions which operate on byte strings:

$myExoticTextLength = mb_strlen($string, "EUC-JP");
$myPolishTextLength = mb_strlen("ąźćśłółŹŻżĘ", "UTF-8");

I understand this is not the best solution for an operating system GUI, but for a game GUI, this is much more than most other toolkits provide.

Original comment by nexat...@gmail.com on 26 Mar 2008 at 5:37

GoogleCodeExporter commented 9 years ago
Should I make a patch for TextBox and TextField to support virtual caret methods?

Original comment by nexat...@gmail.com on 28 Mar 2008 at 7:33

GoogleCodeExporter commented 9 years ago
If you could get the whole thing in one patch, that would be welcome. If you want added files to show up in 'svn diff', you'll have to 'svn add' them first.

Original comment by b.lindeijer on 1 Apr 2008 at 9:29

GoogleCodeExporter commented 9 years ago
Please look at Glib::ustring for an implementation of a UTF-8 std::string-compatible class. In essence, from a user's point of view, the only major change is that the indices become per-character instead of per-byte.

Original comment by douso...@gmail.com on 15 Jul 2008 at 11:10

GoogleCodeExporter commented 9 years ago
Disregarding any further unicode support, the keycodes starting at 1000 inhibit any custom implementation of unicode support.

Please change that to: "LEFT_ALT  = -1000" - that's good enough for now.

Original comment by klaus.bl...@web.de on 21 Mar 2009 at 3:38

GoogleCodeExporter commented 9 years ago
Hello. I have never used Guichan but want to add a comment:
Your library looks good, but the fact that it does not support unicode made me not test it or use it for my game.
I am just "one" person, but I want to tell it is important for me.
Good luck

Original comment by christia...@gmail.com on 5 Jun 2010 at 9:46

GoogleCodeExporter commented 9 years ago
I feel it's my fault, 2 years ago I was supposed to send a patch...

Original comment by nexat...@gmail.com on 7 Jun 2010 at 6:14

GoogleCodeExporter commented 9 years ago
Hi!

Any chance of getting some kind of solution for this problem anytime soon? Custom, incompatibly patched guichan forks are creeping up everywhere, which isn't really a nice situation.

Original comment by siccegge...@gmail.com on 14 Jun 2011 at 6:29

GoogleCodeExporter commented 9 years ago
The current version of Guichan will always be ASCII only.

Original comment by olof.nae...@gmail.com on 14 Jun 2011 at 7:50

GoogleCodeExporter commented 9 years ago
For openSUSE I've decided to update to 0.8.2 and apply the unicode patch, which I hope will be maintained in the future. Thanks for enabling this feature. Even if we have to move away from upstream, it's certainly a most welcome one.

Original comment by nmo.marques on 7 Oct 2011 at 5:41