suzdraws / mintty

Automatically exported from code.google.com/p/mintty

East Asian ambiguous character width #88

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This patch adds a "CJK width" option that supports "East Asian Ambiguous
Character Width" (http://unicode.org/reports/tr11/).

When this option is checked, characters in the East Asian Ambiguous category
are treated as two columns wide.

This patch:
1. Adds a new "CJK width" option to the MinTTY Options dialog.
2. Separates Markus Kuhn's wcwidth.c from unicode.c.
3. Upgrades wcwidth.c to the 2007-05-26 (Unicode 5.0) edition.
4. Applies some patches to wcwidth.c.

Original issue reported on code.google.com by deenhe...@gmail.com on 15 Apr 2009 at 3:01
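
For illustration only, here is a minimal Python sketch of what a "CJK width" switch changes. It is not the attached patch (which works on wcwidth.c in C). It assumes Python's standard `unicodedata.east_asian_width()` as the source of the UTR 11 categories, and `char_width` and `text_width` are made-up names. Combining and control characters are ignored.

```python
import unicodedata


def char_width(ch: str, cjk_width: bool = False) -> int:
    """Return the number of terminal columns for one character (simplified)."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        # Wide and Fullwidth characters always take two columns.
        return 2
    if eaw == "A":
        # Ambiguous characters follow the (hypothetical) cjk_width switch.
        return 2 if cjk_width else 1
    # Narrow, Halfwidth and Neutral characters take one column.
    return 1


def text_width(text: str, cjk_width: bool = False) -> int:
    """Sum the column widths of all characters in text."""
    return sum(char_width(ch, cjk_width) for ch in text)
```

With the switch off, `text_width("aα─")` counts 3 columns. With the switch on, the Greek alpha and the box-drawing line count double, giving 5.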

GoogleCodeExporter commented 9 years ago
The PuTTY manual has this to say about this setting:

"There are some Unicode characters whose width is not well-defined. In most
contexts, such characters should be treated as single-width for the purposes of
wrapping and so on; however, in some CJK contexts, they are better treated as
double-width for historical reasons, and some server-side applications may
expect them to be displayed as such. Setting this option will cause PuTTY to
take the double-width interpretation.

If you use legacy CJK applications, and you find your lines are
wrapping in the wrong places, or you are having other display
problems, you might want to play with this setting."

I'm not keen on inflicting such an option on users. If I understand this
correctly, a whole lot of non-legacy apps will break if this is activated?
Surely then, the right thing to do is to fix or replace the legacy apps, rather
than use such an ugly workaround?

What sort of applications are actually affected by this anyway? Anything that's
part of the Cygwin distribution? And how do other terminal emulators deal with
this issue?

Original comment by andy.koppe on 15 Apr 2009 at 3:32

GoogleCodeExporter commented 9 years ago
This problem is NOT legacy!!

> What sort of applications are actually affected by this anyway?

ALL applications that control the terminal are affected: for example, shells
(bash, zsh, tcsh, ...), editors (emacs, vim, ...), viewers (less, lv, lynx,
w3m, ...), and so on.

> And how do other terminal emulators deal with this issue?

As far as I know, most terminal emulators that support Unicode have a CJK
width switch (xterm, mlterm, PuTTY, etc.).

> The PuTTY manual has this to say about this setting:

I do not think that the author of the PuTTY manual understands the problem
correctly.

The problem is not whether an application is legacy.

There are several problems:

- A terminal application does not know the width at which the terminal
emulator displays a character.

- The width of a character depends on the selected font. For example, in the
"MS Gothic" font, "Greek Small Letter Alpha" is twice as wide as "Latin Small
Letter A". But in the "Consolas" font, both characters have the same width.

- Text processing functions (like "wcwidth") cannot handle a character width
that changes dynamically.

- Some applications, such as text formatters, need to know the width of a
character regardless of the terminal.

And we have neither a standard nor a protocol to solve these problems.

Original comment by deenhe...@gmail.com on 15 Apr 2009 at 6:41

GoogleCodeExporter commented 9 years ago
Admittedly I don't really know what I'm talking about here, but you haven't
convinced me. Here's xterm's take on this:

-cjk_width
    Set the cjkWidth resource to ''true''. When turned on, characters with East
    Asian Ambiguous (A) category in UTR 11 have a column width of 2. Otherwise,
    they have a column width of 1. This may be useful for some legacy CJK text
    terminal-based programs assuming box drawings and others to have a column
    width of 2. It also has to be turned on when you specify a TrueType CJK
    double-width (bi-width/monospace) font either with -fa at the command line
    or faceName resource. The default is ''false''.

So since a width of 1 is the default in xterm, I assume that all those standard
programs actually work correctly with that width (and a matching font)? And that
there really is a legacy problem with some applications that expect a width of
2? Do you have any examples of those?

If you set the ambiguous width to 2 for MS Gothic in the terminal, I can't see
how that's going to work with non-legacy applications, which assume ambiguous
characters to have width 1? Shouldn't the "MS Gothic" font therefore be
considered legacy, and the likes of "Consolas" be used instead?

(Also, what does Greek alpha have to do with the East Asian Ambiguous category?)

Original comment by andy.koppe on 15 Apr 2009 at 8:28

GoogleCodeExporter commented 9 years ago
In Japanese, all the characters that fall into the ambiguous-width category
(line-drawing symbols etc.) are wide.
(Traditionally, only ASCII and the one-byte katakana are narrow.)
Therefore, applications that use curses look awful.
For instance, like this:
|
 |
  |

On Linux, I set the relevant part of the locale to wide.
I would welcome this option, because Cygwin does not have full locale support
yet.

Original comment by oustt...@gmail.com on 16 Apr 2009 at 5:28
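
The staircase effect shown above can be reproduced with a small calculation. This is a hedged Python sketch, not code from curses or mintty: it assumes the application counts ambiguous characters as one column while the terminal draws them as two, and `columns` and `drift` are hypothetical helper names.

```python
import unicodedata


def columns(text: str, ambiguous_width: int) -> int:
    """Count terminal columns, using the given width for ambiguous characters."""
    total = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            total += 2
        elif eaw == "A":
            total += ambiguous_width
        else:
            total += 1
    return total


def drift(prefixes, app_width: int = 1, terminal_width: int = 2):
    """For each row prefix, return how far right the terminal ends up
    compared with where the application believes the cursor is."""
    return [columns(p, terminal_width) - columns(p, app_width) for p in prefixes]


# Rows that start with 0, 1 and 2 box-drawing characters (all ambiguous width).
print(drift(["", "\u2500", "\u2500\u2500"]))  # [0, 1, 2]
```

Each extra ambiguous character on a row pushes whatever follows one more column to the right than the application expects, so a vertical line that should be straight turns into a diagonal.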

GoogleCodeExporter commented 9 years ago
This problem depends not only on the application but also on the font.
I hope the terminal emulator (like MinTTY) draws each character at the position
expected for the selected font.

For example, 00example.txt is a UTF-8 text file that assumes it is displayed
with a Japanese font on the terminal emulator.
I expect it to be displayed as in 01GOOD.png.
However, current MinTTY displays it as in 02BAD.png.

> Shouldn't the "MS Gothic" font therefore be considered legacy, and the likes
> of "Consolas" be used instead?

"MS Gothic" is NOT legacy. It is the most popular Japanese font. And ALL
fixed-pitch Japanese fonts have the same problem.

> (Also, what does Greek alpha have to do with the East Asian Ambiguous
> category?)

In fact, the "East Asian Ambiguous category" covers the characters that appear
both in the European and American character set standards and in the CJK
character set standards. (ASCII is excluded.)

Original comment by deenhe...@gmail.com on 16 Apr 2009 at 3:22

GoogleCodeExporter commented 9 years ago
deenheart, thank you very much for that example. I'm thoroughly confused about
this though, so I'll have to read the Unicode report you linked to properly.

Meanwhile, more questions:
- So there are two sets of ambiguous-width characters: a subset of actual CJK
characters, and also non-CJK non-ASCII characters such as line drawings and
Greek letters. Am I right to assume that both these sets should always have the
same width?
- Is there any sort of movement towards the one-column characters in Japanese?
(In other words: why is the two-column option called "legacy" in xterm and
PuTTY?)
- How does ncurses deal with these characters?
- Could the correct width be picked by looking at the selected font, without
bothering the user?
- In the attached PNG, "Lucida Console" is used to display the example text.
The CJK characters take up two character cells, even though the glyphs
themselves are only one column wide. Is mintty rendering them incorrectly, i.e.
should they only take up one column?

Original comment by andy.koppe on 21 Apr 2009 at 5:14

GoogleCodeExporter commented 9 years ago
Right, I think I finally get it: the "East Asian Ambiguous" category doesn't
actually contain any East Asian characters. Instead, it contains characters such
as Greek and Cyrillic ones that are rendered as halfwidth (i.e. one-column)
characters in non-East Asian usage, but as fullwidth (i.e. two-column)
characters in East Asian usage. "Legacy" doesn't refer to the ambiguous width
issue, but to pre-Unicode character sets.

Some of the questions remain though:
- How does ncurses deal with these characters?
- Could the correct width be picked by looking at the selected font, without
bothering the user?
- In the attached PNG, "Lucida Console" is used to display the example text.
The CJK characters take up two character cells, even though the glyphs
themselves are only one column wide. Is mintty rendering them incorrectly, i.e.
should they only take up one column?

Original comment by andy.koppe on 21 Apr 2009 at 8:29
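
The categories discussed above can be inspected directly. The following is a small Python sketch that assumes the standard `unicodedata` module, which exposes the East Asian Width property from UTR 11; the sample characters and the name `eaw_categories` are chosen here for illustration.

```python
import unicodedata

# Sample characters, written as escapes so the encoding of this page does not matter.
SAMPLES = {
    "Latin small a": "a",
    "Greek small alpha": "\u03b1",
    "Cyrillic small ya": "\u044f",
    "box drawing horizontal": "\u2500",
    "CJK ideograph": "\u6f22",
    "halfwidth katakana a": "\uff71",
}


def eaw_categories(samples):
    """Map each sample name to its East Asian Width category."""
    return {name: unicodedata.east_asian_width(ch) for name, ch in samples.items()}


for name, category in eaw_categories(SAMPLES).items():
    print(f"{name}: {category}")
```

Greek, Cyrillic and box-drawing characters come out as "A" (ambiguous), the ideograph as "W" (wide), the halfwidth katakana as "H" and the Latin letter as "Na" (narrow).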

GoogleCodeExporter commented 9 years ago
> the "East Asian Ambiguous" category doesn't actually contain any East Asian
> characters.
(snip)
> "Legacy" doesn't refer to the ambiguous width issue, but to pre-Unicode
> character sets.

Yes, that's right.

> - How does ncurses deal with these characters?

ncurses compiled with the '--enable-widec' option depends on wcwidth().
But Cygwin's wcwidth() is broken: it returns 1 for all characters.
(I am trying to fix it...)

> - Could the correct width be picked by looking at the selected font, without
> bothering the user?

In most cases, yes, it can.
However, occasionally I want to set the ambiguous character width to 1,
because quite a few applications do not handle character width correctly.

> - In the attached PNG, "Lucida Console" is used to display the example text.
> The CJK characters take up two character cells, even though the glyphs
> themselves are only one column wide. Is mintty rendering them incorrectly,
> i.e. should they only take up one column?

On a terminal emulator, we (= CJK language users) expect a CJK character to be
twice as wide as an alphabetic character, and to have an aspect ratio of 1:1.

In the PNG, I think that the CJK characters are too small.

However, because the aspect ratio of "Lucida Console" is not 2:1
(height:width), I think that it is difficult to display the CJK characters
correctly.

Original comment by deenhe...@gmail.com on 22 Apr 2009 at 1:55
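
To show why a wcwidth() that returns 1 for every character matters, here is a hedged Python sketch of a text formatter padding a field to a fixed number of columns. It does not call the C library: `always_one` merely imitates the behaviour described above, `unicode_width` is a simplified stand-in that only knows about wide and fullwidth characters, and `pad` is a hypothetical helper.

```python
import unicodedata


def always_one(ch: str) -> int:
    """Imitates a width function that reports 1 for every character."""
    return 1


def unicode_width(ch: str) -> int:
    """Simplified width: 2 for Wide/Fullwidth characters, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def pad(text: str, field_columns: int, width_of) -> str:
    """Right-pad text with spaces so that it fills field_columns columns,
    according to the supplied width function."""
    used = sum(width_of(ch) for ch in text)
    return text + " " * max(0, field_columns - used)


word = "\u65e5\u672c"  # two wide characters, four columns on screen
print(repr(pad(word, 6, always_one)))     # four spaces appended
print(repr(pad(word, 6, unicode_width)))  # two spaces appended
```

With the all-ones width function, the field ends up eight columns wide on screen instead of six, so anything aligned after it is shifted.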

GoogleCodeExporter commented 9 years ago
So no one would want CJK characters with width 1, as in the attached cjk1.png?
Actually they seem to be a bit wider than one cell in Lucida Console, as shown
in the attached lucida.png screenshot from an editor. Do you know why that is,
given that Lucida Console is meant to be a monospace font? It's the same for
Courier New.

Back to the ambiguous CJK category though. I'll implement an automatic scheme
based on looking at the width of Greek Alpha in the selected font. If you want
to switch ambiguous CJK width, you'll need to select an appropriate font. This
seems a better solution than forcing glyphs with the wrong width into a cell,
such as with the squashed Greek characters from MS Gothic. (If that's not
sufficient, I might consider a control sequence for overriding the automatic
detection.)

Original comment by andy.koppe on 23 Apr 2009 at 6:48
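
The automatic scheme proposed above could look roughly like the following Python sketch. It is not mintty's implementation: the `glyph_width` callback, the 1.5 threshold and the sample metrics are all assumptions made for illustration. Only the idea of comparing Greek alpha with a Latin letter in the selected font comes from this thread.

```python
from typing import Callable


def ambiguous_width_for_font(glyph_width: Callable[[str], float],
                             threshold: float = 1.5) -> int:
    """Guess the column count for ambiguous-width characters in a font.

    glyph_width is a caller-supplied (hypothetical) function that returns
    the advance width of a character in the selected font, in any unit.
    """
    narrow = glyph_width("a")       # Latin small letter a: always one cell
    probe = glyph_width("\u03b1")   # Greek small letter alpha: ambiguous
    return 2 if probe >= threshold * narrow else 1


# Invented metrics that mimic the font behaviour described in this thread.
gothic_like = {"a": 8.0, "\u03b1": 16.0}
consolas_like = {"a": 8.0, "\u03b1": 8.0}

print(ambiguous_width_for_font(gothic_like.__getitem__))    # 2
print(ambiguous_width_for_font(consolas_like.__getitem__))  # 1
```

A ratio test is used rather than an exact "twice as wide" comparison so that small rounding differences in font metrics do not flip the result.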

GoogleCodeExporter commented 9 years ago
Implemented font-based handling of ambiguous character width in r240 on trunk.
wintext.c already had a variable called "font_dualwidth", which seems to do the
job.

Original comment by andy.koppe on 23 Apr 2009 at 9:32

GoogleCodeExporter commented 9 years ago
Darn, font_dualwidth is unreliable. This breaks Courier New.

Original comment by andy.koppe on 23 Apr 2009 at 9:42

GoogleCodeExporter commented 9 years ago
Fixed font_dualwidth problem in r241.

Original comment by andy.koppe on 24 Apr 2009 at 4:45

GoogleCodeExporter commented 9 years ago
> So no one would want CJK characters with width 1, as in the attached cjk1.png?

No, that is bad for CJK language users.

> Actually they seem to be a bit wider than one cell in Lucida Console, as
> shown in the attached lucida.png screenshot from an editor. Do you know why
> that is, given that Lucida Console is meant to be a monospace font? It's the
> same for Courier New.

The font files for European and American languages (e.g. "Lucida Console") don't
include CJK characters. I think that the CJK characters in lucida.png are
displayed in "MS Gothic", as a result of "Font Fallback" and/or "Font Linking".
Please see the following page:

Globalization Step-by-Step: Fonts
http://msdn.microsoft.com/en-us/goglobal/bb688134.aspx

The base font (e.g. "Lucida Console") is designed as a fixed-pitch font, and so
is the substituted font (e.g. "MS Gothic"). However, the design of the
substituted font is not the same as the design of the base font.

> Implemented font-based handling of ambiguous character width in r240 on trunk.

It looks good. Thank you.

Original comment by deenhe...@gmail.com on 24 Apr 2009 at 1:55
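
As a toy model of the fallback behaviour described above (not how Windows font linking is actually implemented or configured), the Python sketch below picks the first font in a chain that covers a character. The font names and coverage ranges are invented for illustration, and `pick_font` is a hypothetical helper.

```python
def pick_font(ch: str, font_chain):
    """Return the name of the first font in the chain that covers ch,
    or None if no font in the chain has a glyph for it."""
    code_point = ord(ch)
    for name, ranges in font_chain:
        if any(low <= code_point <= high for low, high in ranges):
            return name
    return None


# Invented coverage tables: a Latin-only base font and a CJK fallback font.
CHAIN = [
    ("base (Latin only)", [(0x0020, 0x024F)]),
    ("fallback (CJK)", [(0x0020, 0x007E), (0x3040, 0x30FF), (0x4E00, 0x9FFF)]),
]

print(pick_font("a", CHAIN))       # base (Latin only)
print(pick_font("\u6f22", CHAIN))  # fallback (CJK)
```

Because the glyph then comes from a different font with its own metrics, its width need not be an exact multiple of the base font's cell width, which matches the observation about lucida.png.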

GoogleCodeExporter commented 9 years ago
> The font files for European and American languages (e.g. "Lucida Console")
> don't include CJK characters. I think that the CJK characters in lucida.png
> are displayed in "MS Gothic". I think that it is a result of "Font Fallback"
> and/or "Font Linking".

I see, that makes sense. I'd just assumed that those fonts have been extended to
cover all (or at least most) of Unicode in this globalised age.

Thanks for all your help and patience with my ignorance!

Original comment by andy.koppe on 24 Apr 2009 at 5:30

GoogleCodeExporter commented 9 years ago
Took fix to 0.3 branch in r250.

Original comment by andy.koppe on 24 Apr 2009 at 8:57

GoogleCodeExporter commented 9 years ago

Original comment by andy.koppe on 25 Apr 2009 at 1:51