Add functions to get the width in columns of a character

vstinner commented 13 years ago

BPO	12568
Nosy	@malemburg, @loewis, @terryjreedy, @vstinner, @benjaminp, @ezio-melotti, @merwok, @bitdancer, @serhiy-storchaka, @Vermeille, @ishigoya, @bianjp
Files	locale_width.patch width.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at =
created_at =
labels = ['type-feature', '3.8', 'expert-unicode']
title = 'Add functions to get the width in columns of a character'
updated_at =
user = 'https://github.com/vstinner'
```

bugs.python.org fields:

```python
activity =
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date =
closer = 'vstinner'
components = ['Unicode']
creation =
creator = 'vstinner'
dependencies = []
files = ['23401', '24773']
hgrepos = []
issue_num = 12568
keywords = ['patch']
message_count = 39.0
messages = ['140376', '140488', '141936', '145497', '145498', '145523', '145535', '145748', '145778', '155223', '155236', '155307', '155313', '155323', '155324', '155337', '155342', '155343', '155344', '155345', '155346', '155361', '155370', '155373', '155379', '155382', '156337', '156348', '181149', '238425', '255421', '297129', '297489', '297492', '297564', '297569', '298322', '323731', '329488']
nosy_count = 19.0
nosy_names = ['lemburg', 'loewis', 'terry.reedy', 'vstinner', 'benjamin.peterson', 'ezio.melotti', 'eric.araujo', 'Arfrever', 'r.david.murray', 'inigoserna', 'zeha', 'poq', 'Nicholas.Cole', 'tchrist', 'serhiy.storchaka', 'Socob', 'Guillaume Sanchez', 'ishigoya', 'bianjp']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue12568'
versions = ['Python 3.8']
```

vstinner commented 13 years ago

Some characters take more than one column in a terminal, especially CJK (Chinese, Japanese, Korean) characters. If you use such characters in a terminal without taking care of the width in columns of each character, the text alignment can be broken. Issue bpo-2382 is an example of this problem.

bpo-2382 and bpo-6755 have patches implementing such a function:

unicode_width.patch of bpo-2382 adds a unicode.width() method
ucs2w.c of bpo-6755 creates a new ucs2w module with two functions: unichr2w() (width of a character) and ucs2w() (width of a string)

Use test_ucs2w.py of bpo-6755 to test these new functions/methods.

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

In the bpo-2382 code, how is the Windows case supposed to work? Also, what about systems that don't have wcswidth? IOW, the patch appears to be incorrect.

I like the bpo-6755 approach better, except that it shouldn't be using hard-coded tables, but instead integrate with Python's version of the UCD. In addition, it should use an accepted, published strategy for determining the width, preferably coming from the Unicode consortium.

5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

I can attest that being able to get the columns of a grapheme cluster is very important for printing, because you need this to do correct linebreaking. There might be something you can steal from

http://search.cpan.org/perldoc?Unicode::GCString http://search.cpan.org/perldoc?Unicode::LineBreak

which implements UAX#14 on linebreaking and UAX#11 on East Asian widths.

I use this in my own code to help format Unicode strings by columns or lines. The right way would be to build this sort of knowledge into string.format(), but that is much harder, so an intermediary library module seems good enough for now.

vstinner commented 13 years ago

> There might be something you can steal from ...

I don't think that Python should reinvent the wheel. We should just reuse wcswidth().

Here is a simple patch exposing wcswidth() function as locale.width().

Example:

```python
>>> import locale
>>> text = '\u3042\u3044\u3046\u3048\u304a'
>>> len(text)
5
>>> locale.width(text)
10
>>> locale.width(' ')
1
>>> locale.width('\U0010abcd')
1
>>> locale.width('\uDC80')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
locale.Error: the string is not printable
>>> locale.width('\U0010FFFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
locale.Error: the string is not printable
```

I don't think that we need locale.width() on Windows because its console already has bigger issues with Unicode: see issue bpo-1602. If you want to correctly display non-ASCII characters on Windows, just avoid the Windows console and use a graphical widget.
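For readers who want to experiment with wcswidth() without applying the patch, a rough equivalent can be sketched with ctypes. This is only an illustration under stated assumptions: it assumes a POSIX C library that exports wcswidth() and is reachable through `ctypes.CDLL(None)`, the name `libc_wcswidth` is hypothetical, and the results depend on the platform and on the active locale.

```python
import ctypes
import locale


def libc_wcswidth(text):
    """Return the width reported by the C library's wcswidth().

    Returns None when wcswidth() is not reachable on this platform or
    when the C library considers the string non-printable (result -1).
    """
    try:
        # Assumption: on POSIX, CDLL(None) exposes the already-loaded libc.
        libc = ctypes.CDLL(None)
        func = libc.wcswidth
    except (OSError, TypeError, AttributeError):
        return None
    func.argtypes = [ctypes.c_wchar_p, ctypes.c_size_t]
    func.restype = ctypes.c_int
    width = func(text, len(text))
    return None if width < 0 else width


if __name__ == "__main__":
    try:
        # wcswidth() consults LC_CTYPE, so adopt the user's locale first.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass  # keep the default "C" locale if the environment is misconfigured
    print(libc_wcswidth("abc"))
    print(libc_wcswidth("\u3042\u3044\u3046\u3048\u304a"))
```

In the default "C" locale the C library typically treats non-ASCII characters as non-printable, which is why the sketch calls setlocale() before measuring anything.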

vstinner commented 13 years ago

Oh, unicode_width.patch of issue bpo-2382 implements the width on Windows using:

WideCharToMultiByte(CP_ACP, 0, buf, len, NULL, 0, NULL, NULL);

It computes the length of the byte string encoded to the ANSI code page. I don't know if it can be seen as the "width" of a character string in the console...

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

I think the WideCharToMultiByte approach is just incorrect.

I'm -1 on using wcswidth, though. We already have unicodedata.east_asian_width, which implements http://unicode.org/reports/tr11/ (UAX#11). The outcomes of this function are these:

F: full-width, width 2, compatibility character for a narrow char
H: half-width, width 1, compatibility character for a narrow char
W: wide, width 2
Na: narrow, width 1
A: ambiguous; width 2 in Asian context, width 1 in non-Asian context
N: neutral; not used in Asian text, so has no width. Practically, width can be considered as 1
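The mapping listed above can be sketched in Python as follows. This is a minimal illustration and not part of any patch attached to this issue: the helper name `eaw_width` and the `ambiguous_wide` flag are hypothetical, and the sketch deliberately ignores combining marks and control characters.

```python
import unicodedata


def eaw_width(text, ambiguous_wide=False):
    """Sum per-character widths derived only from East_Asian_Width."""
    total = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("F", "W"):
            total += 2  # fullwidth and wide characters
        elif eaw == "A" and ambiguous_wide:
            total += 2  # ambiguous, treated as wide in an Asian context
        else:
            total += 1  # H, Na, N, and A in a non-Asian context
    return total


print(eaw_width("abc"))
print(eaw_width("\u3042\u3044\u3046\u3048\u304a"))
```

Leaving the treatment of ambiguous characters to a flag keeps the decision with the caller, since the property itself does not say which context applies.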

5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

Martin v. Löwis <martin@v.loewis.de> added the comment:

> I think the WideCharToMultiByte approach is just incorrect.
>
> I'm -1 on using wcswidth, though.

Like you, I too seriously question using wcswidth() for this at all:

The wcswidth() function either shall return 0 (if pwcs points to a
null wide-character code), or return the number of column positions
to be occupied by the wide-character string pointed to by pwcs, or
return -1 (if any of the first n wide-character codes in the wide-
character string pointed to by pwcs is not a printable wide-
character code).

I would be willing to bet (a small amount of) money it does not correctly implement Unicode print widths, even though one would certainly *think* it does according to this:

 The wcswidth() function determines the number of column positions
 required for the first n characters of pwcs, or until a null wide
 character (L'\0') is encountered.

There are a bunch of "interesting" cases I would want it tested against.

> We already have unicodedata.east_asian_width, which implements http://unicode.org/reports/tr11/
>
> The outcomes of this function are these:
>
> F: full-width, width 2, compatibility character for a narrow char
>
> H: half-width, width 1, compatibility character for a narrow char
>
> W: wide, width 2
>
> Na: narrow, width 1
>
> A: ambiguous; width 2 in Asian context, width 1 in non-Asian context
>
> N: neutral; not used in Asian text, so has no width. Practically, width can be considered as 1

Um, East_Asian_Width=Ambiguous (EA=A) isn't actually good enough for this. And EA=N cannot be considered 1, either.

For example, some of the Marks are EA=A and some are EA=N, yet how many print columns they take varies. It is usually 0, but can be 1 at the start of the file/string or immediately after a linebreak sequence. Then there are things like the variation selectors which are never anything.

Now consider the many \pC code points, like

U+0009  CHARACTER TABULATION
U+00AD  SOFT HYPHEN 
U+200C  ZERO WIDTH NON-JOINER
U+FEFF  ZERO WIDTH NO-BREAK SPACE
U+2062  INVISIBLE TIMES

A TAB is its own problem but SHY we know is only width=1 immediately before a linebreak or EOF, and ZWNJ and ZWNBSP are both certainly width=0. So are the INVISIBLE * code points.
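To see why the East Asian Width property alone cannot identify these cases, one can print the general category next to it for the code points listed above. A small sketch, where the helper name `describe` is hypothetical:

```python
import unicodedata


def describe(ch):
    """Return (code point, general category, East Asian Width, name)."""
    return (
        "U+%04X" % ord(ch),
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
        # Control characters such as TAB have no name in the database.
        unicodedata.name(ch, "<unnamed>"),
    )


for ch in "\u0009\u00ad\u200c\ufeff\u2062\u0301":
    print(*describe(ch))
```

The category column (Cc, Cf, Mn) is what separates these characters from ordinary printable ones, so a width function has to consult it in addition to the East Asian Width value.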

Context:

Imagine you're trying to format a string so that it takes up exactly 20 columns: you need to know how many spaces to pad it with based on the print width. That is what bpo-12568 needs to do, and you have to do much more than East Asian Width properties.

I really do think that what bpo-12568 is asking for is to have the equivalent of the Perl Unicode::GCString's columns() method, and that you aren't going to be able to handle text alignment of Unicode with anything that is much less than that. After all, bpo-12568's title is "Add functions to get the width in columns of a character". I would very much like to compare what columns() thinks with what wcswidth() thinks. I bet wcswidth() is very simple-minded at best.
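The padding use case described above might look like the sketch below. The helper names (`columns`, `ljust_columns`) are hypothetical, and the width rule is a deliberately simplified assumption: wide and fullwidth characters count as 2, while nonspacing marks, enclosing marks and format characters count as 0. It is not the algorithm used by Unicode::GCString or by wcswidth().

```python
import unicodedata


def columns(text):
    """Estimate the print width of text under the simplified rule above."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # assumed to occupy no column of their own
        wide = unicodedata.east_asian_width(ch) in ("F", "W")
        total += 2 if wide else 1
    return total


def ljust_columns(text, width, fill=" "):
    """Pad text on the right so that it spans `width` columns."""
    return text + fill * max(0, width - columns(text))


padded = ljust_columns("\u3042\u3044", 20)
print(len(padded), columns(padded))
```

The point of the example is that the number of padding characters comes from the estimated column count and not from len(), which is what str.ljust() uses.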

I may of course be wrong.

--tom

vstinner commented 13 years ago

> I'm -1 on using wcswidth, though.

When you write text into a console on Linux (e.g. displayed by gnome-terminal or konsole), I suppose that wcswidth() can be used to compute the width of a line. It would help to fix bpo-2382.

Or do you think that wcswidth() gives the wrong result for this use case?

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

>> I'm -1 on using wcswidth, though.
>
> When you write text into a console on Linux (e.g. displayed by gnome-terminal or konsole), I suppose that wcswidth() can be used to compute the width of a line. It would help to fix bpo-2382.
>
> Or do you think that wcswidth() gives the wrong result for this use case?

No, I think that using it is not necessary. If you want to compute the width of a line, use unicodedata.east_asian_width. And yes, wcswidth may sometimes produce "incorrect" results (although it's probably correct most of the time).

15951948-b726-4ab9-b654-9b303170b7d2 commented 12 years ago

Could we have an update on the status of this? I ask because if 3.3 is going to (finally) fix unicode for curses, it would be really nice if it were possible to calculate the width of what's being displayed! It looks as if there was never quite agreement on the proper API....

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

Nicholas: I consider this issue fixed. There already *is* an API to compute the width of a character. Closing this as "works for me".

15951948-b726-4ab9-b654-9b303170b7d2 commented 12 years ago

Martin: sorry to be completely dense, but I can't get this to work properly with the python3.3a1 build. Could you post some example code?

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

Please see the attached width.py for an example.

4676a0b2-3c88-41fa-880a-895d3d0b2769 commented 12 years ago

Martin, I think you meant to write "if w == 'A':". Some very common characters have ambiguous widths though (e.g. the Greek alphabet), so you can't just raise an error for them.

http://unicode.org/reports/tr11/ says: "Ambiguous characters occur in East Asian legacy character sets as wide characters, but as narrow (i.e., normal-width) characters in non-East Asian usage."

So in practice applications can treat ambiguous characters as narrow by default, with a user setting to use legacy (wide) width.

As Tom pointed out there are also a bunch of zero width characters, and characters with special formatting like tab, soft hyphen, ...

5c59cbd7-8186-4351-8391-b403f3a3a73f commented 12 years ago

I would encourage you to look at the Perl CPAN module Unicode::LineBreak, which fully implements tr11. It includes Unicode::GCString, a class that has a columns() method to determine the print columns. This is very fancy in the case of Asian widths, but of course there are many other cases too.

If you'd like, I can show you a program that uses these, a rewrite of the standard Unix fmt(1) filter that works properly on Unicode column widths.

--tom

15951948-b726-4ab9-b654-9b303170b7d2 commented 12 years ago

Martin and Poq: I think the sample code shows up a real problem. "Ambiguous" characters according to Unicode may be rendered by curses in different ways.

Don't we need a function that actually reports how curses is going to print a given string, rather than just reporting what the unicode standard says?

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

> Martin, I think you meant to write "if w == 'A':". Some very common characters have ambiguous widths though (e.g. the Greek alphabet), so you can't just raise an error for them.

That's precisely why I don't think this should be in the library, but in the application. Application developers who need that also need to concern themselves with the border cases, and decide on how they need to resolve them.

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

> I would encourage you to look at the Perl CPAN module Unicode::LineBreak, which fully implements tr11.

Thanks for the pointer!

> If you'd like, I can show you a program that uses these, a rewrite of the standard Unix fmt(1) filter that works properly on Unicode column widths.

I believe there can't be any truly "proper" implementation, as you can't be certain how the terminal will handle these itself. In any case, anybody who is interested in contributing a patch should also be capable of understanding the source of Unicode::LineBreak.

5c59cbd7-8186-4351-8391-b403f3a3a73f commented 12 years ago

Martin v. Löwis <martin@v.loewis.de> added the comment:

>> Martin, I think you meant to write "if w == 'A':". Some very common characters have ambiguous widths though (e.g. the Greek alphabet), so you can't just raise an error for them.
>
> That's precisely why I don't think this should be in the library, but in the application. Application developers who need that also need to concern themselves with the border cases, and decide on how they need to resolve them.

The column-width of a string is not an application issue. It is well-defined by Unicode. Again, please see how we've done it in Perl, where tr11 is fully implemented. The columns() method from Unicode::GCString always gives the right answer per the Standard for any string, even what you are calling ambiguous ones.

This is not an applications issue -- at all.

--tom

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

> Don't we need a function that actually reports how curses is going to print a given string, rather than just reporting what the unicode standard says?

That may be useful, but

a) this patch doesn't provide that, and b) it may not actually be possible to implement such a change in a portable way, as there may be no function exposed by the curses implementation that provides this information.

To put my closing this issue differently: I rejected the patch that Victor initially submitted. If anybody wants to contribute a different patch that uses a different strategy, please submit a new issue.

5c59cbd7-8186-4351-8391-b403f3a3a73f commented 12 years ago

Martin v. Löwis <martin@v.loewis.de> added the comment:

>> I would encourage you to look at the Perl CPAN module Unicode::LineBreak, which fully implements tr11.
>
> Thanks for the pointer!
>
>> If you'd like, I can show you a program that uses these, a rewrite of the standard Unix fmt(1) filter that works properly on Unicode column widths.
>
> I believe there can't be any truly "proper" implementation, as you can't be certain how the terminal will handle these itself.

Hm. I think we may not be talking about the same thing after all.

If we're talking about the Curses library, or something similar, this is not the same. I do not think Curses has support for combining characters, right to left text, wide characters, etc.

However, Unicode does, and defines the column width for those.

I have an illustration of what this looks like in the picture in the very last recipe, #44, in

http://training.perl.com/scripts/perlunicook.html

That is what I have been talking about by print widths. It's running in a Mac terminal emulator, and unlike the HTML which grabs from too many fonts, the terminal program does the right thing with the widths.

Are we talking about different things?

--tom

4676a0b2-3c88-41fa-880a-895d3d0b2769 commented 12 years ago

It seems this is a bit of a minefield...

GNOME Terminal/libvte has an environment variable (VTE_CJK_WIDTH) to override the handling of ambiguous width characters. It bases its default on the locale (with the comment 'This is basically what GNU libc does').

urxvt just uses system wcwidth.

Xterm uses some voodoo to decide between system wcwidth and mk_wcwidth(_cjk): http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

I think the simplest solution is to just expose libc's wc(s)width. It is widely used and is most likely to match the behaviour of the terminal.

FWIW I wrote a little script to test the widths of all Unicode characters, and came up with the following logic to match libvte behaviour:

```python
import unicodedata

def wcwidth(c, legacy_cjk=False):
    # Tab, CR, LF, backspace, vertical tab, form feed: effect depends on the terminal.
    if c in u'\t\r\n\10\13\14': raise ValueError('character %r has no intrinsic width' % c)
    # NUL, ENQ, BEL, SO, SI: treated as taking no columns.
    if c in u'\0\5\7\16\17': return 0
    if u'\u1160' <= c <= u'\u11ff': return 0 # hangul jamo
    if unicodedata.category(c) in ('Mn', 'Me', 'Cf') and c != u'\u00ad': return 0 # 00ad = soft hyphen
    eaw = unicodedata.east_asian_width(c)
    if eaw in ('F', 'W'): return 2
    if legacy_cjk and eaw == 'A': return 2
    return 1
```

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

Tom: I don't think Unicode::GCString implements UAX#11 correctly (but this is really out of scope of this issue). In particular, it contains an ad-hoc decision to introduce the EA_Z east-asian width that UAX#11 doesn't talk about.

In most cases, it's probably reasonable to introduce this EA_Z feature. However, there are some significant deviations from UAX#11 here:

Combining characters are given EA_Z in sombok/data/custom.pl, even though UAX#11 assigns A or N. UAX#11 points out that the advance width depends on whether the terminal performs character combination. It's not clear whether Unicode::GCString aims for "strict" UAX#11, or "advance width".
Control characters are also given EA_Z, even though UAX#11 gives them EA_N. In this case, it's neither UAX#11 width nor advance width, since control characters will have various effects on the terminal (in particular for the tab character).

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

poq: I still remain opposed to exposing wcswidth, since it is just as incorrect as any of the other solutions that people have circulated. I could agree to it if it was called "wcswidth", making it clear that it does whatever the C library does, with whatever semantics the C library wants to give to it (and an availability that depends on whether the C library supports it or not).

That would probably cover the curses use cases, except that it is not only incorrect with respect to Unicode, but also incorrect with respect to what the terminal may be doing. I guess users would use it anyway.

For Python's internal use, I could accept using the sombok algorithm. I wouldn't expose it, since it again would trick people into believing that it was correct in some sense. Perhaps calling it sombok_width might allow for exposing it.

4676a0b2-3c88-41fa-880a-895d3d0b2769 commented 12 years ago

Martin,

I agree that wcswidth is incorrect with respect to Unicode. However I don't think that's relevant at all. Python should only try to match the behaviour of the terminal.

Since terminals do slightly different things, trying to match them exactly - in all cases, on all systems - is virtually impossible. But AFAICT wcwidth should match the terminal behaviour on nearly all modern systems, so it makes sense to expose it.

15951948-b726-4ab9-b654-9b303170b7d2 commented 12 years ago

Poq: I agree. Guessing from the Unicode standard is going to lead to users having to write some complicated code that people are going to have to reinvent over and over, and is not going to be accurate with respect to curses. I'd favour exposing wcwidth.

Martin: I agree that there are going to be cases where it is not correct because the terminal does something strange, but what we need is something that gets as close as possible to what the terminal is likely to be doing (the Unicode standard itself is not really the issue for curses stuff). So whether it is called wcwidth or wcswidth I don't really mind, but I think it would be useful.

The other alternative is to include one of the other ideas that have been mentioned in this thread as part of the library, I suppose, so that people don't have to keep reinventing the wheel for themselves.

The one thing I really don't favour is shipping something that supports wide characters, but gives the users no way of guessing whether or not that is what they are printing, because that is surely going to break a lot of applications.

vstinner commented 12 years ago

> Martin: I agree that there are going to be cases where it is not correct because the terminal does something strange, but what we need is something that gets as close as possible to what the terminal is likely to be doing

Can't we expose wcswidth() as locale.strwidth() with a recipe explaining how to use unicodedata to get a "correct" result? At least until everyone correctly implements Unicode and Unicode stops evolving? :-)

--

For unicodedata, a function to get the width of a string would be more convenient than unicodedata.east_asian_width():

```python
>>> import unicodedata
>>> unicodedata.east_asian_width('abc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: need a single Unicode character as parameter
>>> 'abc'.ljust(unicodedata.east_asian_width(' '))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object cannot be interpreted as an integer
```

The function posted in msg155361 suggests that east_asian_width() alone is not enough to get the width in columns of a single character.

serhiy-storchaka commented 12 years ago

Has anyone tested wcswidth on FreeBSD or old Solaris? With non-UTF-8 locales?

terryjreedy commented 11 years ago

In this part of width.py,

```python
w = unicodedata.east_asian_width(c)
if c == 'A':
    # ambiguous
    raise ValueError("ambiguous character %x" % (ord(c)))
```

I presume that 'c' should be 'w'.

vstinner commented 9 years ago

Since no consensus was found on the definition of the function, and this issue has had no activity for 2 years, I close the issue as out of date.

serhiy-storchaka commented 8 years ago

I think this function would be very useful in many parts of the interpreter core and the standard library, from displaying tracebacks to formatting help.

Otherwise we are doomed to implement imperfect variants in multiple places.

vstinner commented 7 years ago

Since we failed to agree on this feature, I close the issue.

bitdancer commented 7 years ago

Interestingly, this just came up again in bpo-30717.

serhiy-storchaka commented 7 years ago

At least two other issues depend on this: bpo-17048 and bpo-24665.

If Victor has lost interest in this issue, I will take it. I'm going to push at least an imperfect solution, which may be improved over time.

vstinner commented 7 years ago

> At least two other issues depend on this: bpo-17048 and bpo-24665.

I removed the dependency from bpo-24665 (CJK support for textwrap) to this issue, since its current PR uses unicodedata.east_asian_width(), not the C function wcswidth().
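As a rough illustration of that approach (this is not the code from the bpo-24665 PR), a column-aware splitter can be built on unicodedata.east_asian_width() alone. The function name `split_columns` is hypothetical, ambiguous characters are assumed to be narrow, and zero-width characters are not handled.

```python
import unicodedata


def split_columns(text, max_columns):
    """Greedily split text into chunks no wider than max_columns columns."""
    if max_columns < 2:
        raise ValueError("max_columns must be at least 2 to fit a wide character")
    lines, current, used = [], [], 0
    for ch in text:
        width = 2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        if used + width > max_columns:
            # The next character would overflow: close the current chunk.
            lines.append("".join(current))
            current, used = [], 0
        current.append(ch)
        used += width
    if current:
        lines.append("".join(current))
    return lines


print(split_columns("\u3042\u3044\u3046\u3048\u304a", 4))
```

A real textwrap integration would also have to respect word boundaries, which this sketch ignores.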

vstinner commented 7 years ago

You need users who use CJK and understand locale issues, especially the width of characters. Maybe ask Xiang Zhang and Naoki INADA?

1adc9da5-02bf-4936-8c4c-db74e9a98c0e commented 7 years ago

Hello,

I come from bugs.python.org/issue30717 . I have a pending PR that needs review ( https://github.com/python/cpython/pull/2673 ) adding a function that breaks unicode strings into grapheme clusters (aka what one would intuitively call "a character"). It's based on the grapheme cluster breaking algorithm from TR29.

Let me know if this is of any relevance.

Quick demo:
>>> a=unicodedata.break_graphemes("lol")
>>> list(a)
['l', 'o', 'l']
>>> list(unicodedata.break_graphemes("lo\u0309l"))
['l', 'ỏ', 'l']
>>> list(unicodedata.break_graphemes("lo\u0309\u0301l"))
['l', 'ỏ́', 'l']
>>> list(unicodedata.break_graphemes("lo\u0301l"))
['l', 'ó', 'l']
>>> list(unicodedata.break_graphemes(""))
[]
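Since break_graphemes() comes from a pull request under review and not from a released module, a much cruder approximation can be sketched with the existing unicodedata API. It only attaches characters with a nonzero canonical combining class to the preceding character. It is not the TR29 algorithm and will mis-split many real clusters (Hangul jamo sequences and emoji joiner sequences, for example). The name `rough_clusters` is hypothetical.

```python
import unicodedata


def rough_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # nonzero combining class: extend the cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


print(rough_clusters("lo\u0309\u0301l"))
```

Counting columns per cluster instead of per code point is one way to avoid charging a column for each combining mark.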

terryjreedy commented 6 years ago

I suggest reclosing this issue, for the same reason I suggested closure of bpo-24665 in msg321291: abstract Unicode 'characters' (graphemes) do not, in general, have fixed physical widths of 0, 1, or 2 n-pixel columns (or spaces). I based that fairly long message on IDLE's multiscript font sample as displayed on Windows 10. In that context, for instance, the width of (fixed-pitch) East Asian characters is about 1.6, not 2.0, times the width of fixed-pitch ASCII characters. Variable-width Tamil characters average about the same. The exact ratio depends on the Latin font used.

I did more experiments with Python started from Command Prompt with code page 437 or 65001 and characters 20 pixels high. The Windows console only allows 'fixed pitch' fonts. East Asian characters, if displayed, are expanded to double width.

However, European characters are not reliably displayed in one column. The width depends on both the font selected when a character is entered and the current font. The 20 latin1 characters in '¢£¥§©«®¶½ĞÀÁÂÃÄÅÇÐØß' usually display in 20 columns. But if they are entered with the font set to MSGothic, the '§' and '¶' are each displayed in the middle of 2 columns, for 22 total. If the font is changed to MSGothic after entry, the '§' and '¶' are shifted 1/2 column right to overlap the following '©' or '½' without changing the total width. Greek and Cyrillic characters also sometimes take two columns.

I did not test whether the font size (pixel height) affects horizontal column spacing.

vstinner commented 6 years ago

I close the issue as WONTFIX.


Add functions to get the width in columns of a character #56777