python / cpython

The Python programming language
https://www.python.org

Display of Unicode strings with bidi characters #86456

Closed 2c45e5bd-3f35-4c26-b24b-b484594f2279 closed 3 years ago

2c45e5bd-3f35-4c26-b24b-b484594f2279 commented 3 years ago
BPO 42290
Nosy @terryjreedy, @ezio-melotti, @stevendaprano

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['type-bug', 'invalid', 'expert-unicode']
title = 'Display of Unicode strings with bidi characters'
updated_at =
user = 'https://bugs.python.org/xxm'
```
bugs.python.org fields:
```python
activity =
actor = 'terry.reedy'
assignee = 'none'
closed = True
closed_date =
closer = 'terry.reedy'
components = ['Unicode']
creation =
creator = 'xxm'
dependencies = []
files = []
hgrepos = []
issue_num = 42290
keywords = []
message_count = 3.0
messages = ['380534', '380536', '380943']
nosy_count = 4.0
nosy_names = ['terry.reedy', 'ezio.melotti', 'steven.daprano', 'xxm']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue42290'
versions = ['Python 3.6']
```

2c45e5bd-3f35-4c26-b24b-b484594f2279 commented 3 years ago
When printing a string that looks like an assignment and contains the Unicode character ܯ (U+072F) on the command line, we get an unexpected result.
Example A:
>>> print(chr(1839)+" = 1")
ܯ = 1

Similar problems occur with many other Unicode characters.

stevendaprano commented 3 years ago

Works for me:

>>> chr(1839)+'1'
'ܯ1'

You are mixing a right-to-left code point (DHALATH) with a left-to-right code point (digit 1). The result depends on the quality of your console or terminal. Try using a different terminal.

On my system, the terminal displays the DHALATH on the left, and the digit 1 on the right; when pasted into my browser, it displays them in the reverse order. I don't know which is correct: bidirectional text is complex and I don't know the rules for mixing characters with different bidirectional classes.

But whichever display is correct, this has nothing to do with Python. It depends on the quality of the bidirectional text rendering of the browser and the terminal.

If your terminal displays the wrong results, that's a bug in the terminal. What terminal are you using, in what OS? Try using a different terminal.

You can check that Python is doing the right thing:

>>> s = chr(1839)+'1'
>>> s == '\N{SYRIAC LETTER PERSIAN DHALATH}1'
True

If your system reports True, then Python has made the string you asked for, and the result of printing depends on the capabilities of the terminal, and the available glyphs in the typeface used by the terminal. There's nothing Python can do about that.
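One way to confirm the stored order independently of how a terminal draws it is to list the code points one by one. The following is a minimal sketch using only the standard library; the helper name `describe` is hypothetical and not something from this thread:

```python
import unicodedata

def describe(s):
    """Return (code point, name) pairs in logical (storage) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in s]

# The string from the original report, built the same way.
s = chr(1839) + " = 1"

# Printing one character per line sidesteps bidi reordering by the terminal.
for code_point, name in describe(s):
    print(code_point, name)
```

If the first line printed is U+072F and the last is U+0031, the string is stored in the order it was written, whatever the terminal shows when the whole string is printed at once.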

terryjreedy commented 3 years ago

Xia, when saying 'unexpected', one usually needs to also say what was expected. When discussing mixed direction chars, we need to be especially careful in describing what we see with different terminals, different browsers, and different OSes.

Steven: On Windows, I see the same thing: "Daleth 1" prints in that order in both IDLE's Shell and Python's REPL in Command Prompt (with the Daleth shown as a replacement box in the latter) but is reversed here, 'ܯ1', in Firefox (and the same in Microsoft Edge). But, I just discovered, the two browsers (and Notepad and LibreOffice Writer and likely other text editors) treat runs of Latin digits specially: "Daleth a" pastes in that order, 'ܯa', and "Daleth 1 2" pastes as "1 2 Daleth", 'ܯ12'.

The block, but not the individual digits, is reversed. This allows R2L writers to use what are now the global digits. In Arabic, numbers are written and read R2L, low order to high. So Europeans, used to writing and reading L2R, high to low, kept the same order. Perhaps the bidi property of the digits in the Unicode database is different from that of other Latin chars.

It seems that '=' is also bidirectional but, properly, is not treated as a digit. "Daleth = 1" is reversed in both browsers and text editors to read 'Daleth' 'equals' 'one' when read right to left.
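The bidi property of each character can be looked up with `unicodedata.bidirectional()` from the standard library. A small sketch, assuming a Python build whose Unicode database includes U+072F; the variable names `samples` and `classes` are only for illustration:

```python
import unicodedata

# Characters discussed above: the Syriac letter, a Latin letter,
# a digit, the equals sign, and a space.
samples = [chr(1839), "a", "1", "=", " "]

# Map each code point to its bidirectional class from the Unicode database.
classes = {f"U+{ord(ch):04X}": unicodedata.bidirectional(ch) for ch in samples}

for code_point, bidi_class in classes.items():
    print(code_point, bidi_class)
```

The digit reports `EN` (European Number) while the Latin letter reports `L` (strong left-to-right), and '=' reports `ON` (Other Neutral), so digits and '=' are indeed classified differently from ordinary Latin letters. The Syriac letter should report a strong right-to-left class.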

The general rule is that blocks of same-direction chars are written appropriately as encountered. It seems that the classification of some characters depends on the context. The following is as expected:
>>> 'ab'+chr(1837)+chr(1838)+chr(1839)+'cd'
'abܭܮܯcd'
with the R2L triplet reversed.
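The "blocks of same-direction chars" idea can be approximated by grouping adjacent characters by their bidi class. This is a deliberately simplified sketch, not the Unicode Bidirectional Algorithm: it treats only the classes `R` and `AL` as right-to-left and lumps everything else (including digits and neutrals) in with left-to-right. The names `direction` and `runs` are hypothetical helpers:

```python
import unicodedata
from itertools import groupby

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes

def direction(ch):
    """Crude two-way split; the real algorithm has many more cases."""
    return "RTL" if unicodedata.bidirectional(ch) in RTL_CLASSES else "LTR"

def runs(s):
    """Split s into maximal runs of characters sharing one direction."""
    return [(d, "".join(group)) for d, group in groupby(s, key=direction)]

s = "ab" + chr(1837) + chr(1838) + chr(1839) + "cd"

# Show code points rather than glyphs so the output is not itself reordered.
for d, text in runs(s):
    print(d, [f"U+{ord(ch):04X}" for ch in text])
```

The string splits into three runs in storage order: LTR, RTL, LTR. A renderer that follows the bidi rules draws the middle run right to left, which is why the triplet appears reversed on screen while the stored order is unchanged.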

In any case, Steven is correct that Python correctly stores chars in the order given and that there is no Python bug.