jupyterlab / jupyterlab

JupyterLab computational environment.
https://jupyterlab.readthedocs.io/

Jupyter misdisplying Python lists with Arabic and alphanumeric elements #3846

Open oryxius opened 6 years ago

oryxius commented 6 years ago

Hello everyone,

I am a computational linguist who uses Python and Jupyter to work on Arabic. Along with Spyder, I have found it to be the best IDE for my purposes. Of course, Markdown support gives Jupyter a clear edge. But I recently ran into an issue when printing lists (and tuples and dictionaries) that contain both Arabic string elements and alphanumeric elements. For example, if you run:

en = '7X'
print (en)
ar = 'عربي'
print (ar)
print ([en, ar])
print ([ar, en])

[screenshot: capture]

Somehow, if the alphanumeric element ends with one or more letters, those letters jump and are displayed with the Arabic element. I have posted this issue on Stack Overflow, but haven't received any solutions so far.

jasongrout commented 6 years ago

Thanks! Let's see if we can narrow down which part of the ecosystem is causing the issue. It appears that this happens even in IPython 6.2.1 (Python 3.6.4):

Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:54:01) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: en = '7X'
   ...: print (en)
   ...: ar = 'عربي'
   ...: print (ar)
   ...: print ([en, ar])
   ...: print ([ar, en])
   ...: 
7X
عربي
['7X', 'عربي']
['عربي', '7X']

So it looks like the problem is not in JupyterLab (this repo), but in something much more fundamental. In fact, trying with plain Python (i.e., no Jupyter involved) also shows the issue for me.

Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:54:01) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> en = '7X'
>>> print (en)
7X
>>> ar = 'عربي'
>>> print (ar)
عربي
>>> print ([en, ar])
['7X', 'عربي']
>>> print ([ar, en])
['عربي', '7X']

If you try this in pure python at the command line, does it also give the problem for you? If so, it sounds like it is a much deeper issue with the language, not with Jupyter.

jasongrout commented 6 years ago

If it is a python problem, it's best to report it to the python issue tracker: https://bugs.python.org/

ian-r-rose commented 6 years ago

Hang on, when I try it in both python and IPython, I get the expected result. When I try it in the classic notebook or JupyterLab, I get the incorrect result.

So on my machine, at least, it is a notebook problem.

jasongrout commented 6 years ago

Thanks for confirming! What version of python are you using? I'm using the conda-forge 3.6.4-0 package for macOS, in the OS X terminal (in case that has anything to do with it).

ian-r-rose commented 6 years ago

I am using 3.6.0 at the moment, on Ubuntu.

jasongrout commented 6 years ago

Are you using the same python that the notebook and lab are using? Or are you using the system python in one case and a different python in the other case?

ian-r-rose commented 6 years ago

It's the same conda-installed python.

jasongrout commented 6 years ago

In contrast, when I use the macOS system python, things appear to work correctly:

Python 2.7.11 (default, Dec 26 2015, 17:47:15) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> en = '7X'
>>> print (en)
7X
>>> ar = 'عربي'
>>> print (ar)
عربي
>>> print ([en, ar])
['7X', '\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a']
>>> print ([ar, en])
['\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a', '7X']
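
A possible reading of why the Python 2 transcript above looks fine: there the literal is a byte string, and the list repr escapes every non-ASCII byte, so no right-to-left characters ever reach the terminal. The sketch below reproduces that effect in Python 3 with the built-in `ascii()`. The variable names `utf8_repr` and `escaped` are made up for illustration, and the Arabic word is written as escapes so the example does not depend on how this page is rendered.

```python
en = "7X"
ar = "\u0639\u0631\u0628\u064a"  # the Arabic word from the original post

# Python 2 stored the literal as UTF-8 bytes, so the list repr showed byte escapes.
utf8_repr = repr(ar.encode("utf-8"))

# ascii() escapes every non-ASCII character, so the printed text contains
# nothing that a terminal or browser would reorder for display.
escaped = ascii([ar, en])

print(utf8_repr)
print(escaped)
```

This only hides the Arabic text behind escapes, so it is a diagnostic aid rather than a fix for anyone who needs to read the output.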

@ian-r-rose - what exact python conda package are you using? (channel, build, etc.)

ian-r-rose commented 6 years ago

[image]

jasongrout commented 6 years ago

So you're using the Anaconda python package, not the conda-forge one? Can you try with the conda-forge one?

ian-r-rose commented 6 years ago

Still works fine with the conda-forge one: [screenshot]

jasongrout commented 6 years ago

Very weird, then. @oryxius - what do you see with python, and what is your OS and python package?

jasongrout commented 6 years ago

I can't read Arabic, but it does appear that the Arabic letters in @ian-r-rose's screenshot are different than in the example above. @ian-r-rose has

[screenshot: screen shot 2018-02-09 at 8 11 51 am]

but the example in the original post has

[screenshot: screen shot 2018-02-09 at 8 12 10 am]

(Also, @ian-r-rose, I noticed that your terminal is somewhat transparent, and I can read what's underneath - maybe useful for you to keep in mind when posting screenshots :).

ian-r-rose commented 6 years ago

Yeah, I noticed that, but since it's just this issue :)

jasongrout commented 6 years ago

@ian-r-rose how did you get the Arabic characters in your screenshot? They look totally different (but I don't know - maybe they are actually the same?)

ian-r-rose commented 6 years ago

I copied and pasted from the first post. I don't know how to get them otherwise.

ian-r-rose commented 6 years ago

I should note that when I paste them into the notebook interface they look the same as the initial screenshot, so...maybe flaky font rendering?
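
One way to settle whether two strings that look different on screen are actually the same is to compare their code points instead of their rendering. This is a minimal sketch; `describe`, `pasted`, and `reversed_copy` are hypothetical names, and `pasted` merely stands in for text copied from another window.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs in logical (storage) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

ar = "\u0639\u0631\u0628\u064a"  # the word from the original post, as escapes
pasted = ar                      # stand-in for text pasted from elsewhere
reversed_copy = ar[::-1]         # what a genuinely reversed string would be

for code, name in describe(ar):
    print(code, name)

print(pasted == ar)         # same code points, whatever the font shows
print(reversed_copy == ar)  # a truly reversed string compares unequal
```

If the comparison succeeds, any visible difference comes from the terminal or font (missing bidi reordering or letter shaping), not from the data.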

ian-r-rose commented 6 years ago

After some more digging:

I am wondering whether this is a browser bug: the message coming in from the websocket connection looks okay to me. But then when it is parsed it winds up wrong. In the normal JS console on Firefox, if I enter

JSON.parse("{ \"content\": [\"\u0639\u0631\u0628\u064a\", \"7X\"] }")

I get

content: Array [ "عربي", "7X" ]

Edit: this reproduces the issue in both Chrome and Firefox:

JSON.parse("\"'\u0639\u0631\u0628\u064a', '7X'\"")
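
The same check can be done outside the browser. The sketch below parses equivalent JSON payloads with Python's `json` module and inspects the stored order of the characters instead of their on-screen order. The names `message` and `as_text` are illustrative only; if the assertions about order hold, the parsed data is intact and only the display is affected.

```python
import json

# Raw strings keep the \u escapes intact so the JSON parser decodes them.
message = json.loads(r'{ "content": ["\u0639\u0631\u0628\u064a", "7X"] }')
payload = '"' + r"'\u0639\u0631\u0628\u064a', '7X'" + '"'
as_text = json.loads(payload)

ar, en = message["content"]
print([f"U+{ord(ch):04X}" for ch in ar])  # code points in storage order
print(en)

# In storage order the Arabic word comes first, then '7', then 'X'.
print(as_text.index("\u064a") < as_text.index("7") < as_text.index("X"))
```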

jasongrout commented 6 years ago

@ian-r-rose - interestingly, your browser console experiment shows the bug in Firefox, but not in Chrome, for me.

ian-r-rose commented 6 years ago

Me too, though in Chrome the whole message still gets parsed incorrectly (for reasons I have not figured out)

ian-r-rose commented 6 years ago

Also, I have no idea why your python interpreter is also showing this bug, @jasongrout

jasongrout commented 6 years ago

This is getting more and more weird. It seems that there are bugs across multiple applications regarding this.

oryxius commented 6 years ago

@jasongrout & @ian-r-rose Thank you both for the quick response. First, I am using this on a Microsoft Surface Studio running Windows 10. I get the problem in both the classic Jupyter Notebook and JupyterLab, which I have in WinPython-64bit-3.6.3.0Qt5. The code prints correctly in WinPython's Spyder IDE.

oryxius commented 6 years ago

@jasongrout Yes, the Arabic in @ian-r-rose's screenshot does not display accurately. It is basically displaying the Arabic letters of the word in reverse order (or so it appears to the viewer) and in their unconnected forms. Normally this is a Unicode support issue in simple text terminals.

jasongrout commented 6 years ago

I wonder if, ironically, the lack of proper support for display is what makes it work.

oryxius commented 6 years ago

Also, I tested it in four browsers: Chrome, Firefox, Edge, and Opera and it appears in all of them.

ian-r-rose commented 6 years ago

Yeah, I wonder if they got reversed in my copy-paste buffer.

It seems to me like this is indeed a RTL vs LTR error, specifically which parts of the string get assigned LTR and which get assigned RTL (cf. discussion here and here). If I enter

JSON.parse("\"\u0639X7\"")

it displays "عX7" as expected (or, at least, how I as an English speaker would expect). If, however, I enter

JSON.parse("\"\u06397X\"")

it displays "ع7X". That is to say, the numeric character gets assigned to the RTL portion of the string. In the original example, the browser string parser was not knowledgeable enough about Python syntax to know to assign the 7 to its list member, rather than the other one.
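
The guess above matches how the Unicode bidirectional algorithm classifies characters: digits are "European numbers" (EN), which are weak, and one of its rules (W2) treats an EN as an "Arabic number" (AN) when the nearest preceding strong character is an Arabic letter (AL). The sketch below looks up the classes with `unicodedata.bidirectional` and applies rule W2 only. It is not a full bidi implementation, and the helper names are hypothetical.

```python
import unicodedata

STRONG = {"L", "R", "AL"}

def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

def apply_w2(classes):
    """Rule W2 only: an EN whose nearest preceding strong class is AL becomes AN."""
    resolved = []
    last_strong = None
    for bidi_class in classes:
        if bidi_class in STRONG:
            last_strong = bidi_class
        elif bidi_class == "EN" and last_strong == "AL":
            bidi_class = "AN"
        resolved.append(bidi_class)
    return resolved

def resolved_class_of(text, target):
    """Class of the first occurrence of `target` after applying W2."""
    raw = [bidi_class for _, bidi_class in bidi_classes(text)]
    return apply_w2(raw)[text.index(target)]

ar = "\u0639\u0631\u0628\u064a"
samples = [repr(["7X", ar]), repr([ar, "7X"]), "\u0639X7", "\u06397X"]
for sample in samples:
    print([cls for _, cls in bidi_classes(sample)], resolved_class_of(sample, "7"))
```

Under this rule the `7` stays a European number when the Latin element comes first, but is treated as an Arabic number when it follows the Arabic word, even across the quote, comma, and space that separate the two list elements. The `X` is a strong left-to-right character, so it starts a new left-to-right run after the digit.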

At least, this is my guess about what is going on. As for a fix, this seems really tough, since it is happening at a very low level in browser unicode support. I fear that fixes we would try would end up breaking other things.
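
One user-side workaround that might be worth testing, without touching the front end, is to wrap each element in Unicode directional isolates (U+2066 LEFT-TO-RIGHT ISOLATE and U+2069 POP DIRECTIONAL ISOLATE) so that one element's direction cannot leak into its neighbor. This is only a sketch: `isolated_repr` and `strip_isolates` are hypothetical helpers, and whether a given browser, terminal, or notebook output area honors the isolates is an assumption that would need checking.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolated_repr(items):
    """Format a list like repr(), wrapping each element in a directional isolate."""
    parts = [f"{LRI}{item!r}{PDI}" for item in items]
    return "[" + ", ".join(parts) + "]"

def strip_isolates(text):
    """Remove the invisible isolate characters again, e.g. before copying output."""
    return text.replace(LRI, "").replace(PDI, "")

ar = "\u0639\u0631\u0628\u064a"
shown = isolated_repr([ar, "7X"])
print(shown)
print(strip_isolates(shown) == repr([ar, "7X"]))
```

The trade-off is that the output now contains invisible control characters, which travel along when the text is copied, so stripping them before further processing is advisable.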

oryxius commented 6 years ago

Here is how it displays in Spyder: [screenshot: capture2]

The console basically right-aligns any string or list that begins in Arabic.

ian-r-rose commented 6 years ago

cc @minrk and @Carreau , who have a deeper knowledge of Unicode than I.

jasongrout commented 6 years ago

And CC also @afshin, who can read Arabic, and @samarsultan, who did lots of work on bidi in the classic Notebook (e.g., https://github.com/jupyter/notebook/pull/2357, https://github.com/jupyter/notebook/issues/2178, https://github.com/jupyter/notebook/issues/2156)

Carreau commented 6 years ago

Ouch just getting the ping and this one is nasty. The other thing we need to test is whether it works correctly when flipping the classic notebook interface to use RTL layout (which you can do in the command palette).

jasonzhao2021 commented 2 years ago

It looks like this is caused by mixed-direction typesetting. See https://www.w3.org/TR/alreq/#h_direction: "Arabic script is written from right to left. Numbers, even Arabic numbers, are written from left to right, as is text in a script that is normally left-to-right. When the main script is Arabic, the layout and structure of pages and documents are also set from right to left."

All the data is OK, but the JavaScript object's toString function returns the wrong result: [image]