wcswidth() returning -1 not handled (and newline in PS1 not handled)

andychu commented 5 years ago

I just merged this, but when I tested it against some random unicode prompt, libc.wcswidth() is returning -1? That causes an AssertionError later.

I reduced it to this, which works in bash:

PS1=$'\xe2\x86\x92'

Can you reproduce this on your machine? Either way, we should handle the -1 return value.

But I'm not sure why this is happening. Maybe the '\x01 stripping isn't right? That would be an unprintable character.

https://unix.stackexchange.com/questions/25903/awesome-symbols-and-characters-in-a-bash-prompt

Originally posted by @andychu in https://github.com/oilshell/oil/pull/368#issuecomment-508335695

andychu commented 5 years ago

@jyn514 Here's a simpler repro. (not sure why the last one was flaky)

A newline in the prompt causes this problem. So I think we should check -1 and then return the fallback... or maybe somehow indicate a warning? newlines in prompts are probably not uncommon.

Maybe we have to additionally split lines? And only calculate width of last line? So there could be 2 changes here -- check -1, and account for newlines in another way.

andy@lisa:~/git/oilshell/oil$ bin/osh --rcfile /dev/null
osh$ PS1=$'\n'

ls 
 .git/                 .gitignore            .mypy_cache/          .swp                  .travis.yml           .vagrant/             INSTALL.txt           LICENSE.txt           Makefile              NOTES-cpython.txt     
 NOTES.txt             Python-2.7.13/        README.md             Rplots.pdf            TODO-completion.txt   TODO.txt              Vagrantfile           __init__.py           __pycache__/          _bin/                 
 _build/               _chroot/              _deps/                _devbuild/            _release/             _tmp/                 array.patch           asdl.txt              asdl/                 benchmarks/           
 bin/                  build/                configure             core/                 demo/                 devtools/             dir/                  doc/                  edit.txt              fastlex.pyi           
 fastlex.so            file                  frontend/             gold/                 install               libc.pyi              libc.so               line_input.so         local.sh              logs-0.6.pre13.tar    
 metrics/              misc/                 mycpp/                native/               oil-version.txt       oil_lang/             opy/                  osh/                  ovm2/                 pgen2/                
 portable-rules.mk     posix_.pyi            posix_.so             pylib/                r.txt                 release-notes.txt     release-roadmap.txt   spec/                 test/                 testdata/             
 tmp.txt               tools/                types/                uftrace.data.old/     uftrace.data/         vendor/               web/                  
Traceback (most recent call last):
  File "/home/andy/git/oilshell/oil/core/comp_ui.py", line 89, in PrintCandidates
    self._PrintCandidates(*args)
  File "/home/andy/git/oilshell/oil/core/comp_ui.py", line 374, in _PrintCandidates
    self._ReturnToPrompt(num_lines+1)
  File "/home/andy/git/oilshell/oil/core/comp_ui.py", line 309, in _ReturnToPrompt
    assert last_prompt_len != -1
AssertionError

jyn514 commented 5 years ago

I'm not sure that the autocomplete UI would work even if we only counted the last line, that's something to look into. -1 should probably be handled the same way as other errors, by just counting the number of bytes. We should also print a warning if the prompt is invalid or the UTF8 locale isn't installed instead of silently failing.

jyn514 commented 5 years ago

From https://github.com/oilshell/oil/pull/368#issuecomment-508510117:

I can't replicate the error for PS1=$'\xe2\x86\x92' on my machine. Here's my locale:
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
I noticed you have LANGUAGE=en_US, could you try changing that and seeing what happens? Currently we only change LC_CTYPE because that's what the linux man pages document, maybe it's different on MacOS.

The newline threw an assertion error for me as well.

andychu commented 5 years ago

Yeah it's possible we may need to take into account the prompt height in the future. But for now I think fixing the AssertionError is OK.

I would like to print a message rather than silently failing, but I'm not sure where to do it.

I still get the same error:

[osh] lisa ~/git/oilshell/oil$ export LANGUAGE=
[osh] lisa ~/git/oilshell/oil$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
[osh] lisa ~/git/oilshell/oil$ PS1=$'\n'

ls 
 .git/                 .gitignore            .mypy_cache/          .swp                  .travis.yml           .vagrant/             INSTALL.txt           LICENSE.txt           Makefile              NOTES-cpython.txt     
 NOTES.txt             Python-2.7.13/        README.md             Rplots.pdf            TODO-completion.txt   TODO.txt              Vagrantfile           __init__.py           __pycache__/          _bin/                 
 _build/               _chroot/              _deps/                _devbuild/            _release/             _tmp/                 array.patch           asdl.txt              asdl/                 benchmarks/           
 bin/                  build/                configure             core/                 demo/                 devtools/             dir/                  doc/                  edit.txt              fastlex.pyi           
 fastlex.so            file                  frontend/             gold/                 install               libc.pyi              libc.so               line_input.so         local.sh              logs-0.6.pre13.tar    
 metrics/              misc/                 mycpp/                native/               oil-version.txt       oil_lang/             opy/                  osh/                  ovm2/                 pgen2/                
 portable-rules.mk     posix_.pyi            posix_.so             pylib/                r.txt                 release-notes.txt     release-roadmap.txt   spec/                 test/                 testdata/             
 tmp.txt               tools/                types/                uftrace.data.old/     uftrace.data/         vendor/               web/                  
Traceback (most recent call last):
  File "/home/andy/git/oilshell/oil/core/comp_ui.py", line 89, in PrintCandidates
    self._PrintCandidates(*args)
  File "/home/andy/git/oilshell/oil/core/comp_ui.py", line 374, in _PrintCandidates
    self._ReturnToPrompt(num_lines+1)
  File "/home/andy/git/oilshell/oil/core/comp_ui.py", line 309, in _ReturnToPrompt
    assert last_prompt_len != -1
AssertionError

jyn514 commented 5 years ago

I meant for the unicode arrow, not for the newline.

andychu commented 5 years ago

Does it make a difference? either way we have to handle -1.

I must have made some kind of error minimizing the case. The full prompt in the stackoverflow case failed. But it looks like the cause was the \n and not the arrow.

jyn514 commented 5 years ago

How would you expect this to behave for a newline inside an escape sequence? For example PS1=$'abc\x01\033[0;31m\n\x02> ', should it be len('abc> ') or len('> ')? Or is it malformed and not worth testing?

andychu commented 5 years ago

I'm not sure... I try to come up with a strategy to test what bash does when I don't know.

But I think you can just do a mechanical change to fall back on len() if -1 is returned.

That is, follow the existing logic and remove everything inside \x01 \x02, and then calculate the width.

"monotonically increasing correctness" is OK... i.e. I would fix this bug first and then worry about multiline prompts, which may require larger changes to other parts of the code.

andychu commented 5 years ago

The important part is really testing. So when we refactor later, we don't break this.

In this case I would add a unit tests for the PromptLen function. It can return -1 now, but it never should, because that breaks an invariant in the rest of the program (caught by the assertion).

Prior to the unicode change, the function could never return -1.

jyn514 commented 5 years ago

Do you use property-based testing? It would be nice to have something like https://hypothesis.readthedocs.io/en/latest/quickstart.html that asserts that the function never returns -1 instead of coming up with test cases manually.

jyn514 commented 5 years ago

Something like this:

diff --git a/core/comp_ui_test.py b/core/comp_ui_test.py
index 7650f7ce..aa95a981 100755
--- a/core/comp_ui_test.py
+++ b/core/comp_ui_test.py
@@ -7,6 +7,8 @@ from __future__ import print_function
 import cStringIO
 import sys
 import unittest
+import hypothesis
+from hypothesis.strategies import text

 from core import comp_ui  # module under test
 from core import util
@@ -170,9 +172,9 @@ class PromptTest(unittest.TestCase):
     self.assertEqual(comp_ui._PromptLen("abc\ndef"), 3)
     self.assertEqual(comp_ui._PromptLen(""), 0)

-  def testControlCharacters(self):
-    self.assertEqual(comp_ui._PromptLen("\xef"), 1)
-    self.assertEqual(comp_ui._PromptLen("\x03\x05"), 2)
+  @hypothesis.given(text())
+  def testNeverPanics(self, s):
+    assert comp_ui._PromptLen(s) >= 0, repr(s)

 if __name__ == '__main__':
   unittest.main()
diff --git a/vendor/typing.py b/vendor/typing.py
index b0fdb1c2..a9a700bf 100644
--- a/vendor/typing.py
+++ b/vendor/typing.py
@@ -10,6 +10,9 @@
 # if TYPE_CHECKING:
 #   NullFunc = Callable[[int, int], int]

+TypingMeta = None
+TypeVar = None
+_ForwardRef = None
 List = None
 Tuple = None
 Optional = None

andychu commented 5 years ago

It's something I've been interested in, but haven't gotten around to trying.

Does it actually fail with the old code in this case? Like it manages to find the \n ? I guess it can hard-code those special strings and try it.

If so it might be nice to start trying it. We would have to sort ouf the PIP / Travis dependencies, which shouldn't be hard.

(Also if it takes a long time to run, I would put them in a different file so we can selectively run them. I like that the unit tests run pretty fast, and have no deps)

jyn514 commented 5 years ago

It runs pretty fast, but yes it introduces a dependency. It caught the -1, I think it came up with \x1f or something like that.

andychu commented 5 years ago

OK great, I would accept a PR to add it. Although I would still be inclined to put things in a different file to start with. When we gain confidence with it, we can move them into the normal test files.

Maybe one per dir, like osh/property_tests.py and core/property_tests.py ? The _tests.py suffix shouldn't get picked up by test/unit.sh, which is probably what we want at first.

Running it in Travis would be nice too but that could be a followup change.

The one caveat I have is that internal invariants are subject to change / refactoring -- hence my preference for spec tests over unit tests. But yes in this case there was a specific runtime invariant that the assert was protecting.

jyn514 commented 5 years ago

This just caught another error! PyArg_ParseTuple throws a TypeError if a string contains a null byte.

andychu commented 5 years ago

OK interesting. Yeah I have been wondering about this... I think this will happen in all the other C extension too.

bash silently truncates.

$ echo $'foo\x00bar'
foo

mksh seems not to though...

Python builtins raise an error:

>>> os.listdir('bin\0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: listdir() argument 1 must be encoded string without null bytes, not str

Not sure yet what we should do... great find though!!!

jyn514 commented 5 years ago

Hmm python seems to handle it fine as long as it doesn't go through a C extension:

(osh) $ echo $'foo\x00bar'
foobar
(osh) $ PS1=$'foo > \x00bar > '
foo >

andychu commented 5 years ago

Released with 0.7.pre1: https://www.oilshell.org/release/0.7.pre1/

oils-for-unix / oils

wcswidth() returning -1 not handled (and newline in PS1 not handled) #394