insanum / sncli

Simplenote CLI
MIT License
396 stars 34 forks source link

Viewer encoder crash while parsing "smart quote" and "smart dash" unicode characters #25

Closed casutherland closed 8 years ago

casutherland commented 8 years ago

Do we need to spread the new unicode goodness further throughout the code base? If so, I could help dig into this later. Not urgent from my p.o.v.

Just opening this as a new issue as my comment in #23 seems to be unresolved so far by the latest python3 commits, as apparently this "Viewer open with unicode character" case is different than the #23 "Search with unicode characters" case.

Issue summary: The viewer crashes when parsing lines with "smart quote" open/close unicode quote character (e.g., \u2014), which is a common occurrence in any corpus originally sourced from a rich text editor.

UnicodeEncodeError: 'ascii' codec can't encode character '\u2014'

...
  File "/usr/lib/python3/dist-packages/urwid/canvas.py", line 1291, in apply_text_layout
    text[s.offs:s.end])
  File "/usr/lib/python3/dist-packages/urwid/util.py", line 121, in apply_target_encoding
    s = s.encode( _target_encoding )
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 17: ordinal not in range(128)
casutherland commented 8 years ago

Complete traceback...

Traceback (most recent call last):
  File "./sncli", line 33, in 
    sncli.main(sys.argv[1:])
  File "/home/<>/Workspaces/sncli/simplenote_cli/sncli.py", line 1206, in main
    sncli(sync, verbose, config).gui(key)
  File "/home/<>/Workspaces/sncli/simplenote_cli/sncli.py", line 1018, in gui
    self.sncli_loop.run()
  File "/usr/lib/python3/dist-packages/urwid/main_loop.py", line 272, in run
    self.screen.run_wrapper(self._run)
  File "/usr/lib/python3/dist-packages/urwid/raw_display.py", line 242, in run_wrapper
    return fn()
  File "/usr/lib/python3/dist-packages/urwid/main_loop.py", line 337, in _run
    self.event_loop.run()
  File "/usr/lib/python3/dist-packages/urwid/main_loop.py", line 708, in run
    self._loop()
  File "/usr/lib/python3/dist-packages/urwid/main_loop.py", line 787, in _loop
    self._watch_files[fd]()
  File "/usr/lib/python3/dist-packages/urwid/main_loop.py", line 388, in _update
    self.process_input(keys)
  File "/usr/lib/python3/dist-packages/urwid/main_loop.py", line 488, in process_input
    k = self._topmost_widget.keypress(self.screen_size, k)
  File "/home/<>/Workspaces/sncli/simplenote_cli/sncli.py", line 669, in gui_frame_keypress
    self.gui_switch_frame_body(self.view_note)
  File "/home/<>/Workspaces/sncli/simplenote_cli/sncli.py", line 296, in gui_switch_frame_body
    self.gui_body_set(new_view)
  File "/home/<>/Workspaces/sncli/simplenote_cli/sncli.py", line 200, in gui_body_set
    self.gui_update_status_bar()
  File "/home/<>/Workspaces/sncli/simplenote_cli/sncli.py", line 283, in gui_update_status_bar
    self.gui_header_set(self.gui_body_get().get_status_bar())
  File "/home/<>/Workspaces/sncli/simplenote_cli/sncli.py", line 152, in gui_header_set
    self.sncli_loop.draw_screen()
  File "/usr/lib/python3/dist-packages/urwid/main_loop.py", line 563, in draw_screen
    canvas = self._topmost_widget.render(self.screen_size, focus=True)
  File "/usr/lib/python3/dist-packages/urwid/widget.py", line 141, in cached_render
    canv = fn(self, size, focus=focus)
  File "/usr/lib/python3/dist-packages/urwid/container.py", line 1058, in render
    focus and self.focus_part == 'body')
  File "/usr/lib/python3/dist-packages/urwid/widget.py", line 141, in cached_render
    canv = fn(self, size, focus=focus)
  File "/usr/lib/python3/dist-packages/urwid/listbox.py", line 485, in render
    canvas = widget.render((maxcol,))
  File "/usr/lib/python3/dist-packages/urwid/widget.py", line 141, in cached_render
    canv = fn(self, size, focus=focus)
  File "/usr/lib/python3/dist-packages/urwid/decoration.py", line 225, in render
    canv = self._original_widget.render(size, focus=focus)
  File "/usr/lib/python3/dist-packages/urwid/widget.py", line 141, in cached_render
    canv = fn(self, size, focus=focus)
  File "/usr/lib/python3/dist-packages/urwid/widget.py", line 1007, in render
    return apply_text_layout(text, attr, trans, maxcol)
  File "/usr/lib/python3/dist-packages/urwid/canvas.py", line 1291, in apply_text_layout
    text[s.offs:s.end])
  File "/usr/lib/python3/dist-packages/urwid/util.py", line 121, in apply_target_encoding
    s = s.encode( _target_encoding )
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 17: ordinal not in range(128)
samuelallan72 commented 8 years ago

I can't reproduce that with the latest version of sncli... (I can reproduce with a version before the unicode updates - like commit d0e75dd).

By the way - looking at the traceback: '\u2014' is actually an em dash (), and the error is raised within urwid, so not sure what's going on there.

What version of urwid do you have installed? (and just checking - are you getting this error with the latest sncli git version?)

I'm hoping that the updates to string handling will fix all potential encoding errors, but yeah there definitely could be places where it's been missed. :)

casutherland commented 8 years ago

Yeah, I was thinking it must be in the urwid library based on the traceback.

Perhaps as you suggest, it is an outdated library. I am using the .deb python3-urwid package 1.1.1-1build2 [1] and looks like python3 urwid 1.3.1 is stable, upstream. That urwid 1.1.1 deb package was just installed earlier this week, after an apt-update, on ubuntu trusty (14.04.4 LTS). Another potential variable is the ARM / armhf architecture, on a Chromebook, under chroot / crouton.

I will try to pull-down urwid 1.3 somehow to confirm. I assume that will resolve the issue, so we can probably close this issue if you think it is specific to urwid 1.1.

But of course (and worth confirming : ), I had indeed just pulled down the latest insanum/sncli commit, after the "100 note" loop fix to #24 was pushed. (Which I confirmed did resolve that issue, #24. Thanks!).

[1] python3-urwid package 1.1.1-1build2

sudo apt show python3-urwid
Package: python3-urwid
Priority: optional
Section: universe/python
Installed-Size: 893 kB
Maintainer: Ubuntu Developers 
Original-Maintainer: Debian Python Modules Team 
Source: urwid
Version: 1.1.1-1build2
Depends: python3 (>= 3.4~), python3 (<< 3.5), libc6 (>= 2.4)
Download-Size: 151 kB
Homepage: http://excess.org/urwid/
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Origin: Ubuntu
APT-Manual-Installed: yes
APT-Sources: http://ports.ubuntu.com/ubuntu-ports/ trusty/universe armhf Packages
Description: curses-based UI/widget library for Python 3...
casutherland commented 8 years ago

I installed urwid-1.3.1 using pip3 with the new requirements.txt, and that did partially resolve the viewer issue with certain unicode characters ("smart quotes" and "smart dashes/hyphens").

So, with the update from urwid-1.1 to 1.3, the viewer now opens notes containing certain unicode characters.

However there is a different crash when returning from the editor to the viewer. Even when the content is unchanged in the editor, sncli crashes before the viewer re-opens. The new traceback is below.

Note that the crash is the same with the latest commit this evening (a1ab124) as with the commit from 3 days ago.

Traceback (most recent call last):
  File "./sncli", line 33, in 
    sncli.main(sys.argv[1:])
  File "/home/<>/Workspaces/sncli/simplenote_cli/sncli.py", line 1202, in main
    sncli(sync, verbose, config).gui(key)
  File "/home/<>/Workspaces/sncli/simplenote_cli/sncli.py", line 1014, in gui
    self.sncli_loop.run()
  File "/usr/local/lib/python3.4/dist-packages/urwid/main_loop.py", line 278, in run
    self._run()
  File "/usr/local/lib/python3.4/dist-packages/urwid/main_loop.py", line 376, in _run
    self.event_loop.run()
  File "/usr/local/lib/python3.4/dist-packages/urwid/main_loop.py", line 682, in run
    self._loop()
  File "/usr/local/lib/python3.4/dist-packages/urwid/main_loop.py", line 719, in _loop
    self._watch_files[fd]()
  File "/usr/local/lib/python3.4/dist-packages/urwid/raw_display.py", line 393, in 
    event_loop, callback, self.get_available_raw_input())
  File "/usr/local/lib/python3.4/dist-packages/urwid/raw_display.py", line 493, in parse_input
    callback(processed, processed_codes)
  File "/usr/local/lib/python3.4/dist-packages/urwid/main_loop.py", line 403, in _update
    self.process_input(keys)
  File "/usr/local/lib/python3.4/dist-packages/urwid/main_loop.py", line 503, in process_input
    k = self._topmost_widget.keypress(self.screen_size, k)
  File "/home/<>/Workspaces/sncli/simplenote_cli/sncli.py", line 636, in gui_frame_keypress
    content = self.exec_cmd_on_note(note)
  File "/home/<>/Workspaces/sncli/simplenote_cli/sncli.py", line 104, in exec_cmd_on_note
    content = ''.join(temp.tempfile_content(tf))
  File "/home/<>/Workspaces/sncli/simplenote_cli/temp.py", line 37, in tempfile_content
    updated_tf_contents = open(tf.name, 'r').read()
  File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1881: ordinal not in range(128)
samuelallan72 commented 8 years ago

It seems to be a problem with reading the temporary file - I'm guessing as it's being written in binary mode (ie with bytes), so maybe it should be opening for reading in binary mode as well...? I'll look into it, but it's difficult because I can't reproduce this crash on my machine either. (I've tested with python 3.4.4 and 3.5.1, with latest pip packages for urwid and requests.)

samuelallan72 commented 8 years ago

We could probably remove the tempfile_content 'hack' altogether and do something like this (which would probably fix your current error):

diff --git a/simplenote_cli/sncli.py b/simplenote_cli/sncli.py
index 9c56fd6..29252d1 100644
--- a/simplenote_cli/sncli.py
+++ b/simplenote_cli/sncli.py
@@ -101,7 +101,9 @@ class sncli:

         content = None
         if not raw:
-            content = ''.join(temp.tempfile_content(tf))
+            tf.seek(0)
+            content = tf.read().decode('utf-8')
+            # content = ''.join(temp.tempfile_content(tf))
             if not content or content == '\n':
                 content = None

There's still the potential problem mentioned as a comment: "When editing with Gedit, tf file contents weren't getting [updated?]". Whether that's going to be fixed in python3 or not, I don't know.

casutherland commented 8 years ago

I'll look into it, but it's difficult because I can't reproduce this crash on my machine either. (I've tested with python 3.4.4 and 3.5.1, with latest pip packages for urwid and requests.)

I'm on python3 (debian trusty package version: 3.4.0-0ubuntu2) with urwid-1.3.1 installed via pip3.

FWIW, below is a sample note JSON with the offending unicode chars.

Again, at this point the only "unicode crash" that I am experiencing is when returning from "editor" to "viewer" as described in the preceding comment.

$ cat ~/.sncli/e22d8c672adc4e4096db0c9ea6538c5c.json 
{
  "tags": [],
  "version": 8,
  "minversion": 1,
  "syncnum": 8,
  "key": "e22d8c672adc4e4096db0c9ea6538c5c",
  "systemtags": [],
  "modifydate": "1464701314.654185",
  "createdate": "1464700734.462569",
  "deleted": 0,
  "content": "test unicode\n\nSomeplace \u2014 April 1, 2042 \u2014 An example of long dash unicode characters.\n\n\u201cSome text surrounded  by smart quotes.\u201d\n",
  "syncdate": 1464701885.8302944
}
casutherland commented 8 years ago

We could probably remove the tempfile_content 'hack' altogether and do something like this (which would probably fix your current error)

Thanks, @swalladge. Just confirming that the example patch above to sncli.py does indeed resolve my current error.

I tested the patch locally and I am no longer seeing a crash when viewing or editing notes containing unicode characters.

I will attempt to test with Gedit as an editor, to see if there are side-effects from removing the tempfile_content(tf) hack.

There's still the potential problem mentioned as a comment [in temp.py]... Whether that's going to be fixed in python3 or not, I don't know.

# This seems like a hack. When editing with Gedit, tf file contents weren't getting 
# updated in memory, even though it successfully saved on disk.

ref: https://github.com/insanum/sncli/commit/afafc4969aea2b896c7e3bd54f6c08f42f072052

samuelallan72 commented 8 years ago

Great! Ok, so I did some testing myself, (don't know why I didn't before) and yes Gedit still produces the problem... Did some searching around and it seems it's a known problem (and doesn't just affect gedit). Ref this stackoverflow answer.

An alternative that might still fix your problem would be to re-open the file in binary mode and decode to utf-8 from there. See if the following will help:

diff --git a/simplenote_cli/sncli.py b/simplenote_cli/sncli.py
index 9c56fd6..4926155 100644
--- a/simplenote_cli/sncli.py
+++ b/simplenote_cli/sncli.py
@@ -101,7 +101,7 @@ class sncli:

         content = None
         if not raw:
-            content = ''.join(temp.tempfile_content(tf))
+            content = temp.tempfile_content(tf)
             if not content or content == '\n':
                 content = None

diff --git a/simplenote_cli/temp.py b/simplenote_cli/temp.py
index ec9d428..a469eb4 100644
--- a/simplenote_cli/temp.py
+++ b/simplenote_cli/temp.py
@@ -26,6 +26,7 @@ def tempfile_create(note, raw=False):

 def tempfile_delete(tf):
     if tf:
+        tf.close()
         os.unlink(tf.name)

 def tempfile_name(tf):
@@ -34,7 +35,13 @@ def tempfile_name(tf):
     return ''

 def tempfile_content(tf):
-    # This seems like a hack. When editing with Gedit, tf file contents weren't getting 
-    updated_tf_contents = open(tf.name, 'r').read()
-    tf.write(updated_tf_contents.encode('utf-8'))
-    return updated_tf_contents
+    # This 'hack' is needed because some editors use an intermediate temporary
+    # file, and rename it to that of the correct file, overwriting it. This
+    # means that the tf file handle won't be updated with the new contents, and
+    # the tempfile must be re-opened and read
+    if not tf:
+        return None
+
+    with open(tf.name, 'rb') as f:
+        updated_tf_contents = f.read()
+        return updated_tf_contents.decode('utf-8')
casutherland commented 8 years ago

See if the following will help:

Confirmed. This latest patch does help. Thanks @swalladge!

I tested this patch while opening the viewer from the index as well as while opening the viewer after opening the unicode note in the editor (vim), both with and without editing the content.

samuelallan72 commented 8 years ago

@qcu great! Fix pushed - that should (hopefully) be the last of the unicode encoding trouble.