Converter chokes on some UTF-8 characters

PhiLhoSoft commented 4 years ago

I tried to convert my medium-sized Simplenote backup file (96 entries, 430 KB) using Python 3.8.5 on Windows 10, but the decoder choked on some characters, probably from supplementary planes (beyond the BMP), with messages like:

Traceback (most recent call last):
  File "simplenote2enex.py", line 305, in <module>
    main(args)
  File "simplenote2enex.py", line 280, in main
    enex_file = sne.process_file()
  File "simplenote2enex.py", line 202, in process_file
    simplenotes = json.load(jfp)
  File "C:\Languages\Python38\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Languages\Python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 84101: character maps to <undefined>
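
For what it's worth, this kind of failure can be reproduced without the converter. A minimal sketch, assuming the input is valid UTF-8 that is being decoded as cp1252; it uses U+1D46C (the 𝑬 character quoted further down) as the sample, and the helper name decodes_as is made up for illustration:

```python
def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if raw can be decoded with the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# U+1D46C (MATHEMATICAL BOLD ITALIC CAPITAL E), written as an escape
# so the snippet does not depend on the source file's encoding.
raw = "\U0001D46C".encode("utf-8")

print(raw.hex())                   # f09d91ac
print(decodes_as(raw, "utf-8"))    # the bytes are valid UTF-8
print(decodes_as(raw, "cp1252"))   # cp1252 assigns no character to byte 0x9d
```

So the file itself is probably fine; the bytes are just being read with the Windows default code page instead of UTF-8.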

I started replacing these characters by hand: locating the position in a hex editor, finding the matching text in a text editor that correctly displays emoji like 😁 (some others are OK), Japanese characters like 育美強針 or special characters like 𝑬𝑵 or 𝑭𝑹, and changing them. But it was slow…

So I checked the json.load documentation, which said it uses UTF-8 by default, contradicting the cp1252 frames in the traceback above… I then searched for issues about Python choking on these characters and found a trick, which I adapted into the following line: with codecs.open(self.json_file, 'r', 'utf-8') as jfp: (instead of with open(self.json) as jfp:). This needs import codecs at the start of the file.
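
In Python 3 the built-in open() accepts an encoding argument, so the same effect should be achievable without the codecs module. A minimal sketch, not the script's actual code: the function name load_notes and the sample data (including the "activeNotes" key) are assumptions made up for illustration.

```python
import json
import os
import tempfile


def load_notes(path):
    """Load a JSON export, decoding it as UTF-8 whatever the OS locale is."""
    with open(path, encoding="utf-8") as jfp:
        return json.load(jfp)


# Round-trip a tiny stand-in for an export file (contents are made up).
sample = {"activeNotes": [{"content": "Agenda \U0001F4C5 \u80b2\u7f8e"}]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.json")
    with open(path, "w", encoding="utf-8") as out:
        json.dump(sample, out, ensure_ascii=False)
    loaded = load_notes(path)

print(loaded == sample)
```

Without the encoding argument, open() falls back to the locale's preferred encoding, which would explain the cp1252 frames in the traceback on Windows.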

It worked, but I had a different error:

Processing file: notes.json
Notes author:  Philippe Lhoste
Active notes:   96
Converted 96 notes
Traceback (most recent call last):
  File "simplenote2enex.py", line 306, in <module>
    main(args)
  File "simplenote2enex.py", line 282, in main
    print(enex_file)
  File "C:\Languages\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4c5' in position 7039: character maps to <undefined>

Fortunately, I had found another solution earlier, which re-encodes the standard output:

import sys
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="UTF-8")

(just after the imports)
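
To see what that wrapper does without touching the real console, here is a small sketch that applies the same io.TextIOWrapper call to an in-memory buffer standing in for sys.stdout.buffer. The helper name utf8_text_stream is made up for illustration.

```python
import io


def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


# An in-memory buffer plays the role of sys.stdout.buffer here.
buffer = io.BytesIO()
out = utf8_text_stream(buffer)
print("Due \U0001F4C5", file=out)   # U+1F4C5 is the character from the traceback
out.flush()
encoded = buffer.getvalue()

print(encoded)
```

In the script itself this corresponds to sys.stdout = utf8_text_stream(sys.stdout.buffer). On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") should have the same effect without replacing the stream object, and running Python with -X utf8 (or setting PYTHONUTF8=1) is another option that needs no change to the script.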

And it worked! I don't know if this is the best way to solve the issue (I am not a Python coder), but it worked for me.

HTH.

rpgd60 commented 4 years ago

Hi Philippe, many thanks. It does help. I will try to address the error, most likely using your fix :-) Cheers, Rafa

PhiLhoSoft commented 4 years ago

No problem using my fix, you are welcome.

Other remarks, noticed after importing:

The XML is better written as <en-note>{enex_content}</en-note>; otherwise the indentation at the start of the content causes the first line to be rendered in fixed-width font.
I noticed a typo: 'thrashedNotes' instead of 'trashedNotes'…
The generated file has CR CR LF line endings. I am still looking into how to fix this one. It doesn't seem to affect the import, but it isn't very clean.
Double CR LF (empty lines) are ignored by Joplin; I am not sure whether this is an issue with the exported format or with Joplin itself. I thought it was related to the previous remark, but it is not.
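
About the CR CR LF endings: one plausible cause, which is only a guess and not verified against the script, is that the note text already contains CR LF and the text-mode output on Windows then turns each LF into CR LF again. The sketch below simulates that translation on any platform; both helper names and the sample note are made up for illustration.

```python
import io


def normalize_newlines(text):
    """Collapse CR LF and lone CR to LF so a text-mode stream adds CR only once."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def write_windows_style(text):
    """Encode text the way a Windows text-mode stream would (LF becomes CR LF)."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="utf-8", newline="\r\n")
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


note = "line one\r\nline two\n"   # made-up content with an embedded CR LF

print(write_windows_style(note))                      # b'line one\r\r\nline two\r\n'
print(write_windows_style(normalize_newlines(note)))  # b'line one\r\nline two\r\n'
```

If that is indeed the cause, normalizing the note content before printing, or writing the output with newline translation disabled, should give clean CR LF or LF endings.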

PhiLhoSoft commented 4 years ago

Update: I reported the last issue to Joplin: https://github.com/laurent22/joplin/issues/3578 I have a workaround. Here is the file I modified for my needs. simplenote2enex.py.txt

I am not sure about the CR CR LF issue, but it is only cosmetic, so I will leave it as is.

Oh, and thank you for this useful tool! I would hesitate to change otherwise.

rpgd60 commented 4 years ago

Hi Philippe, can you send me a simple JSON file with some of the characters that caused the choking?
For starters, I have added to the repository a file test.issue.001.json with the Japanese and special characters you mentioned above. You may want to use that as a base. Many thanks

rpgd60 commented 4 years ago

Hi Philippe, I ran simplenote2enex.py (same as in the repository) against the file test.issue.001.json and it converted it OK. I have pasted the output below and attached the XML output as a file. My environment: Kubuntu 19.10, Python 3.7.5. I will try on a Windows system later.

python simplenote2enex.py --json-file ./test.issue.001.json --author Rafa --create-title --verbose-level 1 > /tmp/output.txt
Processing file: ./test.issue.001.json 
Notes author:  Rafa
Active notes:   3
Trashed notes:  0 -- will not be converted to ENEX
Converted 3 notes

output.enex.xml.txt

veonne commented 4 years ago

Hi @rpgd60,

FYI, I have a considerable number of notes in Japanese. So far I haven't encountered an error when processing some of them through this tool (latest version as of writing this comment). In case an error pops up, I'll try to let you know.

My environment is MacOS Catalina v10.15.6, Python 3.8.5.

Thanks.

rpgd60 commented 4 years ago

Many thanks for the feedback. Glad to hear that.

PhiLhoSoft commented 4 years ago

Sorry for the late answer, I was on vacation lately. I will try to make a simplified JSON file with some problematic characters. Note that the emoji you pasted is a reference to an image, not the U+1F601 character. And maybe the issue is with the Windows command-line terminal, which is not the best when it comes to Unicode handling, since your script works fine on Linux and macOS. I believe the patched script I attached above (did you look at it?) should work everywhere, fixing the Windows issue (I hope).

rpgd60 commented 4 years ago

Hi Philippe, thank you very much for the input. I'd really appreciate it if you could send me a representative JSON file for testing. I'd rather not implement the workaround until I can test it against a failing case. Cheers

PhiLhoSoft commented 4 years ago

Yes, sorry, I still have to do that… But I guess the problem is Windows-specific. I hope you can test in this environment.

PhiLhoSoft commented 4 years ago

OK, I did it. I extracted some notes (removing content to make them shorter) with some significant samples: one with a bunch of emoticons, one with the special characters, one with a table (that was correctly imported in Joplin) and one with Japanese characters. Most text is in French, but it is not relevant anyway.

SimplenotesExportSample.json.txt

With current code:

> py simplenote2enex.py --json-file SimplenotesExportSample.json --author 'PhiLhoSoft' --create-title --verbose-level 1 > test1.enex

Traceback (most recent call last):
  File "simplenote2enex.py", line 367, in <module>
    main(args)
  File "simplenote2enex.py", line 340, in main
    enex_file = sne.process_file()
  File "simplenote2enex.py", line 262, in process_file
    simplenotes = json.load(jfp)
  File "C:\Languages\Python38\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Languages\Python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 212: character maps to <undefined>

With my fixes:

> py simplenote2enex-pl.py --json-file SimplenotesExportSample.json --author 'PhiLhoSoft' --create-title --verbose-level 1 > test1.enex
Processing file: SimplenotesExportSample.json
Notes author:  'PhiLhoSoft'
Active notes:   4
Trashed notes:  0 -- will not be converted to ENEX
Converted 4 notes

rpgd60 commented 3 years ago

Issue not addressed / fixed. See the first section of the README.md file for additional background.

rpgd60 / simplenote2joplin