Retext on Windows can not save with error "'Charmap' codec can't encode character ..."

szpeter80 commented 2 years ago

The exact error message is: "'Charmap' codec can't encode character xxxx in position yyy: character maps to <undefined>".

How to reproduce:

1. Install Retext on Windows via 'pip install retext'.
2. Open a new document, enter 7-bit clean text, and save it as 'test.md'.
3. Enter a national character in the markdown source, then try to save.
4. The error message pops up, and the output file is truncated to 0 bytes (previous work is lost).

As far as I was able to track this down, the issue is specific to Windows' code page handling. The copy of Retext installed in the venv exhibits this behaviour whether it is launched from a cmd shell or by double-clicking retext.exe. The trigger was a single character, for example \u0151 (ő), which is not representable in the code page used by cmd (chcp reports 437; Windows was installed as English).

The issue (step 4) is not observed if the Windows single-byte code page can represent the accented characters that are entered. In this case, on a different machine, the install locale was "Hungarian" and the resulting code page was CP852 (weird, since this is an old MS-DOS code page and Windows used CP12xx back then). That code page has a code reserved for "ő", so Retext saves the document successfully (although in CP852 encoding).
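
The code page difference described above can be checked outside of Retext with Python's built-in codecs. This is only a minimal sketch: the helper name `can_encode` is made up for illustration, and it assumes that Python's `cp437` and `cp852` codecs match the Windows code pages reported by chcp.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        # Single-byte "charmap" codecs raise this for characters they lack.
        return False


sample = "\u0151"  # ő, LATIN SMALL LETTER O WITH DOUBLE ACUTE

# Compare the English default (437), the Hungarian one (852) and UTF-8.
for codec in ("cp437", "cp852", "utf-8"):
    print(codec, can_encode(sample, codec))
```

If the sketch reports that a character cannot be encoded in the active code page, saving with that code page would be expected to fail in the way described in the steps to reproduce.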

Workaround 1: install Retext globally (no venv). For some reason, this behaviour is not observed with a global install. The issue I see with this is that the global packages directory is specific to the Python app (Python installs as an app from the Microsoft Store) and contains the installed Python version in its name, so it might get deleted when the package updates, or might be left behind as leftover (and possibly broken) junk.

Workaround 2: start Retext from a batch file and issue the chcp 65001 command before invoking retext.exe. 65001 is the code page identifier for UTF-8, and this seems to solve the unrepresentable character issue. Beware: if the markdown source was created earlier, it might be in an ANSI (1-byte) encoding and needs to be checked and converted to UTF-8 (e.g. via Notepad++).
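
For anyone who prefers a script to Notepad++, the check-and-convert step could look roughly like the sketch below. The function names are hypothetical, and the source encoding (for example "cp852" or "cp1250") is an assumption the user has to supply: a wrong guess can decode without an error and still produce the wrong characters, so keep a backup of the original file.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(path: Path, source_encoding: str) -> bool:
    """Re-encode `path` as UTF-8 in place; return True if it was rewritten."""
    data = path.read_bytes()
    if is_valid_utf8(data):
        return False  # already UTF-8, nothing to do
    # Decode with the assumed legacy encoding, then encode in memory
    # before the file is opened for writing.
    text = data.decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
    return True
```

Calling `convert_to_utf8(Path("test.md"), "cp852")` on a file that is already valid UTF-8 leaves it untouched and returns False.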

mitya57 commented 1 year ago

I think we should always use UTF-8 by default. 1-byte regional encodings are so outdated in 2022, and UTF-8 is the default on Linux anyway.

It shouldn't be a problem for existing documents. ReText uses chardet, so existing documents will be opened/saved with whatever encoding they have, provided that it's detected correctly.

What do you think?

Also, I will fix truncating the file to 0 bytes when the current encoding does not support some characters.
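
One common way to avoid the 0-byte file is to encode the text in memory first and only then replace the target file. The sketch below illustrates that general pattern; it is not ReText's actual fix, and `safe_save` is a hypothetical helper, not a function from the ReText code base.

```python
import os
import tempfile


def safe_save(path: str, text: str, encoding: str = "utf-8") -> None:
    """Write `text` to `path` without truncating it if encoding fails."""
    # Encode first: an unrepresentable character raises UnicodeEncodeError
    # here, before the existing file has been touched.
    data = text.encode(encoding)

    # Write to a temporary file in the same directory, then swap it in.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)  # overwrites the target, also on Windows
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this approach a failed save still raises an error that the editor has to report to the user, but the previously saved content stays on disk.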

szpeter80 commented 1 year ago

As far as I can tell, the Windows installer of Python tries to take care of the "chcp 65001" by including it in its wrapper script, but for some reason it is not effective all the time. It's not Retext's job to fix a Windows/Python compatibility problem.

If you can fix the file truncation problem, that would prevent users from unknowingly shooting themselves in the foot. Thanks!

retext-project / retext

Retext on Windows can not save with error "'Charmap' codec can't encode character ..." #599