mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Unicode Error #25

Closed aschilling closed 6 years ago

aschilling commented 7 years ago

Hi,

first of all, congratulations on mammoth. It is really a great tool. Unfortunately, when I run mammoth on my document I get the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 42056: character maps to <undefined>

Do you have any idea what the issue could be and how I could fix it? I run mammoth on Windows 10.

Update: In particular, the issue occurs when the document uses the "Wingdings" font with the "§" character.

Moreover, I noticed that symbols such as arrows are not exported correctly. For these I get the error: An unrecognised element was ignored: w:sym

mwilliamson commented 7 years ago

Thanks for the kind words. To help work out the problem, could you provide:

On Mon, 16 Jan 2017 01:36:41 -0800 Andreas Schilling notifications@github.com wrote:

mwilliamson commented 7 years ago

Did you manage to solve your issue?

GitBruno commented 7 years ago

OK let's close or action this issue!

First, it seems that the encoding error doesn't come from Mammoth itself but from the encoding the console is using: the error is raised when the output contains a character that the console's code page can't represent. So the fix is to run this command (on Windows):

chcp 65001

This sets the console encoding to UTF-8; then run Mammoth again. Or, if you are working in PyCharm, go to Settings > Editor > File Encodings and set the IDE and Project encodings to UTF-8.

Source
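If changing the console code page is not an option, another way around the error is to skip the console and write the HTML to a file with an explicit encoding. Below is a minimal sketch, assuming the failure comes from encoding '\u2192' with a legacy Windows code page such as cp1252. `save_html` is a made-up helper name, and `html` is just a stand-in for the string Mammoth returns:

```python
import io
import os
import tempfile

def save_html(html, path):
    """Write HTML text to `path` as UTF-8, regardless of the console code page."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(html)

html = u"<p>A \u2192 B</p>"  # stand-in for Mammoth's output (assumption)

# A legacy code page cannot represent U+2192, which mirrors the reported error.
try:
    html.encode("cp1252")
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False

# UTF-8 can represent every Unicode character, so this write succeeds.
out_path = os.path.join(tempfile.mkdtemp(), "output.html")
save_html(html, out_path)
```

Setting the environment variable PYTHONIOENCODING to utf-8 before starting Python is another commonly used way to change the encoding of standard output, though I have not tested it against this particular document.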

Now the issue of the symbols, which are not recognised by Mammoth.

Symbols are specified with the w:sym element within the w:r element. A symbol is a special character that does not use any of the run fonts specified in rFonts or in the style hierarchy. The character is determined by pulling the hexadecimal value specified in the char attribute from the font specified in the font attribute. The char attribute specifies the hexadecimal code for the Unicode character value of the symbol. The value can be stored in either of the following formats:

Only Unicode characters are officially supported in HTML, and only those should be used: not all browsers will have fonts such as Wingdings, and handling such fonts is outside the scope of Mammoth.
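To make the w:sym description concrete, here is a rough sketch of reading the font and char attributes from a run using only the standard library. This is not how Mammoth does it; the sample XML is hand-written, its font and char values are illustrative, and `read_symbols` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written sample run; the font and char values are illustrative.
SAMPLE_RUN = (
    '<w:r xmlns:w="%s">'
    '<w:sym w:font="Wingdings" w:char="F0E0"/>'
    '</w:r>' % W_NS
)

def read_symbols(run_xml):
    """Return (font, code) pairs for each w:sym in a run, with code as an int."""
    run = ET.fromstring(run_xml)
    symbols = []
    for sym in run.iter("{%s}sym" % W_NS):
        font = sym.get("{%s}font" % W_NS)
        char = sym.get("{%s}char" % W_NS)
        # The char attribute is a hexadecimal string, so parse it in base 16.
        symbols.append((font, int(char, 16)))
    return symbols

symbols = read_symbols(SAMPLE_RUN)
```

The numeric code only means something together with the font name, which is why both are returned as a pair.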

It sounds like we can do one of two things.

  1. Keep Mammoth as is: ignore w:sym (Close issue)
  2. Create a dictionary to convert the Wingdings characters to Unicode equivalents as well as we can. (Hairy! Not recommended.) See list here
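
For what it's worth, option 2 could look roughly like the sketch below. The table is tiny, and its entries are my own approximations rather than a verified mapping. `WINGDINGS_TO_UNICODE` and `symbol_to_text` are hypothetical names, and the handling of values in the F000 range is an assumption about how the char attribute may be stored:

```python
# Approximate, incomplete table: Wingdings code -> closest Unicode text.
# These entries are illustrative guesses, not a verified mapping.
WINGDINGS_TO_UNICODE = {
    0x4A: u"\u263A",  # smiling face
    0x4C: u"\u2639",  # frowning face
    0xDF: u"\u2190",  # left arrow (approximate)
    0xE0: u"\u2192",  # right arrow (approximate)
}

def symbol_to_text(font, code, fallback=u""):
    """Map a (font, code) pair from w:sym to Unicode text, or return fallback."""
    if font.lower() != "wingdings":
        return fallback
    # Assumption: codes in 0xF000-0xF0FF carry the font's code in the low byte.
    if 0xF000 <= code <= 0xF0FF:
        code -= 0xF000
    return WINGDINGS_TO_UNICODE.get(code, fallback)
```

Anything missing from the table falls through to the fallback, which shows why this gets hairy: every symbol font would need its own hand-maintained table.
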

mwilliamson commented 6 years ago

Closing since I don't think there's anything further to investigate without more details of the unicode error.