mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Word style that starts with a digit doesn't work in style_map.txt #110

Closed: peter-dobson-ds closed this 3 years ago

peter-dobson-ds commented 3 years ago

I need to convert a Word document that has a lot of styles with names like "10sp0", "10spHanging05", "10spHanging1", "10spHanging15", "10spLeftInd1", "10spLeftInd15", "15sp0", "15sp1", "15spHanging15".

I know that classes in HTML can't start with a digit, so I prefixed them all with a letter.

My style map includes:

p.10sp0 => p.a10sp0:fresh
p.10spHanging05 => p.a10spHanging05:fresh
p.10spHanging1 => p.a10spHanging1:fresh
p.10spHanging15 => p.a10spHanging15:fresh
p.10spLeftInd1 => p.a10spLeftInd1:fresh
p.10spLeftInd15 => p.a10spLeftInd15:fresh
p.15sp0 => p.a15sp0:fresh
p.15sp1 => p.a15sp1:fresh
p.15spHanging15 => p.a15spHanging15:fresh

Using this style map, mammoth returns a warning for every line in style_map.txt, reading:

Did not understand this style mapping, so ignored it: p.10sp0 => p.a10sp0:fresh

(Of course, the last part of the warning is different for each row.)
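For anyone reproducing this, here is a minimal sketch of how those warnings can be collected when calling mammoth from Python. It assumes the standard `mammoth.convert_to_html(..., style_map=...)` call and that each entry in `result.messages` has `type` and `message` attributes; the helper names `ignored_mappings` and `convert`, and the file paths passed in, are hypothetical.

```python
# Prefix of the warning quoted above; the rejected mapping follows it.
IGNORED_PREFIX = "Did not understand this style mapping, so ignored it: "


def ignored_mappings(messages):
    """Return the style map lines that were reported as ignored."""
    return [
        m.message[len(IGNORED_PREFIX):]
        for m in messages
        if m.type == "warning" and m.message.startswith(IGNORED_PREFIX)
    ]


def convert(docx_path, style_map_path):
    """Convert a .docx file and return (html, ignored style map lines)."""
    import mammoth  # imported lazily so ignored_mappings works without it

    with open(style_map_path, encoding="utf-8") as style_map_file:
        style_map = style_map_file.read()
    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)
    return result.value, ignored_mappings(result.messages)
```

An empty second element from `convert` would mean every line of the style map was at least parsed, which is a quick check before looking at the HTML itself.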

Here's the Word document and my style_map.txt file:

mwilliamson commented 3 years ago

Have you tried matching by style name instead of style ID?
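As a rough illustration of that suggestion, mammoth style maps can match on the style name with the `[style-name='...']` form instead of the `.StyleId` form. The sketch below only builds such a line as a string; the function name `style_name_rule` is hypothetical, and it deliberately refuses names containing a single quote or backslash rather than guessing at escaping rules.

```python
def style_name_rule(style_name, html_class, element="p"):
    """Build a style map line that matches a Word style by its name."""
    # Escaping inside the quoted name is not assumed here, so reject
    # characters that would need it instead of emitting a broken line.
    if "'" in style_name or "\\" in style_name:
        raise ValueError("style name needs escaping: %r" % style_name)
    return "{0}[style-name='{1}'] => {0}.{2}:fresh".format(
        element, style_name, html_class
    )
```

For example, `style_name_rule("Heading 1", "title")` gives `p[style-name='Heading 1'] => p.title:fresh`. Matching by name sidesteps the question of which characters a style ID may start with.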

peter-dobson-ds commented 3 years ago

I looked a little deeper.

I built my style map based on messages like:

'warning: Unrecognised paragraph style: _1.0sp 0" (Style ID: 10sp0)'

I used the value after "Style ID:" to create each row in the style map.

It turns out the style has a _ prefix, and if I use the style ID _10sp0 it works the way I want it to.
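In case it helps someone else, here is a small sketch of generating style map rows from those warnings while keeping the style ID exactly as it is stored, leading underscore included. The warning format is assumed from the message quoted above, the sample string used in the example is hypothetical (written with the underscore present), and the names `html_class_for` and `rule_from_warning` are my own.

```python
import re

# Assumed shape: "Unrecognised paragraph style: <name> (Style ID: <id>)"
WARNING_RE = re.compile(
    r"Unrecognised paragraph style: (?P<name>.*) \(Style ID: (?P<id>[^)]*)\)"
)


def html_class_for(style_id, prefix="a"):
    """Derive an HTML class: drop leading underscores, prefix leading digits."""
    cleaned = style_id.lstrip("_")
    if cleaned[:1].isdigit():
        return prefix + cleaned
    return cleaned


def rule_from_warning(warning):
    """Turn one warning into a style map row, or None if it does not match."""
    match = WARNING_RE.search(warning)
    if match is None:
        return None
    style_id = match.group("id")  # used verbatim on the left-hand side
    return "p.{0} => p.{1}:fresh".format(style_id, html_class_for(style_id))
```

With the hypothetical input `'Unrecognised paragraph style: _1.0sp 0" (Style ID: _10sp0)'`, `rule_from_warning` returns `p._10sp0 => p.a10sp0:fresh`, which is the form that worked for me.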