mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Match explicitly unbolded text? #112

Closed deltamacht closed 2 years ago

deltamacht commented 2 years ago

I'm working with a document that uses a bolded style, but the user then explicitly unbolded some of the text. Because of this, when I try to understand the style and build a style map, I get a mapping that bolds the text in question even though it shouldn't be bold.

I know

b => strong

maps explicitly bolded text. But is there an option for explicitly unbolded text?

deltamacht commented 2 years ago

Looking through the mammoth source code, I'm getting the sense that this might not be easy. It seems that if text is explicitly unbolded, then the run's is_bold value is going to be False, and there's no distinguishing between runs with no <w:b> element and those that have a <w:b> element with w:val="0". I could modify mammoth to do something if is_bold is not True, but that would require a lot of cleanup downstream for me. Please let me know if there's a better way that I'm not seeing.
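To make the distinction concrete, here is a minimal sketch of reading the bold setting of a single run as a tri-state value, using only the standard library. The function name read_bold_tristate and the sample XML strings are made up for illustration; this is not how mammoth reads the document, just what a tri-state read could look like.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements such as <w:r>, <w:rPr>, <w:b>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_bold_tristate(run_xml):
    """Return True, False, or None for the bold setting of one <w:r> run.

    None  -> no <w:b> element, so bold is inherited from the style
    True  -> <w:b/> or <w:b w:val="1"/> (also "true" / "on")
    False -> <w:b w:val="0"/> (also "false" / "off"), i.e. explicitly unbolded
    """
    run = ET.fromstring(run_xml)
    bold = run.find(f"{{{W_NS}}}rPr/{{{W_NS}}}b")
    if bold is None:
        return None
    val = bold.get(f"{{{W_NS}}}val")
    if val is None:
        # A bare <w:b/> means "on".
        return True
    return val.lower() not in ("0", "false", "off")


# Hypothetical sample runs, differing only in their <w:b> element.
RUN = '<w:r xmlns:w="' + W_NS + '"><w:rPr>{bold}</w:rPr><w:t>text</w:t></w:r>'
explicit_bold = RUN.format(bold="<w:b/>")
explicit_unbold = RUN.format(bold='<w:b w:val="0"/>')
inherited = RUN.format(bold="")

print(read_bold_tristate(explicit_bold))
print(read_bold_tristate(explicit_unbold))
print(read_bold_tristate(inherited))
```

A plain boolean collapses the last two cases into one, which is exactly the information that gets lost.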

deltamacht commented 2 years ago

I'm going to close this because I don't think it's supported in the current code. I think you need a tri-state value to indicate styles which are explicitly turned off. Fortunately, the python-docx package stores this information, and I was able to parse the document in parallel with it to identify and correct the HTML created by Mammoth when this situation occurs.
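For anyone hitting the same problem, a rough sketch of the python-docx side of that workaround might look like the following. It relies on python-docx exposing run.bold as True, False, or None (None meaning inherited). The helper names are hypothetical, and the step that matches these runs back to Mammoth's HTML is not shown, since it depends on the document.

```python
def find_explicitly_unbolded(document):
    """Collect (paragraph_index, run_text) for runs with bold explicitly off.

    `document` is expected to look like a python-docx Document: it has
    .paragraphs, each paragraph has .runs, and each run has .bold and .text.
    Note that .paragraphs only covers body paragraphs, not text in tables.
    """
    hits = []
    for p_index, paragraph in enumerate(document.paragraphs):
        for run in paragraph.runs:
            # `is False` matters here: None means "inherit from the style".
            if run.bold is False and run.text:
                hits.append((p_index, run.text))
    return hits


def find_explicitly_unbolded_in_file(path):
    """Hypothetical convenience wrapper; requires python-docx to be installed."""
    import docx  # provided by the python-docx package

    return find_explicitly_unbolded(docx.Document(path))
```

The resulting list can then be used to locate the corresponding text in the HTML and strip the unwanted strong tags.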