mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Extracting section numbering in numbered lists #109

Closed deltamacht closed 2 years ago

deltamacht commented 3 years ago

I'm having what I assume is a simple problem but I have not been able to find the answer in the documentation or elsewhere. I have a file which has numbered sections (of Heading 2 style). When I convert to html these sections get marked with h2 headers as expected but I lose the section numbering. This is problematic because I have other Heading 2 sections that need to be processed differently and I have trouble distinguishing them.

I've attached a very simple minimal working example. Specifically, I want to extract the section numbers 1.1, 1.2, etc. from the document. These sections are currently output in the html as h2 sections but without the section numbers. I'm generating the html for comparison using the code:

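import mammoth
from bs4 import BeautifulSoup

word_doc_file = 'mwe_word_doc.docx'
# Map underlined runs to <u> elements.
style_map = "u => u"

# Mammoth expects the .docx file opened in binary mode.
with open(word_doc_file, 'rb') as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map=style_map)
    html = result.value

# prettify() already returns a str; write it out as UTF-8.
with open('mwe.html', 'w', encoding='utf-8') as f:
    f.write(BeautifulSoup(html, 'html.parser').prettify())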

Is there an option I need to specify to get those section numbers as well?

mwe_word_doc.docx

deltamacht commented 3 years ago

I just saw this closed issue (https://github.com/mwilliamson/python-mammoth/issues/56), which gives me the impression that perhaps this cannot be done with Mammoth. Is this still the case? If so, even if I can't extract the section numbers, would it be possible to detect that these items were 'styled' and thus perhaps part of some unknown numbering?

mwilliamson commented 3 years ago

It depends on exactly what styles you're using in your document and what output you want, but if you want the output to be an HTML list, you could use a style map along the lines of:

p[style-name='Heading 1']:unordered-list(1) => ul > li:fresh > h1:fresh
p[style-name='Heading 2']:unordered-list(2) => ul > li:fresh > h2:fresh

The problem with doing something like this is that the paragraphs between headings aren't part of the list in the original document, but you probably want them to be part of the list in the output HTML, so you'd need a style to identify paragraphs that should be part of the list. For instance:

p[style-name='Heading 1']:unordered-list(1) => ul > li:fresh > h1:fresh
p[style-name='Heading 2']:unordered-list(2) => ul > li > ul > li:fresh > h2:fresh
p[style-name='Section 2 content'] => ul > li > ul > li > p:fresh
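
As a rough sketch of how a multi-line style map like the one above might be passed to Mammoth from Python: convert_to_html takes the style map as a single string with one mapping per line. The helper name convert_with_list_headings is illustrative only, and 'Section 2 content' stands in for whatever paragraph style the document actually uses.

```python
# One mapping per line; Mammoth reads the style map as a single string.
STYLE_MAP = "\n".join([
    "p[style-name='Heading 1']:unordered-list(1) => ul > li:fresh > h1:fresh",
    "p[style-name='Heading 2']:unordered-list(2) => ul > li > ul > li:fresh > h2:fresh",
    "p[style-name='Section 2 content'] => ul > li > ul > li > p:fresh",
])


def convert_with_list_headings(docx_path):
    """Hypothetical helper: convert a .docx file using STYLE_MAP."""
    # Imported lazily so STYLE_MAP can be inspected without Mammoth installed.
    import mammoth

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=STYLE_MAP)
    # result.messages carries any warnings raised during conversion.
    return result.value, result.messages
```

Checking the returned messages is a cheap way to spot a style name in the map that does not match the document.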

If instead you want the section numbering as text, then there's currently no support for that, since the numbering is calculated by Word rather than stored in the text of the document. That isn't to say that Mammoth couldn't reproduce that logic, only that it doesn't do so currently. If that's the behaviour you want, the best answer at the moment would be to add some post-processing, for instance by finding all h1 elements and prepending an incrementing integer.
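
The post-processing suggested above could look roughly like this. It is a minimal sketch that uses only the standard library's re module rather than an HTML parser, and it assumes simple, well-formed heading tags in the converted HTML. The function name number_headings is illustrative. It numbers every h1 and h2 from scratch, so the result will not necessarily match Word's own numbering, for example where a list restarts or where some headings are unnumbered.

```python
import re

# Opening <h1> or <h2> tag, with or without attributes.
HEADING_OPEN = re.compile(r"<h([12])(\s[^>]*)?>")


def number_headings(html):
    """Prepend "1", "1.1", "1.2", "2", ... to h1/h2 headings (illustrative)."""
    counters = [0, 0]  # [h1 count, h2 count under the current h1]

    def add_number(match):
        level = int(match.group(1))
        if level == 1:
            counters[0] += 1
            counters[1] = 0  # restart h2 numbering under each new h1
            label = str(counters[0])
        else:
            counters[1] += 1
            label = "{}.{}".format(counters[0], counters[1])
        # Keep the original opening tag and put the number right after it.
        return match.group(0) + label + " "

    return HEADING_OPEN.sub(add_number, html)
```

The same counters could be driven from a BeautifulSoup traversal instead, which would be more robust if the HTML has been edited by hand.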

deltamacht commented 2 years ago

Thank you for the information, @mwilliamson. I have a follow-up question, if you don't mind. Could you briefly comment on how unordered-list(1) differs from unordered-list(2)?

These answers will also help me with a new problem I'm dealing with. I have a test document attached nested_list_doc.docx

Parsing this file with mammoth gives me:

<ol>
 <li>
  Section 1
  <ol>
   <li>
    Section 1.1
   </li>
  </ol>
 </li>
</ol>
<p>
 This is normal text. We’ll add some bullets in here:
</p>
<ul>
 <li>
  Comment 1
 </li>
 <li>
  Comment 2
 </li>
 <li>
  Comment 3
  <ol>
   <li>
    Section 1.2
   </li>
  </ol>
 </li>
</ul>
<p>
 Here is some more normal text for s 1.2
</p>
<ol>
 <li>
  Section 2
 </li>
</ol>
<p>
 Section 2 text stuff
</p>

As you can see "Section 1.1" is a sub-ordered list of Section 1. However, I'd like it to show up as a new <ol> element like so:

<ol>
 <li>
  Section 1
 </li>
</ol>
<ol>
 <li>
  Section 1.1
 </li>
</ol>

I can get this effect by creating a style map item p.ListParagraph => ol:fresh > li:fresh. However, that doesn't do what I want because that turns the bulleted comments into

<ol>
 <li>
  Comment 1
 </li>
</ol>
<ol>
 <li>
  Comment 2
 </li>
</ol>
<ol>
 <li>
  Comment 3
 </li>
</ol>

whereas I was happy with how the default style map handled the <ul> elements. Is it possible to achieve what I want here? I don't fully understand how the style map specs differ between ListParagraph, matchers like ordered-list(2), and the p.levelone type specs which were listed in a solution here. Could you explain these differences and how I might achieve what I described above? It seems like I need to be able to distinguish between ordered and unordered lists when creating the mapping, but I'm not sure if that's possible.

Possibly relatedly, can you explain the syntax used in the file options.py:

p:unordered-list(2) => ul|ol > li > ul > li:fresh

In particular, what does the ul|ol syntax mean?

deltamacht commented 2 years ago

Ah, with some iteration I think I worked out what these things mean and got it working for my use case. Closing.

peter-dobson-ds commented 2 years ago

@deltamacht How did you resolve this issue? I can see that the data I need is in the HTML. I'm working on code to digest the HTML and modify it, and I thought you might have a snippet of code that could help me.

deltamacht commented 2 years ago

I sort of hijacked this thread with a couple of different questions that I was looking at, so just to be clear, what issue are you trying to resolve specifically? Are you trying to get section numbers? If they are in the HTML, then it sounds like the Word document had them explicitly typed as content (rather than using a Word list style). If so, you just need to parse the HTML with something like BeautifulSoup and grab whatever you need. If that's not what you are trying to do, please clarify and I'll see if I can offer some help.

peter-dobson-ds commented 2 years ago

I figured out the solution for my issue.

I'm writing code that converts collections of Word documents that make up a print publication into many small .htm files, plus a few data files that feed into our CMS for online book publishing at ceb.com. When the authors publish an update to the book, I run the whole process again.

First the Word files are run through Mammoth and then, as you suggested, Beautiful Soup. I figured out that whenever I see an <ol> or <ul> at the top level with an anchor inside it, it's a marker for a numbered section. I then parse each of these to get the first-level, second-level, etc. section names. That's probably as far as I need to go: the HTML between each of these blocks gets written to a separate HTML file, with a few lines prefixed to show the section headings. I'll have to recalculate the numbers for the section headings, and do a few things to keep a table of contents organized.
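
A minimal sketch of the marker detection described above, using only the standard library's html.parser in place of Beautiful Soup so that it runs without extra dependencies. The names SectionSplitter and split_sections are hypothetical, and the sketch assumes well-formed HTML in which void elements are written self-closed (for example <br />).

```python
from html import escape
from html.parser import HTMLParser


class SectionSplitter(HTMLParser):
    """Start a new section at each top-level <ol>/<ul> containing an <a>."""

    def __init__(self):
        super().__init__()
        self.depth = 0           # number of currently open elements
        self.raw = []            # raw HTML of the current top-level element
        self.text = []           # text content of the current top-level element
        self.top_tag = None
        self.has_anchor = False
        # The first entry collects anything that precedes the first marker.
        self.sections = [{"title": None, "body": []}]

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.top_tag, self.has_anchor = tag, False
            self.raw, self.text = [], []
        elif tag == "a":
            self.has_anchor = True
        self.depth += 1
        self.raw.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br /> do not change the nesting depth.
        if self.depth == 0:
            self.sections[-1]["body"].append(self.get_starttag_text())
        else:
            self.raw.append(self.get_starttag_text())

    def handle_data(self, data):
        if self.depth == 0:
            return  # ignore whitespace between top-level elements
        self.raw.append(escape(data, quote=False))
        self.text.append(data)

    def handle_endtag(self, tag):
        self.raw.append("</{}>".format(tag))
        self.depth -= 1
        if self.depth == 0:
            self._finish_top_level()

    def _finish_top_level(self):
        if self.top_tag in ("ol", "ul") and self.has_anchor:
            # Marker list: its text becomes the title of a new section.
            title = " ".join("".join(self.text).split())
            self.sections.append({"title": title, "body": []})
        else:
            self.sections[-1]["body"].append("".join(self.raw))


def split_sections(html):
    """Return a list of {"title", "body"} dicts, one per detected section."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return [s for s in parser.sections if s["title"] or s["body"]]
```

Each entry's body is a list of HTML fragments that could be joined and written to its own file, with the title used for the prefixed heading lines.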

I'm really pleased with mammoth as a tool for helping with this work. It's so much better than previous tools - export from Word to .htm and then manipulate with Java and JSoup.