mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Numbered lists where the start number is not 1 #27

Open dividor opened 7 years ago

dividor commented 7 years ago

Hi there,

We have some legacy documents in which the authors started a numbered list at "1", then inserted a bulleted list and a table, then added another numbered list item whose number is set to "2". When parsing with Mammoth, this second numbered list item is rendered as "1".

I tried leaving :fresh out of the style mapping ...

p[style-name='Numbered List'] => ol > li

But no luck.

Is there a way to persist the numbering from the word document please?

thanks!

mwilliamson commented 7 years ago

Unfortunately not. If there's a sensible way to implement this, then pull requests are welcome.

dividor commented 7 years ago

Just to note, I resolved this with some really hairy parsing using Beautiful Soup, so I don't need a Mammoth fix at this time.
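
The parsing code itself was not posted. As a rough illustration of this kind of post-processing, here is a minimal sketch that uses the standard-library re module rather than Beautiful Soup. It assumes the generated HTML contains only flat (non-nested) ordered lists written as a bare <ol> tag, and that every ordered list should continue counting from the previous one; both are assumptions, and the function name continue_numbering is hypothetical.

```python
import re


def continue_numbering(html):
    """Add start attributes so that each flat <ol> continues the previous one.

    Assumes no nested ordered lists and no attributes on the <ol> tags.
    """
    count = 0  # number of <li> items seen in earlier ordered lists

    def add_start(match):
        nonlocal count
        start = count + 1
        count += len(re.findall(r"<li\b", match.group(0)))
        if start == 1:
            # The first list already starts at 1, so leave it untouched.
            return match.group(0)
        return match.group(0).replace("<ol>", '<ol start="%d">' % start, 1)

    return re.sub(r"<ol>.*?</ol>", add_start, html, flags=re.S)
```

For example, given two one-item ordered lists separated by a bulleted list, the second <ol> becomes <ol start="2"> while the <ul> in between is left alone. Documents that contain genuinely separate numbered lists would need extra logic to decide which lists continue and which restart.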

dividor commented 7 years ago

In case it's useful, attached is a text document to illustrate.

Continuing_lists.docx

dividor commented 7 years ago

Just to note - we're living without this feature just fine. Feel free to close.

zt50tz commented 5 years ago

I have the same need.

I tried adding a num_id attribute to the _NumberingLevel class and setting it in read_numbering_xml_element. Here is the test code:

def read_numbering_xml_element(element):
    abstract_nums = _read_abstract_nums(element)
    nums = _read_nums(element, abstract_nums)
    not_abstract_num_ids = set(nums) - set(abstract_nums)
    for not_abstract_num_id in not_abstract_num_ids:
        for level in nums[not_abstract_num_id]:
            nums[not_abstract_num_id][level].num_id = not_abstract_num_id
    return Numbering(nums)

So, in transform_document, I can get num_id from the numbering attribute of the element and try to do something with it.

def transform_items(element):
    if isinstance(element, documents.Paragraph):
        if element.numbering and element.numbering.num_id:
            print(element)

This prints:

Paragraph(...,
  style_id=u'aa',
  style_name=u'List Paragraph',
  numbering=_NumberingLevel(level_index='0', is_ordered=True, num_id=u'41'),
  alignment=None,..
)

But if this line executes:

nums[not_abstract_num_id][level].num_id = not_abstract_num_id

the HTML writer emits a p tag instead of ol.

I think this addition corrupts the default style map. What can I do about it? Or is this totally the wrong approach, and should I be looking at something else?

Thanks.

TychonautVII commented 4 years ago

I'd be interested in this feature too! Getting the numbers of lists (and, in my case, footnotes and references) from Word seems to fit what I understand to be Mammoth's philosophy of getting the content from Word (but not necessarily the style): the specific number in a footnote or list seems to be a content thing!

I tried to do this with the transform API, but I don't think I can.

Is there any lower-level way to access the content of the Word document in Mammoth? I'd like to figure out what number they started at.
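
The thread does not show a Mammoth API for this. One option outside Mammoth is to read the numbering definitions directly, since a .docx file is a zip archive and list definitions live in word/numbering.xml. The following is a minimal sketch using only the standard library; the function names are hypothetical. It reports only the declared start value for each (numId, level) pair, including any startOverride. The number Word actually displays for a given paragraph also depends on how many earlier paragraphs share the same numId, which this sketch does not compute.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by numbering.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _w(name):
    """Return the namespace-qualified form of a w: element or attribute name."""
    return "{%s}%s" % (W, name)


def read_start_numbers(numbering_xml):
    """Map (numId, ilvl) to the declared start number, as strings and an int."""
    root = ET.fromstring(numbering_xml)

    # Start values declared on each level of each abstract definition.
    abstract_starts = {}
    for abstract_num in root.findall(_w("abstractNum")):
        abstract_id = abstract_num.get(_w("abstractNumId"))
        for lvl in abstract_num.findall(_w("lvl")):
            start = lvl.find(_w("start"))
            if start is not None:
                key = (abstract_id, lvl.get(_w("ilvl")))
                abstract_starts[key] = int(start.get(_w("val")))

    starts = {}
    for num in root.findall(_w("num")):
        num_id = num.get(_w("numId"))
        abstract_ref = num.find(_w("abstractNumId"))
        if abstract_ref is None:
            continue
        abstract_id = abstract_ref.get(_w("val"))
        # Inherit the start values from the referenced abstract definition.
        for (a_id, ilvl), value in abstract_starts.items():
            if a_id == abstract_id:
                starts[(num_id, ilvl)] = value
        # A per-list startOverride takes precedence over the inherited value.
        for override in num.findall(_w("lvlOverride")):
            start_override = override.find(_w("startOverride"))
            if start_override is not None:
                key = (num_id, override.get(_w("ilvl")))
                starts[key] = int(start_override.get(_w("val")))
    return starts


def read_docx_start_numbers(docx_path):
    """Read word/numbering.xml from a .docx file and return its start numbers."""
    with zipfile.ZipFile(docx_path) as docx:
        return read_start_numbers(docx.read("word/numbering.xml"))
```

The keys match the numId and ilvl values that appear in each paragraph's numPr element in word/document.xml, so the result could be combined with a pass over that file to work out which list a paragraph belongs to.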