mwilliamson / mammoth.js

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Retain numbering id in list paragraphs #267

Open tripodsan opened 3 years ago

tripodsan commented 3 years ago

Assume you have a document with two level-0 ordered lists:

1. one
2. two
Something else
3. three
Something else
1. one
2. two

The numbering information provided in the AST node does not identify which list a paragraph belongs to (it has no numbering ID), so it's not possible to tell that the first list continues after the non-list paragraph.

document[8]
├─0 paragraph[1]
│   │ styleId: "ListParagraph"
│   │ numbering: {"isOrdered":true,"level":"0"}
│   └─0 run[1]
│       └─0 text "One"
├─1 paragraph[1]
│   │ styleId: "ListParagraph"
│   │ numbering: {"isOrdered":true,"level":"0"}
│   └─0 run[1]
│       └─0 text "Two"
├─2 paragraph[1]
│   │ styleId: "Normal"
│   └─0 run[1]
│       └─0 text "Something else"
├─3 paragraph[1]
│   │ styleId: "ListParagraph"
│   │ numbering: {"isOrdered":true,"level":"0"}
│   └─0 run[1]
│       └─0 text "Three"
├─4 paragraph[1]
│   └─0 run[1]
│       └─0 text "Something else"
├─5 paragraph[1]
│   │ styleId: "ListParagraph"
│   │ numbering: {"isOrdered":true,"level":"0"}
│   └─0 run[1]
│       └─0 text "One"
├─6 paragraph[1]
│   │ styleId: "ListParagraph"
│   │ numbering: {"isOrdered":true,"level":"0"}
│   └─0 run[1]
│       └─0 text " Two"
└─7 paragraph[0]

If the numId were added to the numbering information, it would be possible to detect the continuation.

document[8]
├─0 paragraph[1]
│   │ styleId: "ListParagraph"
│   │ numbering: {"isOrdered":true,"level":"0","numId":"1"}
│   └─0 run[1]
│       └─0 text "One"
├─1 paragraph[1]
│   │ styleId: "ListParagraph"
│   │ numbering: {"isOrdered":true,"level":"0","numId":"1"}
│   └─0 run[1]
│       └─0 text "Two"
├─2 paragraph[1]
│   │ styleId: "Normal"
│   └─0 run[1]
│       └─0 text "Something else"
├─3 paragraph[1]
│   │ styleId: "ListParagraph"
│   │ numbering: {"isOrdered":true,"level":"0","numId":"1"}
│   └─0 run[1]
│       └─0 text "Three"
├─4 paragraph[1]
│   └─0 run[1]
│       └─0 text "Something else"
├─5 paragraph[1]
│   │ styleId: "ListParagraph"
│   │ numbering: {"isOrdered":true,"level":"0","numId":"4"}      <<========
│   └─0 run[1]
│       └─0 text "One"
├─6 paragraph[1]
│   │ styleId: "ListParagraph"
│   │ numbering: {"isOrdered":true,"level":"0","numId":"4"}
│   └─0 run[1]
│       └─0 text " Two"
└─7 paragraph[0]
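
To illustrate how a consumer of the tree might use the numId, here is a minimal sketch. It is not part of mammoth: `assignOrdinals` is a hypothetical helper, and the flat `{text, numbering}` paragraph objects are a simplification of the real tree. It also assumes every list starts at 1 and ignores start overrides and the resetting of nested levels.

```javascript
// Hypothetical helper (not a mammoth API): gives each ordered list paragraph
// a 1-based ordinal by counting the paragraphs seen so far for the same
// numId and level. Assumes lists start at 1; start overrides are ignored.
function assignOrdinals(paragraphs) {
    var counters = {};
    return paragraphs.map(function(paragraph) {
        var numbering = paragraph.numbering;
        if (!numbering || !numbering.isOrdered || numbering.numId == null) {
            // Not an ordered list paragraph, or no numId to key on.
            return {text: paragraph.text, ordinal: null};
        }
        var key = numbering.numId + ":" + numbering.level;
        counters[key] = (counters[key] || 0) + 1;
        return {text: paragraph.text, ordinal: counters[key]};
    });
}

// Simplified stand-in for the tree above (runs and style IDs omitted).
var paragraphs = [
    {text: "One", numbering: {isOrdered: true, level: "0", numId: "1"}},
    {text: "Two", numbering: {isOrdered: true, level: "0", numId: "1"}},
    {text: "Something else"},
    {text: "Three", numbering: {isOrdered: true, level: "0", numId: "1"}},
    {text: "Something else"},
    {text: "One", numbering: {isOrdered: true, level: "0", numId: "4"}},
    {text: "Two", numbering: {isOrdered: true, level: "0", numId: "4"}}
];

var ordinals = assignOrdinals(paragraphs).map(function(p) {
    return p.ordinal;
});
// ordinals is [1, 2, null, 3, null, 1, 2]
```

"Three" gets ordinal 3 because it shares numId "1" with the first two items, while the paragraphs with numId "4" start again at 1.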

diff --git a/lib/docx/numbering-xml.js b/lib/docx/numbering-xml.js
index 64c4210..a68fbf4 100644
--- a/lib/docx/numbering-xml.js
+++ b/lib/docx/numbering-xml.js
@@ -15,13 +15,16 @@ function Numbering(nums, abstractNums, styles) {
         }),
         "paragraphStyleId"
     );

     function findLevel(numId, level) {
         var num = nums[numId];
         if (num) {
             var abstractNum = abstractNums[num.abstractNumId];
             if (abstractNum.numStyleLink == null) {
-                return abstractNums[num.abstractNumId].levels[level];
+                var lvl = abstractNums[num.abstractNumId].levels[level];
+                return Object.assign({numId: numId}, lvl);
             } else {
                 var style = styles.findNumberingStyleById(abstractNum.numStyleLink);
                 return findLevel(style.numId, level);

hoang commented 3 years ago

Hello @tripodsan, I'm facing the same issue. Could you please share your solution?

tripodsan commented 3 years ago

> Hello @tripodsan, I'm facing the same issue. Could you please share your solution?

I have a fork: https://github.com/adobe-rnd/mammoth.js/tree/bleeding that uses my suggestion above: https://github.com/adobe-rnd/mammoth.js/commit/60a679eb0c0599c7b0f5d2ca83fbb1c55b84c73e

it's released as: https://www.npmjs.com/package/@adobe/mammoth/v/1.4.15-bleeding.1

VictorBaron commented 2 years ago

Hey @tripodsan! Your solution seems perfect! Congrats and thanks for sharing it!

Wondering why this is a fork and not a PR on this repo? Obviously we'd prefer to use the original package in production.

Could you make a PR with your work here? If not, could I make one myself, using your work?

Have the best day!

tripodsan commented 2 years ago

hi @VictorBaron. I can't remember why I didn't submit the PR... but I will create one asap.

hmnd commented 2 years ago

@tripodsan Hey, thanks so much for your fix! Just curious if you still plan on making a PR?

tripodsan commented 2 years ago

> @tripodsan Hey, thanks so much for your fix! Just curious if you still plan on making a PR?

sorry @hmnd, I was preoccupied with other things... I'll take a look at it now.

mwilliamson commented 1 year ago

If anyone who's interested in this issue could post a minimal example document, the expected HTML, and the actual HTML, then that would be helpful.

Also, since the suggestion here is just to add the numbering ID, then presumably there are other things being done e.g. a document transform? I'm reluctant just to add the numbering ID without understanding how that actually solves the problem.

tripodsan commented 1 year ago

> I'm reluctant just to add the numbering ID without understanding how that actually solves the problem.

it only solves half of the problem - where the document tree is used for further processing (e.g. generating markdown). it doesn't include a solution for the HTML rendering.

mwilliamson commented 1 year ago

> > I'm reluctant just to add the numbering ID without understanding how that actually solves the problem.
>
> it only solves half of the problem - where the document tree is used for further processing (e.g. generating markdown). it doesn't include a solution for the HTML rendering.

Given this is a library for generating HTML, that feels like a pretty important part!

It would also be useful to see examples of the further processing so that I can understand how this would be used in context.

tripodsan commented 1 year ago

> Given this is a library for generating HTML, that feels like a pretty important part!

I think it's a great library for parsing docx and turning it into a syntax tree. the HTML generation is a nice side effect :-)

> It would also be useful to see examples of the further processing so that I can understand how this would be used in context.

it is a bit complicated to explain in a short code snippet (I invited you to our repo)....

Anyways, I will come up with a PR that includes the OL support with numbering problems across lists.

mwilliamson commented 1 year ago

> Anyways, I will come up with a PR that includes the OL support with numbering problems across lists.

I'm not generally accepting pull requests at the moment, since it usually ends up taking more time and effort (due to rounds of review, and having to port the changes to multiple implementations). Discussions of the high level approach are welcome though.

kiejo commented 1 year ago

I'm running into a similar issue where I would like the generated HTML to handle list continuations. I think the most semantic way to address this in HTML would be to use the start attribute for the ol elements.

Input:

1. one
2. two
Something else
3. three
4. four
Something else
1. one
2. two

Expected output:

<ol>
  <li>one</li>
  <li>two</li>
</ol>
<p>Something else</p>
<ol start="3">
  <li>three</li>
  <li>four</li>
</ol>
<p>Something else</p>
<ol>
  <li>one</li>
  <li>two</li>
</ol>

The high level logic would be to set the start attribute of any ol element that does not start with 1.
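
A minimal sketch of that logic, for illustration only. It assumes each list paragraph carries a numbering ID (here a hypothetical `numId` field; the values "1" and "4" are made up for this example), handles a single list level, and assumes every list starts at 1. `renderBlocks` and `escapeHtml` are hypothetical helpers, not mammoth functions.

```javascript
// Escape the characters that are significant in HTML text content.
function escapeHtml(text) {
    return text
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

// Hypothetical renderer: blocks with a numId become <li> elements, all other
// blocks become <p>. A list resuming an earlier numId gets a start attribute.
function renderBlocks(blocks) {
    var counts = {};         // number of items emitted so far per numId
    var lines = [];
    var currentNumId = null; // numId of the <ol> that is currently open

    function closeList() {
        if (currentNumId !== null) {
            lines.push("</ol>");
            currentNumId = null;
        }
    }

    blocks.forEach(function(block) {
        if (block.numId == null) {
            closeList();
            lines.push("<p>" + escapeHtml(block.text) + "</p>");
            return;
        }
        if (block.numId !== currentNumId) {
            closeList();
            var start = (counts[block.numId] || 0) + 1;
            lines.push(start === 1 ? "<ol>" : '<ol start="' + start + '">');
            currentNumId = block.numId;
        }
        counts[block.numId] = (counts[block.numId] || 0) + 1;
        lines.push("  <li>" + escapeHtml(block.text) + "</li>");
    });
    closeList();
    return lines.join("\n");
}

var blocks = [
    {text: "one", numId: "1"},
    {text: "two", numId: "1"},
    {text: "Something else"},
    {text: "three", numId: "1"},
    {text: "four", numId: "1"},
    {text: "Something else"},
    {text: "one", numId: "4"},
    {text: "two", numId: "4"}
];

var html = renderBlocks(blocks);
```

For the input above, `html` matches the expected output shown earlier in this comment: the second list opens with `<ol start="3">` and the third with a plain `<ol>`.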

inimeseke commented 5 months ago

Any updates on this? It's been over half a year since the last comment and over 1.5 years since the maintainer last commented on this.

mwilliamson commented 5 months ago

As above, minimal example documents, along with the actual and expected HTML, would be helpful.

inimeseke commented 5 months ago

I believe that issue #394 contains these examples.

Like @kiejo wrote above, the high level logic would be to set the start attribute of any <ol> element that does not start with 1.