plutext / docx4j-ImportXHTML

Converts XHTML to OpenXML WordML (docx) using docx4j

Table with long content gets cut off at the bottom of the page #78

Closed fcjm closed 2 years ago

fcjm commented 2 years ago

Hi,

We found a bug in the current 8.2.1 release. When the content of a "td" in a "table" element gets too long, it is cut off at the bottom of the page, so the rest of the content is invisible to the user. This seems to affect only tables with multiple columns; I was not able to reproduce it with a single-column table.

I did some research and created a reproducer, available at https://github.com/fcjm/docx4jImportXHTMLExample. It takes the input .xhtml files in the resources/input folder, converts them to .docx documents, and writes them to the resources/output folder. It comes with 5 example cases, one of which uses a single-column table (ex3.xhtml). As noted above, that case behaves as expected.

With this lightweight reproducer I was also able to check older versions, and it appears that the breaking change happened somewhere between v8.0.0 and v8.2.0. Unfortunately I was not able to get every commit to build locally, but I suspect it has something to do with commit 5e952c5a and its changes to both XHTMLImporterImpl and TableHelper. Again, I am not 100% sure about this.

EXPECTED (with v8.0.0)

[Screenshot: docx4j-1]

ACTUAL (with v8.2.1)

[Screenshot: docx4j-3]

Is there any chance that a fix for this could be implemented in a future release? For now we will keep working with the v8.0.0 release.

fcjm commented 2 years ago

I have taken another look at this, more specifically at the resulting word/document.xml. The trHeight property differs between v8.0.0 and v8.2.1. In v8.0.0 its value was always 30, while from v8.2.0 onwards it is 35310.

[Screenshot: diff]

I tried changing the trHeight property back to 30 (edited by hand in the resulting document.xml), and that fixed the issue.

Following up on that, I searched the project for where the trHeight property is used. It is set in the TableHelper.setupTrPr() method. Debugging shows that the trBox passed to this method has a large height, which is then used to calculate the height of the row.
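The hand edit described above could also be automated as a post-processing step. The following is only a minimal sketch, not part of docx4j-ImportXHTML: the class name TrHeightStripper and the method stripTrHeight are hypothetical, and it assumes the contents of word/document.xml are available as a string. It removes every w:trHeight element so that Word determines the row height from the cell contents, rather than resetting the value to 30.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of docx4j or docx4j-ImportXHTML. */
public final class TrHeightStripper {

    /** WordprocessingML main namespace (the one bound to the "w" prefix). */
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private TrHeightStripper() {}

    /** Returns the given document XML with all w:trHeight elements removed. */
    public static String stripTrHeight(String documentXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for getElementsByTagNameNS
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));

            // The NodeList is live, so walk it backwards while removing nodes.
            NodeList hits = doc.getElementsByTagNameNS(W_NS, "trHeight");
            for (int i = hits.getLength() - 1; i >= 0; i--) {
                Node n = hits.item(i);
                n.getParentNode().removeChild(n);
            }

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped so callers do not have to deal with checked XML exceptions.
            throw new IllegalStateException("Could not process document XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:tbl xmlns:w=\"" + W_NS + "\"><w:tr><w:trPr>"
                + "<w:trHeight w:val=\"35310\"/></w:trPr><w:tc/></w:tr></w:tbl>";
        System.out.println(stripTrHeight(xml));
    }
}
```

An empty w:trPr element is left behind in this sketch; the row and its cells are otherwise untouched.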

plutext commented 2 years ago

Looking into this, the incoming units are still 1/20 of a pixel, so the code is fundamentally ok.

What we could add is a check for very large values: for example, if the trHeight is greater than the page height, omit it (i.e. leave it up to Word to determine the height based on the contents).
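A check along those lines might look like the sketch below. It is not the project's actual code: RowHeightGuard and rowHeightOrOmit are hypothetical names, and it assumes both the computed row height and the page height are already expressed in twentieths of a point, the unit used by w:trHeight. The A4 constant is only an example; real code would presumably read the page height from the document's section properties. For scale, the value 35310 reported above corresponds to roughly 24.5 inches, more than twice the height of an A4 page (16838).

```java
import java.util.OptionalInt;

/** Hypothetical helper for illustration; not part of docx4j-ImportXHTML. */
public final class RowHeightGuard {

    /** A4 portrait page height in twentieths of a point (297 mm). */
    public static final int A4_PAGE_HEIGHT_TWIPS = 16838;

    private RowHeightGuard() {}

    /**
     * Returns the height to write as w:trHeight, or an empty result when the
     * computed height is not positive or exceeds the page height. An empty
     * result means: write no trHeight and let Word size the row itself.
     */
    public static OptionalInt rowHeightOrOmit(int computedTwips, int pageHeightTwips) {
        if (computedTwips <= 0 || computedTwips > pageHeightTwips) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(computedTwips);
    }

    public static void main(String[] args) {
        // The two values reported in this issue, checked against an A4 page.
        System.out.println(rowHeightOrOmit(30, A4_PAGE_HEIGHT_TWIPS));
        System.out.println(rowHeightOrOmit(35310, A4_PAGE_HEIGHT_TWIPS));
    }
}
```

A guard like this would not catch a height that is too large for the row's content yet still smaller than the page, so it narrows the problem rather than ruling it out.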

fcjm commented 2 years ago

Thank you for the answer.

I should add that in some rare cases within our project the trHeight is too big for its content, but not greater than the page height. Unfortunately I was not able to reproduce this in a small example, so it might be caused by something on our end.

We solved all of this by commenting out the code responsible for setting the trHeight. Word is now always responsible for determining the row heights, and everything behaves as it should again.