jgm / pandoc

Universal markup converter
https://pandoc.org

Compensate for merged table cells in docx #4672

Open mako4 opened 6 years ago

mako4 commented 6 years ago

Since pandoc doesn't support merged cells in its native intermediate representation, any merged cells in a docx file are lost. That in itself is fine; the problem is that the loss breaks the cell alignment for all following cells in the same row (or, in formats like asciidoc, for every cell after it in the entire table, as there is no explicit row delimiter).
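To illustrate the shift described above, here is a minimal sketch, not pandoc code. It assumes a three-column table whose column count is fixed (for instance by a cols attribute) and a hypothetical helper, flow_cells, that assigns cells to rows purely by position, which is roughly what happens in a format without explicit row delimiters.

```python
def flow_cells(cells, columns):
    """Group a flat list of cells into rows of `columns` cells each,
    i.e. assign cells to rows by position only."""
    return [cells[i:i + columns] for i in range(0, len(cells), columns)]


# Rows as a reader would produce them after dropping the B1:C1 merge:
# the first row contributes two cells instead of three.
rows_from_reader = [
    ["A1", "B1:C1"],
    ["A2", "B2", "C2"],
    ["A3", "B3", "C3"],
]

flat = [cell for row in rows_from_reader for cell in row]
shifted = flow_cells(flat, 3)

for row in shifted:
    print(row)
```

With one cell missing from the first row, A2 ends up in the first row and every later cell moves one position to the left.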

It would be great if the docx reader injected empty cells to compensate for any merged cells, so that the overall table structure is preserved. Note that this only affects horizontal merging: for vertical merging, Word already includes placeholder elements in the affected rows, which pandoc just treats as empty cells, so it works out of the box.

I guess it would require a change in the elemToCell / elemToRow functions in Readers/Docx/Parse.hs.

Regarding the technical details, docx uses a w:gridSpan element inside the cell properties (w:tcPr) to indicate horizontally merged cells:

<w:tc>
  <w:tcPr>
    <w:tcW w:w="3132" w:type="dxa"/>
    <w:gridSpan w:val="2"/>
  </w:tcPr>
  <w:p w:rsidR="009C738A" w:rsidRDefault="009C738A" w:rsidP="009C738A">
    <w:r><w:t>B1:C1</w:t></w:r>
  </w:p>
</w:tc>
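The requested compensation could look roughly like the following. This is only a sketch in Python using the standard library, not the pandoc reader (which is written in Haskell); the function name expand_row and the sample row are made up for illustration. It reads w:gridSpan from each cell and appends one empty placeholder per extra grid column covered.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by the w: prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def expand_row(tr):
    """Return the row's cell texts, adding one empty string for each
    extra grid column covered by a w:gridSpan."""
    cells = []
    for tc in tr.findall("w:tc", NS):
        # Concatenate all text runs in the cell (paragraph breaks are ignored here).
        text = "".join(t.text or "" for t in tc.iter(f"{{{W}}}t"))
        span = tc.find("w:tcPr/w:gridSpan", NS)
        width = int(span.get(f"{{{W}}}val", "1")) if span is not None else 1
        cells.append(text)
        cells.extend([""] * (width - 1))  # placeholders for the merged columns
    return cells


# Hypothetical row: A1 followed by a cell spanning two grid columns.
ROW_XML = f"""
<w:tr xmlns:w="{W}">
  <w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc>
  <w:tc>
    <w:tcPr><w:gridSpan w:val="2"/></w:tcPr>
    <w:p><w:r><w:t>B1:C1</w:t></w:r></w:p>
  </w:tc>
</w:tr>
""".strip()

padded = expand_row(ET.fromstring(ROW_XML))
print(padded)
```

The row then has three entries again, so the following rows keep their alignment.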

Sample file: Pandoc_table_test.docx

Vittaly commented 6 years ago

Good day! I have the same problem too.

When I convert HTML to docx, the merge is lost for any table cell that has a colspan attribute.

dabro666 commented 5 years ago

Hi, @mako4, @Vittaly,

I don't know Haskell, so I can't rewrite the docx reader/writer to solve your problems. But I have some workarounds that may help you.

I have a Python script for DocBook (XML) -> docx conversion. As far as I can tell, pandoc ignores the span information for merged cells (namest, nameend, morerows in DocBook, or colspan, rowspan in HTML), so it does not create cells for the span in the docx at all; MS Word then automatically adds empty cells at the end of the row. You can simply add empty cells in the correct places in the XML/HTML and create the docx. Then you can use python-docx to merge the cells.
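The scripts themselves are not shown in this thread, so the following is only a guess at the first half of that workaround. It is a sketch, assuming well-formed XHTML input and handling only colspan (rowspan and DocBook's namest/nameend are left out). The helper name expand_colspans is hypothetical. It replaces each colspan with explicit empty cells and records which column ranges were merged, so they can be merged again in the generated docx.

```python
import xml.etree.ElementTree as ET


def expand_colspans(table):
    """Replace each colspan with explicit empty cells, in place.

    Returns a list of (row, first_col, last_col) tuples describing the
    merges that were removed. Rowspans are not taken into account."""
    merges = []
    for r, tr in enumerate(table.iter("tr")):
        col = 0
        for cell in list(tr):  # snapshot, since the row is modified below
            if cell.tag not in ("td", "th"):
                continue
            span = int(cell.attrib.pop("colspan", "1"))
            if span > 1:
                merges.append((r, col, col + span - 1))
                pos = list(tr).index(cell)
                for k in range(1, span):
                    tr.insert(pos + k, ET.Element(cell.tag))
            col += span
    return merges


html = (
    "<table>"
    "<tr><td>A1</td><td colspan='2'>B1:C1</td></tr>"
    "<tr><td>A2</td><td>B2</td><td>C2</td></tr>"
    "</table>"
)
table = ET.fromstring(html)
merges = expand_colspans(table)
result = ET.tostring(table, encoding="unicode")
print(merges)
print(result)

# Each recorded range could afterwards be re-merged with python-docx,
# e.g. tbl.cell(r, first).merge(tbl.cell(r, last)) on the matching table.
```

The padded HTML is what gets passed to pandoc; the merges list is kept for the second step with python-docx.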

I also have a Python script that converts all tables in a docx to asciidoc. For me it takes less than 5 minutes on a docx with 500+ pages and 500+ tables.

The scripts are not clean, but they work, so I can share them.

wiejakp commented 3 years ago

@dabro666, can I please ask for your script?

dabro666 commented 3 years ago

@wiejakp, which one do you need? docbook->docx or docx->asciidoc?