jgm / pandoc

Universal markup converter
https://pandoc.org

Text from Shape Format is not extracted #9214

Closed StephanMeijer closed 1 year ago

StephanMeijer commented 1 year ago

Explain the problem.

Text in Shape Format is not extracted

Example:

[Screenshot: example of a document with text in a Shape Format]

document.xml (text content only):

```
center bottom
Last update: SAVEDATE \@ "MMMM d, yyyy" \* MERGEFORMAT May 1, 2017
100000 0
Last update: SAVEDATE \@ "MMMM d, yyyy" \* MERGEFORMAT May 1, 2017
center center
Using Microsoft Word 2007/2010 for Writing Technical Documents
Valter Kiisk
Institute of Physics, University of Tartu
100000 0
Using Microsoft Word 2007/2010 for Writing Technical Documents
Valter Kiisk
Institute of Physics, University of Tartu
```

MsWord.docx

Pandoc version: 3.1.9

jgm commented 1 year ago

You're converting from what to what?

What do you mean by "shape format"?

StephanMeijer commented 1 year ago

From Docx to HTML.


Shape Format is a Microsoft Word feature that lets users freely position text, use WordArt, and position images, among other things. It is a feature that probably shouldn't be used.

I'm currently working on a PR for pandoc to investigate this and, if possible, fix it.

StephanMeijer commented 1 year ago

This would probably require some fairly invasive changes in src/Text/Pandoc/Readers/Docx/Parse.hs, because the parsing logic has to change: a w:p can also contain further paragraphs, not only runs.
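To illustrate the nesting described above, here is a minimal Python sketch. It is not pandoc's implementation and it does not use the attached document. The XML fragment is a hypothetical, heavily simplified example: real files wrap the text box in mc:AlternateContent, w:drawing, and several DrawingML layers that are left out here. The helper name `direct_text` is also made up for this example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML and wordprocessingShape namespaces.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WPS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape"

# Hypothetical fragment: a paragraph whose second run holds a text box,
# and the text box content holds another paragraph.
SAMPLE = f"""
<w:body xmlns:w="{W}" xmlns:wps="{WPS}">
  <w:p>
    <w:r><w:t>Outer text. </w:t></w:r>
    <w:r>
      <wps:txbx>
        <w:txbxContent>
          <w:p><w:r><w:t>Text inside the shape</w:t></w:r></w:p>
        </w:txbxContent>
      </wps:txbx>
    </w:r>
  </w:p>
</w:body>
"""


def direct_text(paragraph):
    """Join the w:t text of runs that are direct children of the paragraph."""
    return "".join(t.text or "" for t in paragraph.findall(f"{{{W}}}r/{{{W}}}t"))


root = ET.fromstring(SAMPLE)

# Runs-only view: look at top-level paragraphs and their direct runs.
top_level = [direct_text(p) for p in root.findall(f"{{{W}}}p")]

# Nested view: visit every w:p in document order, including those inside
# w:txbxContent.
all_levels = [direct_text(p) for p in root.iter(f"{{{W}}}p")]

print(top_level)
print(all_levels)
```

In this sketch the runs-only view never reaches the paragraph inside w:txbxContent, while the walk over all w:p elements does. A reader that treats a paragraph as a flat list of runs would drop the shape text in the same way.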

StephanMeijer commented 1 year ago

More info can be found in ECMA-376 Part 1

jgm commented 1 year ago

Closed by #9223 - I accidentally hit enter before finishing the description of the squashed commit.

StephanMeijer commented 1 year ago

@jgm many thanks for merging! I will probably publish some test cases and possible fixes to my code for VML-based images tomorrow, to make sure those are still supported in the context of Shape Format.