jgm / pandoc

Universal markup converter
https://pandoc.org

Extracting Horizontal rules from Word Documents #6285

Closed (closed by davidmerfield 4 months ago)

davidmerfield commented 4 years ago

Pandoc version 2.9.2.1, compiled with pandoc-types 1.20, texmath 0.12.0.1, skylighting 0.8.3.2, on macOS Sierra 10.12.6.

Hello! Thank you for a wonderful tool.

I've run into an issue converting Word documents (.docx) into Markdown.

I would love to be able to convert horizontal rules in Word documents. Pandoc is able to generate Word documents with horizontal rules from Markdown, but not vice versa.

Here's a short command to demonstrate the problem:

echo '---' | pandoc --from=markdown --to=docx | pandoc --from=docx --to=markdown

Output: (empty)

Desired output:

---

jgm commented 4 years ago

The xml pandoc produces for a horizontal rule is

<w:p><w:r><w:pict><v:rect style="width:0;height:1.5pt" o:hralign="center" o:hrstd="t" o:hr="t" /></w:pict></w:r></w:p>

I suppose the reader could be made to recognize this for round-trip purposes, but this wouldn't capture other ways of making horizontal rules in Word.
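As a rough illustration of what recognizing this markup could involve, here is a minimal Python sketch (not pandoc's Haskell reader code). It assumes the paragraph has already been parsed with ElementTree, and the function name `is_pandoc_rule` is hypothetical. It only checks for a `v:rect` carrying `o:hr="t"`, which is the round-trip case described above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for WordprocessingML, VML, and the Office VML extensions.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "v": "urn:schemas-microsoft-com:vml",
    "o": "urn:schemas-microsoft-com:office:office",
}

# Namespace declarations, added so the fragments below parse on their own.
XMLNS = " ".join('xmlns:%s="%s"' % (prefix, uri) for prefix, uri in NS.items())


def is_pandoc_rule(paragraph):
    """Return True if a <w:p> element holds a v:rect flagged with o:hr="t".

    Hypothetical helper for illustration; not part of pandoc.
    """
    hr_attr = "{%s}hr" % NS["o"]
    rects = paragraph.findall("./w:r/w:pict/v:rect", NS)
    return any(rect.get(hr_attr) == "t" for rect in rects)


# The paragraph quoted above, wrapped with namespace declarations.
RULE = ET.fromstring(
    "<w:p %s><w:r><w:pict>"
    '<v:rect style="width:0;height:1.5pt" o:hralign="center" o:hrstd="t" o:hr="t" />'
    "</w:pict></w:r></w:p>" % XMLNS
)
# An ordinary text paragraph, which should not be treated as a rule.
PLAIN = ET.fromstring("<w:p %s><w:r><w:t>text</w:t></w:r></w:p>" % XMLNS)

print(is_pandoc_rule(RULE), is_pandoc_rule(PLAIN))
```

As noted, a check like this would only cover rules written by pandoc itself, not the other ways of drawing a line in Word.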

davidmerfield commented 4 years ago

Do you know if there are any ways to make a horizontal rule in Word that are captured by Pandoc? I tried a few, and all of them seemed to be ignored by the reader.

jgm commented 4 years ago

Not currently.

fsoedjede commented 4 months ago

Hello @jgm,

I just encountered this issue and I have a suggestion.

From Markdown to Word

As you said,

The xml pandoc produces for a horizontal rule is

<w:p><w:r><w:pict><v:rect style="width:0;height:1.5pt" o:hralign="center" o:hrstd="t" o:hr="t" /></w:pict></w:r></w:p>

From Word

When creating the horizontal rule in Word (following: https://support.microsoft.com/en-us/office/insert-a-horizontal-line-9bf172f6-5908-4791-9bb9-2c952197b1a9), the XML produced is:

<w:p><w:pPr><w:pBdr><w:bottom w:val="single" w:sz="6" w:space="1" w:color="auto"/></w:pBdr></w:pPr></w:p>

single could be replaced by double, dotted, thinThickThinMediumGap, wave, etc. An exhaustive list can be found here: http://officeopenxml.com/WPborders.php. For Pandoc, single should be kept.

Based on that, it could be possible to convert to native while keeping the HorizontalRule.
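To make the idea concrete, here is a minimal Python sketch of such a check (pandoc's reader is written in Haskell, so this is only an illustration). It rests on two assumptions that are not settled in this thread: that only a paragraph with no text should count as a rule, and that any bottom border value other than `nil` or `none` qualifies. The function name `is_word_border_rule` is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
XMLNS = 'xmlns:w="%s"' % W


def is_word_border_rule(paragraph):
    """Return True for a text-free <w:p> whose properties set a bottom border.

    Hypothetical helper; the "no text" and "any visible border" rules
    are assumptions made for this sketch.
    """
    bottom = paragraph.find("./w:pPr/w:pBdr/w:bottom", NS)
    if bottom is None:
        return False
    # "nil" and "none" mean that no border is drawn.
    if bottom.get("{%s}val" % W) in (None, "nil", "none"):
        return False
    # A bordered paragraph that contains text is ordinary content, not a rule.
    has_text = any((t.text or "") for t in paragraph.iter("{%s}t" % W))
    return not has_text


# The paragraph Word produces, wrapped with a namespace declaration.
BORDER_RULE = ET.fromstring(
    "<w:p %s><w:pPr><w:pBdr>"
    '<w:bottom w:val="single" w:sz="6" w:space="1" w:color="auto"/>'
    "</w:pBdr></w:pPr></w:p>" % XMLNS
)
# Same border, but the paragraph also carries text.
BORDERED_TEXT = ET.fromstring(
    "<w:p %s><w:pPr><w:pBdr>"
    '<w:bottom w:val="single" w:sz="6" w:space="1" w:color="auto"/>'
    "</w:pBdr></w:pPr><w:r><w:t>Heading</w:t></w:r></w:p>" % XMLNS
)

print(is_word_border_rule(BORDER_RULE), is_word_border_rule(BORDERED_TEXT))
```

Restricting the check to w:val="single" would be a one-line change if only that style should map to HorizontalRule.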

WDYT?

jgm commented 4 months ago

I like the idea of using Word's standard method for pandoc's HorizontalRule output.

jgm commented 4 months ago

I tried this, and using Word's recommended method doesn't look as good. Because it's a bottom border, the horizontal line is closer to the following paragraph than the preceding paragraph, whereas with pandoc's method it is centered. Also, with Word's method, you can put the cursor on the horizontal rule and start typing paragraph content, which is odd. I think that if we do implement HorizontalRule in the docx reader, we should support both styles, but I still prefer pandoc's method for the writer.

jgm commented 4 months ago

This is what it looks like:

[Image: Screenshot 2024-06-01 at 10 05 29 AM]

fsoedjede commented 4 months ago

Thanks for the fast reply. The implemented solution is good for me. I will test the nightly build as soon as possible.