jgm / pandoc

Universal markup converter
https://pandoc.org

[docx->markdown] code blocks are not detected #5971

Open xuhcc opened 4 years ago

xuhcc commented 4 years ago

I use pandoc to convert docx documents to markdown. These documents contain code blocks, and I apply a filter to transform them properly. However, it seems that in newer pandoc versions, code blocks are no longer detected by the parser.

Sample document: document.docx

Command: pandoc document.docx -f docx -t markdown_strict -s -o document.md

Result with pandoc 2.7.2:

Text:

>     SELECT
>       account,
>       YEAR(date) AS year,
>       SUM(COST(position)) AS balance
>     WHERE
>        currency = 'USD'
>     ORDER BY 1,2;

Text text text.

Result with pandoc 2.8.1:

Text:

> SELECT  
> account,  
> YEAR(date) AS year,  
> SUM(COST(position)) AS balance  
> WHERE  
> currency = 'USD'  
> ORDER BY 1,2;

Text text text.

jgm commented 2 years ago

It seems that perhaps in the past we parsed paragraphs with style SourceCode as code blocks. But this stopped working. There's a comment in the docx reader that suggests that it should work:

- [X] CodeBlock (styled with `SourceCode`)

So something broke this. Need to look into it.

jgm commented 2 years ago

OK, it does actually work -- see test/docx/codeblocks.docx

The reason this test file works and the above document.docx does not is that the word/styles.xml parts of the two docx containers differ.

In codeblocks.docx (working) we have

<w:style w:type="paragraph" w:customStyle="1" w:styleId="SourceCode">
<w:name w:val="Source Code" />
<w:basedOn w:val="Normal" />
<w:link w:val="VerbatimChar" />
<w:pPr>
<w:wordWrap w:val="off" />
</w:pPr>
</w:style>

while in document.docx we have

<w:style w:type="paragraph" w:customStyle="1" w:styleId="SourceCode"><w:name w:val="SourceCode"/></w:style>

If we change the latter so that we have

<w:name w:val="Source Code"/>

(note the space) then it works.

Pandoc is looking for a style with the name Source Code, not a style with the id (or name) SourceCode.
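The rename described above can also be applied to an existing file without opening Word. The following is a minimal sketch using only the Python standard library; it assumes the style definition lives in the `word/styles.xml` part and is written exactly as `<w:name w:val="SourceCode"/>`, as in the sample document. The function names and the output file name are hypothetical.

```python
import re
import zipfile

STYLES_PART = "word/styles.xml"  # part of the docx container that holds style definitions


def rename_source_code_style(styles_xml: str) -> str:
    """Change the style *name* 'SourceCode' to 'Source Code'.

    Only <w:name .../> elements are touched; the w:styleId attribute
    keeps its value, so paragraphs that reference the style still resolve.
    """
    return re.sub(
        r'(<w:name\s+w:val=")SourceCode("\s*/>)',
        r"\1Source Code\2",
        styles_xml,
    )


def patch_docx(src: str, dst: str) -> None:
    """Copy the docx at `src` to `dst`, rewriting only the styles part."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == STYLES_PART:
                data = rename_source_code_style(data.decode("utf-8")).encode("utf-8")
            zout.writestr(item, data)


# Example call (file names are placeholders):
# patch_docx("document.docx", "document-fixed.docx")
```

Running the original command on the patched copy should then yield a code block, assuming the style name was the only difference.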

jgm commented 2 years ago

Not sure if this is really a bug, since pandoc does have a way of recognizing source code. But we could perhaps also react to SourceCode as the style name.

masters3d commented 2 years ago

It would be nice if all of the following were recognized: SourceCode, source_code, Source Code. This should also be documented somewhere.

xuhcc commented 2 years ago

Thanks for looking into it. I can confirm that the workaround works.

pedropaulofb commented 1 year ago

Sorry for the noob question, but how does pandoc recognize a code block in a docx document? Do I have to create a style in Word called "Source Code"?

I am trying to convert a docx document with code blocks to markdown, but the results always have backslashes (\) before the backtick (`) characters.

Thank you in advance

masters3d commented 1 year ago

Yes, correct.

pedropaulofb commented 1 year ago

Thanks @masters3d for the information. I have tried many approaches, but none has worked. Is there an example file with the correct style that I can download?

pedropaulofb commented 1 year ago

In addition, is there a way to directly indicate that the code is in a certain language (e.g., using {.python})?
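The thread does not answer this. One approach that may work is a pandoc JSON filter that adds a language class to code blocks after the docx reader has produced them. The sketch below uses only the Python standard library and assumes pandoc's JSON representation of a code block, `{"t": "CodeBlock", "c": [[id, classes, keyvals], text]}`. The function names, the script name, and the default language are hypothetical.

```python
import json
import sys


def add_class_to_code_blocks(node, language="python"):
    """Walk a pandoc JSON AST and tag every class-less CodeBlock with `language`.

    Code blocks that already carry a class are left unchanged.
    The tree is modified in place and also returned.
    """
    if isinstance(node, list):
        for child in node:
            add_class_to_code_blocks(child, language)
    elif isinstance(node, dict):
        if node.get("t") == "CodeBlock":
            (_identifier, classes, _keyvals), _text = node["c"]
            if not classes:
                classes.append(language)
        else:
            for value in node.values():
                add_class_to_code_blocks(value, language)
    return node


def main():
    """Filter entry point: read the AST from stdin, write the result to stdout."""
    doc = json.load(sys.stdin)
    json.dump(add_class_to_code_blocks(doc), sys.stdout)
```

To try it, save the block as a script (for example `add_lang.py`), add a call to `main()` as the last line, and pass it to pandoc with `--filter`. The class is only visible in output formats that support code block attributes, so `-t markdown` is a more likely fit than `-t markdown_strict`.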