Improve control over Pandoc reader extensions

hongyuanjia commented 7 months ago

Bug description

Pandoc's Markdown extension east_asian_line_breaks is very helpful when writing Markdown in which long paragraphs of CJK text are hard-wrapped. However, Quarto appears to ignore this extension, whereas R Markdown with md_extensions: "+east_asian_line_breaks" works fine.

Steps to reproduce

The test Quarto file:

---
format:
    markdown:
        from: "markdown+east_asian_line_breaks"
---

你
好

a
b

The test R Markdown file:

---
output:
    md_document:
        md_extensions: "+east_asian_line_breaks"
---
你
好

a
b

Expected behavior

你
好

The above should be rendered on a single line, without any space between 你 and 好.

The output from RMarkdown file:

你好

a b

Actual behavior

The output from the Quarto file:

---
toc-title: Table of contents
---

你 好

a b

Your environment

IDE: Neovim
OS: MacOS

However, I have confirmed that the behaviour is the same on all platforms, including Windows, Linux and macOS.

Quarto check output

❯ quarto check
Quarto 1.4.549
[✓] Checking versions of quarto binary dependencies...
      Pandoc version 3.1.11: OK
      Dart Sass version 1.69.5: OK
      Deno version 1.37.2: OK
[✓] Checking versions of quarto dependencies......OK
[✓] Checking Quarto installation......OK
      Version: 1.4.549
      Path: /Applications/quarto/bin

[✓] Checking tools....................OK
      TinyTeX: (external install)
      Chromium: (not installed)

[✓] Checking LaTeX....................OK
      Using: TinyTex
      Path: /Users/hongyuanjia/Library/TinyTeX/bin/universal-darwin
      Version: 2023

[✓] Checking basic markdown render....OK

[✓] Checking Python 3 installation....OK
      Version: 3.9.6
      Path: /Library/Developer/CommandLineTools/usr/bin/python3
      Jupyter: (None)

      Jupyter is not available in this Python installation.
      Install with python3 -m pip install jupyter

[✓] Checking R installation...........OK
      Version: 4.3.2
      Path: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources
      LibPaths:
        - /Users/hongyuanjia/Library/R/arm64/4.3/library
        - /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
      knitr: 1.45
      rmarkdown: 2.25

[✓] Checking Knitr engine render......OK

cderv commented 7 months ago

Quarto now has its own reader. At first I thought we were not picking up extensions correctly, but I can confirm that we do see user-provided extensions and do take them into account.

Simple test:

Markdown:

````markdown
---
format:
  html:
    from: markdown+emoji
---

:heart:
````

HTML:

![image](https://github.com/quarto-dev/quarto-cli/assets/6791940/c4977f0a-690f-4452-849e-25b67b6c51d8)

We take those extensions into account in Lua when calling pandoc.read(), passing the format as a table that contains the format string and the extensions table.

I am wondering whether east_asian_line_breaks is a special extension that does not behave like the others. Pandoc itself produces a different native representation when the extension is enabled:

> quarto pandoc -f markdown+east_asian_line_breaks -t native index.qmd
[ Para [ Str "\20320" , Str "\22909" ]
, Para [ Str "a" , SoftBreak , Str "b" ]
]
> quarto pandoc -f markdown -t native index.qmd
[ Para [ Str "\20320" , SoftBreak , Str "\22909" ]
, Para [ Str "a" , SoftBreak , Str "b" ]
]

So maybe there are other conditions for it to work, or Pandoc requires something specific in order to apply it.

I'll try to look in the Pandoc code base.

cc @cscheid if you have an idea.

cderv commented 7 months ago

On Pandoc's side, the East Asian line break filter is applied at https://github.com/jgm/pandoc/blob/9ab0ffb99e5474b1b3af05902051a8f8baea1167/src/Text/Pandoc/App.hs#L247-L253

This happens as part of reading the input: https://github.com/jgm/pandoc/blob/9ab0ffb99e5474b1b3af05902051a8f8baea1167/src/Text/Pandoc/App.hs#L296-L300

I am starting to wonder whether this is compatible with the custom reader that we use. 🤔

cderv commented 7 months ago

I think I understand the issue now.

For a custom reader written in Lua, extensions are expected to be passed as in a format string, appended after the file name, as detailed in https://pandoc.org/custom-readers.html#format-extensions

The users control extensions as usual, e.g., pandoc -f my-reader.lua+citations. The extensions are accessible through the reader options’ extensions field

So, tweaking our internal defaults file to have

format: <path>/to/qmd-reader.lua+east_asian_line_breaks

Git diff of the tested change

````diff
diff --git a/src/command/render/pandoc.ts b/src/command/render/pandoc.ts
index 0f5721945..3a4b40f8b 100644
--- a/src/command/render/pandoc.ts
+++ b/src/command/render/pandoc.ts
@@ -848,10 +848,12 @@ export async function runPandoc(
   }
 
   // set up the custom .qmd reader
+  let extensions = "";
   if (allDefaults.from) {
     formatFilterParams["user-defined-from"] = allDefaults.from;
+    extensions = parseFormatString(allDefaults.from).variants.join();
   }
-  allDefaults.from = resourcePath("filters/qmd-reader.lua");
+  allDefaults.from = resourcePath("filters/qmd-reader.lua") + extensions;
````

fixes the issue, as I get the correct output for the same input file.

The east_asian_line_breaks will be added to opts.extensions in the Reader()
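
As a rough illustration of the idea in the diff above, the sketch below splits a user-supplied `from` string into its base format and its extension modifiers, then appends the modifiers to the reader path. `splitFormatString` and `readerWithExtensions` are hypothetical helpers written for this example; they are not Quarto's `parseFormatString`, and the sketch assumes the base format name contains no `+` or `-`.

```typescript
// Hypothetical helper (NOT Quarto's parseFormatString): split a Pandoc format
// string such as "markdown+emoji-smart" into base format and modifiers.
// Assumption: the base format name itself contains no "+" or "-".
function splitFormatString(
  from: string,
): { base: string; extensions: string[] } {
  const firstModifier = from.search(/[+-]/);
  if (firstModifier === -1) {
    return { base: from, extensions: [] };
  }
  const base = from.slice(0, firstModifier);
  // Each modifier keeps its leading "+" or "-".
  const extensions: string[] =
    from.slice(firstModifier).match(/[+-][^+-]+/g) ?? [];
  return { base, extensions };
}

// Hypothetical helper: append the user's modifiers to the custom reader path,
// e.g. "<path>/to/qmd-reader.lua+east_asian_line_breaks".
function readerWithExtensions(readerPath: string, userFrom?: string): string {
  if (!userFrom) {
    return readerPath;
  }
  const { extensions } = splitFormatString(userFrom);
  // Join with an empty separator so the modifiers stay contiguous.
  return readerPath + extensions.join("");
}

console.log(
  readerWithExtensions(
    "filters/qmd-reader.lua",
    "markdown+east_asian_line_breaks",
  ),
);
```

Note that `Array.prototype.join()` called without an argument separates elements with commas, which is why the sketch passes an empty string explicitly.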

@cscheid we may need to revisit how user-defined extensions are handled. It seems that not all of them can be handled in pandoc.read() alone, unless there are limitations that will prevent support for some of the extensions.

Hope it helps!

TomBener commented 5 months ago

I wrote a Lua filter to emulate Pandoc’s extension east_asian_line_breaks in Quarto:

-- Ignore soft break adjacent to Chinese characters
-- Reference: https://taoshu.in/unix/markdown-soft-break.html

function is_ascii(char)
  -- Treat a missing or empty string as non-ASCII instead of raising an error
  if char == nil or char == "" then return false end
  -- string.byte returns the value of the first byte; ASCII bytes are 0 to 127
  return string.byte(char) <= 127
end

return {
  {
    Para = function(para)
      local cs = para.content
      for k, v in ipairs(cs) do
        if v.t == 'SoftBreak' and cs[k - 1] and cs[k + 1] then
          local p = cs[k - 1].text
          local n = cs[k + 1].text
          -- Remove SoftBreak if at least one adjacent character is non-ASCII
          if p and n and (not is_ascii(p:sub(-1)) or not is_ascii(n:sub(1, 1))) then
            para.content[k] = pandoc.Str("")
          end
        end
      end
      return para
    end,
  }
}

It is quite rough, as it assumes that every non-ASCII character in your document is CJK. Since my documents contain only English and Chinese, the script works well for me. If you encounter issues, please leave a comment.
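
One way to make the heuristic less aggressive is to drop a soft break only when the characters on both sides are CJK, so that text in other non-ASCII scripts keeps its spacing. The sketch below models that rule in TypeScript on a simplified stand-in for Pandoc inlines; an actual Quarto filter would be written in Lua. The `Inline` type, the function names, and the list of code point ranges are assumptions made for this example, and the ranges are deliberately incomplete.

```typescript
// Simplified stand-in for Pandoc inlines; only the kinds needed here.
type Inline =
  | { t: "Str"; text: string }
  | { t: "Space" }
  | { t: "SoftBreak" };

// Illustrative, incomplete list of CJK ranges in the Basic Multilingual Plane
// (an assumption of this sketch, not the table Pandoc uses).
const CJK_RANGES: Array<[number, number]> = [
  [0x3000, 0x303f], // CJK symbols and punctuation
  [0x3040, 0x30ff], // Hiragana and Katakana
  [0x3400, 0x4dbf], // CJK Unified Ideographs Extension A
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0xac00, 0xd7af], // Hangul syllables
  [0xff01, 0xff60], // Fullwidth forms
];

// True if the UTF-16 code unit falls in one of the ranges above.
// NaN (from charCodeAt on an empty string) and surrogates yield false.
function isCjkCodeUnit(code: number): boolean {
  return CJK_RANGES.some(([lo, hi]) => code >= lo && code <= hi);
}

// Remove a SoftBreak only when it sits between two Str elements whose
// touching characters are both CJK; everything else is kept unchanged.
function dropCjkSoftBreaks(inlines: Inline[]): Inline[] {
  return inlines.filter((el, i) => {
    if (el.t !== "SoftBreak") return true;
    const prev = inlines[i - 1];
    const next = inlines[i + 1];
    if (!prev || !next || prev.t !== "Str" || next.t !== "Str") return true;
    const before = prev.text.charCodeAt(prev.text.length - 1);
    const after = next.text.charCodeAt(0);
    return !(isCjkCodeUnit(before) && isCjkCodeUnit(after));
  });
}

const example: Inline[] = [
  { t: "Str", text: "你" },
  { t: "SoftBreak" },
  { t: "Str", text: "好" },
];
console.log(dropCjkSoftBreaks(example).map((el) => el.t).join(" "));
```

With this rule, a break between 你 and 好 is removed, while a break between a and b, or between 你 and b, is kept.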
