Open hongyuanjia opened 7 months ago
We have a specific reader now in Quarto, and at first I thought we were not handling extensions correctly, but I can confirm that we do see user-provided extensions and take them into account.
Simple test:
Markdown:

````markdown
---
format:
  html:
    from: markdown+emoji
---

:heart:
````

HTML:

![image](https://github.com/quarto-dev/quarto-cli/assets/6791940/c4977f0a-690f-4452-849e-25b67b6c51d8)

We do take those extensions into account in Lua when calling `pandoc.read()` with the format given as a table containing the format string and an extensions table.
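
For illustration, a call of that kind might look like the sketch below. It is meant to run inside Pandoc's Lua environment (for example in a filter or with `pandoc lua`), not in a plain Lua interpreter. The field names `format` and `extensions` follow the description above; writing the extensions as a name-to-boolean map is an assumption about the accepted shape, and this is not Quarto's actual code.

```lua
-- Hypothetical sketch: pass the format as a table instead of a string.
-- Assumption: `extensions` may be given as a map from extension name to boolean.
local spec = {
  format = "markdown",
  extensions = { emoji = true },
}

-- Parse a small Markdown snippet with the `emoji` extension enabled.
local doc = pandoc.read(":heart:", spec)

-- Print the native representation to inspect how the input was parsed.
print(pandoc.write(doc, "native"))
```
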
I am wondering whether `east_asian_line_breaks` is a special extension that does not behave like the others. Compare Pandoc's native representation when parsing with and without it:

    > quarto pandoc -f markdown+east_asian_line_breaks -t native index.qmd
    [ Para [ Str "\20320" , Str "\22909" ]
    , Para [ Str "a" , SoftBreak , Str "b" ]
    ]

    > quarto pandoc -f markdown -t native index.qmd
    [ Para [ Str "\20320" , SoftBreak , Str "\22909" ]
    , Para [ Str "a" , SoftBreak , Str "b" ]
    ]

So maybe there are other conditions for it to work, or something specific is required for Pandoc to apply it.
I'll try to look into the Pandoc code base.
cc @cscheid if you have an idea.
On Pandoc's side, the East Asian line break filter is defined at https://github.com/jgm/pandoc/blob/9ab0ffb99e5474b1b3af05902051a8f8baea1167/src/Text/Pandoc/App.hs#L247-L253
It is applied while reading the input: https://github.com/jgm/pandoc/blob/9ab0ffb99e5474b1b3af05902051a8f8baea1167/src/Text/Pandoc/App.hs#L296-L300
I am starting to wonder whether this is compatible with the custom reader that we leverage. 🤔
I think I understand the issue now.
For a custom reader in Lua, extensions are expected to be passed as in a format string, after the file name, as detailed in https://pandoc.org/custom-readers.html#format-extensions

> The users control extensions as usual, e.g., `pandoc -f my-reader.lua+citations`. The extensions are accessible through the reader options' `extensions` field

So internally, tweaking our internal defaults file to have `format: <path>/to/qmd-reader.lua+east_asian_line_breaks` fixes the issue: I get the correct output for the same input file.
`east_asian_line_breaks` is then added to `opts.extensions` in the `Reader()` function.
@cscheid we may need to revisit how user-defined extensions are handled. It seems that not all of them can be handled in `pandoc.read()` alone, unless there are limits that prevent support for some of the extensions.
Hope it helps!
I wrote a Lua filter to emulate Pandoc's `east_asian_line_breaks` extension in Quarto:

    -- Ignore soft breaks adjacent to Chinese characters
    -- Reference: https://taoshu.in/unix/markdown-soft-break.html

    -- Check whether the first byte of `char` is an ASCII byte (0-127).
    -- Returns false for nil or an empty string. A byte-wise check is enough here
    -- because every byte of a multi-byte UTF-8 character is >= 128.
    local function is_ascii(char)
      if char == nil or char == "" then return false end
      return string.byte(char) <= 127
    end

    return {
      {
        Para = function(para)
          local cs = para.content
          for k, v in ipairs(cs) do
            if v.t == 'SoftBreak' and cs[k - 1] and cs[k + 1] then
              -- Neighbours without a `text` field (anything but Str-like
              -- elements) yield nil here and are skipped below
              local p = cs[k - 1].text
              local n = cs[k + 1].text
              -- Remove SoftBreak if at least one adjacent character is non-ASCII
              if p and n and (not is_ascii(p:sub(-1)) or not is_ascii(n:sub(1, 1))) then
                para.content[k] = pandoc.Str("")
              end
            end
          end
          return para
        end,
      }
    }

It is really rough, as it assumes that any non-ASCII character in your document is a CJK (East Asian) character. As the documents I write contain only English and Chinese, this script works well for me. If you encounter issues, please leave a comment.
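
One way to narrow the non-ASCII assumption is to decode the neighbouring characters and check them against a few CJK code point ranges. The sketch below is only an illustration: the helper names (`WIDE_RANGES`, `is_wide`, `should_drop_break`) are hypothetical, the ranges are an incomplete subset rather than Pandoc's own width table, and it drops a break only when both neighbours are wide, which is stricter than the "at least one non-ASCII neighbour" rule used in the filter. It relies on the `utf8` library from Lua 5.3 or later and assumes valid UTF-8 input.

```lua
-- Hypothetical, incomplete list of code point ranges treated as "wide".
local WIDE_RANGES = {
  { 0x3000, 0x303F },  -- CJK symbols and punctuation
  { 0x3040, 0x30FF },  -- Hiragana and Katakana
  { 0x3400, 0x4DBF },  -- CJK Unified Ideographs Extension A
  { 0x4E00, 0x9FFF },  -- CJK Unified Ideographs
  { 0xFF01, 0xFF60 },  -- Fullwidth forms
}

-- True if the code point falls inside one of the ranges above.
local function is_wide(cp)
  for _, r in ipairs(WIDE_RANGES) do
    if cp >= r[1] and cp <= r[2] then return true end
  end
  return false
end

-- Code point of the first UTF-8 character of s, or nil if s is nil or empty.
local function first_codepoint(s)
  if s == nil or s == "" then return nil end
  return utf8.codepoint(s, 1)
end

-- Code point of the last UTF-8 character of s, or nil if s is nil or empty.
local function last_codepoint(s)
  if s == nil or s == "" then return nil end
  return utf8.codepoint(s, utf8.offset(s, -1))
end

-- True if a soft break between the texts `prev` and `next` should be dropped,
-- i.e. when the characters on both sides of the break are wide.
local function should_drop_break(prev, next)
  local p, n = last_codepoint(prev), first_codepoint(next)
  return p ~= nil and n ~= nil and is_wide(p) and is_wide(n)
end
```

In the filter, the condition on the two neighbouring texts could then be replaced by a call such as `should_drop_break(p, n)`, so that accented Latin text like `café` no longer causes a soft break to be removed.
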
Bug description
The Pandoc Markdown extension `east_asian_line_breaks` is quite helpful for writing Markdown in which long paragraphs of CJK characters are wrapped across several lines. However, it seems that Quarto just ignores this extension, while R Markdown with `md_extensions: "+east_asian_line_breaks"` works fine.
Steps to reproduce
The test Quarto file:
The test R Markdown file:
Expected behavior
The above should be put on a single line without any space between `你` and `好`.
The output from the R Markdown file:
Actual behavior
The output from the Quarto file:
Your environment
But I confirmed this behaviour is the same across all platforms, including Windows, Linux and macOS.
Quarto check output