Open TomBener opened 5 months ago
Do you mean disable them globally or in a fine-grained way (e.g., don't put a font hint inside this specially marked span)?
Disabling East Asian font hints globally would be fine, as in previous versions (e.g., Pandoc 3.2).
I'm confused, because you requested this feature in the first place, but when I implemented it you immediately asked for a way to disable it. Is it actually a useful feature?
I understand your confusion. This is indeed an annoying case, especially for non-CJK users.
Do you mean disable them globally or in a fine-grained way (e.g., don't put a font hint inside this specially marked span)?
Specifying the language manually would be feasible, but it is hard to do so for bibliographies.
Would it make sense to add the font hints only when the specified language (e.g., metadata lang, perhaps overridable at the Div or Span level) is a CJK language?
Sorry, I don't think this is a good idea, as lang may affect other settings in unexpected ways. For example, when setting lang: zh-CN, CSL will use localization that I don't want, as I have reported in an old issue.
So, I think the current implementation of adding East Asian font hints is good and needs no change. Perhaps I could write a Lua filter to remove them when writing English articles, if necessary.
I tried to write a Lua filter as follows:
function traverse(elem)
  if elem.t == "RawBlock" or elem.t == "RawInline" then
    if elem.format == "openxml" then
      elem.text = elem.text:gsub('<w:rFonts w:hint="eastAsia" />', '')
    end
  end
  return elem
end

return {
  { RawBlock = traverse },
  { RawInline = traverse }
}
But it didn't work. Could you please help to diagnose it or give some guidance?
A Lua filter can't remove these because they are added in the writer. Lua filters only affect the AST (which is the input to the writer).
Thanks for your guidance. Are there any alternative ways?
Nothing will work but postprocessing the docx. (It wouldn't be that hard to find and remove the offending elements from the context in the container.)
Again, I'm open to providing this flexibility in pandoc, but I need to figure out what the best way to do it would be.
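The post-processing step mentioned above could be sketched roughly as follows, outside of pandoc. This is a minimal illustration, not pandoc functionality: the function name strip_east_asia_hints is hypothetical, and the sketch assumes the hint is serialized exactly as `<w:rFonts w:hint="eastAsia" />` inside word/document.xml, as quoted in this thread. All other archive entries are copied through unchanged.

```python
import io
import re
import zipfile

# Assumption: the hint is serialized exactly as quoted in this thread.
HINT = re.compile(r'<w:rFonts w:hint="eastAsia"\s*/>')
# Run-property wrappers that are left empty once the hint is gone.
EMPTY_RPR = re.compile(r"<w:rPr>\s*</w:rPr>")


def strip_east_asia_hints(docx_bytes: bytes) -> bytes:
    """Return a copy of the docx with eastAsia font hints removed.

    Only word/document.xml is modified; every other entry is copied as-is.
    """
    src = zipfile.ZipFile(io.BytesIO(docx_bytes))
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "word/document.xml":
                xml = data.decode("utf-8")
                xml = HINT.sub("", xml)       # drop the hint elements
                xml = EMPTY_RPR.sub("", xml)  # drop wrappers that became empty
                data = xml.encode("utf-8")
            dst.writestr(info.filename, data)
    return out.getvalue()
```

To apply it to a file, read the docx in binary mode, pass the bytes to the function, and write the returned bytes to a new file. Because plain string substitution is used, anything else in document.xml (including w:sectPr) is left untouched.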
Sorry, I don't think this is a good idea, as lang may affect other settings in unexpected ways. For example, when setting lang: zh-CN, CSL will use localization that I don't want, as I have reported in an old issue (https://github.com/jgm/pandoc/issues/7022#issuecomment-1238093008).
You needn't set the document-wide lang. We could have the feature be sensitive to a lang on a div, for example. So you could put Chinese content inside
::: {lang=zh}
...
:::
and the Word writer could be trained to add the font hints inside that context (unless overridden by an interior span or div with lang=en).
Thanks. I believe the step for post-processing the docx is feasible.
Regarding the language attribute, I think there is no need to change the current implementation, as East Asian text should always be enclosed with eastAsia font hints, no matter what the document language is. The need behind my request here is an unusual one.
Quick suggestion for post-processing: a binary custom Lua writer, i.e., a custom writer that defines a ByteStringWriter function instead of a Writer function, can be used to do the post-processing in pandoc itself. The pandoc.zip module can be used to unpack and re-pack the output of pandoc.write, and the file entries of the archive can be modified via normal string processing.
the East Asian Languages should always be enclosed with eastAsia font hints, no matter what the document language is.
The difficulty is determining whether quotation marks surrounding a Chinese phrase should themselves be considered East Asian or not. As you've noted, that depends on the context. Hence my suggestion to make this sensitive to language tagging.
Quick suggestion for post-processing: a binary custom Lua writer, i.e., a custom writer that defines a ByteStringWriter function instead of a Writer function, can be used to do the post-processing in pandoc itself. The pandoc.zip module can be used to unpack and re-pack the output of pandoc.write, and the file entries of the archive can be modified via normal string processing.
Sounds like a promising method, but I don't fully understand it. Could you please provide more details? For example, I'd like to remove <w:rPr><w:rFonts w:hint="eastAsia" /></w:rPr> from document.xml in the unzipped docx. How would I use this method for that? Thanks!
The difficulty is determining whether quotation marks surrounding a Chinese phrase should themselves be considered East Asian or not. As you've noted, that depends on the context. Hence my suggestion to make this sensitive to language tagging.
All these issues stem from the fact that Simplified Chinese and English use the same quotation marks (Traditional Chinese does not). I think Pandoc doesn't need to try to handle this tricky issue further.
A Japanese designer has submitted a proposal to add standardized variation sequences for four quotation marks. I hope it can be adopted as soon as possible:
This document is a proposal for adding eight standardized variation sequences (SVSes) for the following four quotation marks that use VS1 (aka U+FE00) and VS2 (aka U+FE01) to distinguish between the forms whose usage varies according to well-established Western versus East Asian conventions:
U+2018 ‘ LEFT SINGLE QUOTATION MARK
U+2019 ’ RIGHT SINGLE QUOTATION MARK
U+201C “ LEFT DOUBLE QUOTATION MARK
U+201D ” RIGHT DOUBLE QUOTATION MARK
Could you please provide more details?
Sure, here we go:
--- file: docx-no-eahints.lua
-- Copyright: © 2024 Albert Krewinkel
-- License: MIT
local mediabag = require 'pandoc.mediabag'
local path = require 'pandoc.path'
local zip = require 'pandoc.zip'

function ByteStringWriter(doc, opts)
  -- Produce the regular docx output as a binary string.
  local docx = pandoc.write(mediabag.fill(doc), 'docx', opts)
  local archive = zip.Archive(docx)
  for i, entry in ipairs(archive.entries) do
    if path.filename(entry.path) == 'document.xml' then
      -- Strip run properties that contain only the East Asian font hint.
      local pattern = '<w:rPr><w:rFonts w:hint="eastAsia" /></w:rPr>'
      local newcontent = entry:contents():gsub(pattern, '')
      archive.entries[i] = zip.Entry(entry.path, newcontent)
    end
  end
  return archive:bytestring()
end
Use with
pandoc --to=docx-no-eahints.lua -o my-outfile.docx …
It's not really well-tested, but should work. Or, at the very least, should give a better idea of what I meant, and how this could work.
Thanks @tarleb, it works. However, I ran into an issue: the page size changed from A4 to US Letter after applying the custom Lua writer. The following XML in document.xml, which comes from the reference docx passed via --reference-doc, was removed:
<w:sectPr w:rsidR="00D3414C">
<w:pgSz w:h="16840" w:w="11900" />
<w:pgMar w:bottom="1440" w:footer="720" w:gutter="0" w:header="720" w:left="1440" w:right="1440" w:top="1440" />
<w:cols w:space="720" />
<w:docGrid w:linePitch="360" />
</w:sectPr>
BTW, is it possible to use this custom Lua writer with Quarto?
Thanks @tarleb, it works. However, I ran into an issue: the page size changed from A4 to US Letter after applying the custom Lua writer. The following XML in document.xml, which comes from the reference docx passed via --reference-doc, was removed:
Using --reference-doc with the custom writer should still be possible.
BTW, is it possible to use this custom Lua writer with Quarto?
I don't know, sorry.
I've uploaded a folder with files for testing: lua-custom-writer-test.zip
With the same input file test.md and custom.docx as the reference doc, running the command:
pandoc test.md -o test.docx --reference-doc custom.docx
generated test.docx with the A4 page size. However, running the command:
pandoc test.md -o test.docx --reference-doc custom.docx -t docx-no-eahints.lua
generated test.docx with the US Letter page size. By unzipping test.docx, I confirmed that the latter conversion removed the East Asian font hints, but it also unexpectedly removed the following XML, which defines the page size:
<w:sectPr w:rsidR="005F2E0E" w:rsidSect="002F2276">
<w:pgSz w:h="16840" w:w="11900" />
<w:pgMar w:bottom="1440" w:footer="720" w:gutter="0" w:header="720" w:left="1440" w:right="1440" w:top="1440" />
<w:cols w:space="720" />
<w:docGrid w:linePitch="326" />
</w:sectPr>
This behavior seems weird, and I have no idea what the problem is. Could you please help diagnose the issue, @tarleb?
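For anyone reproducing the comparison above without unzipping by hand, a small check along these lines might help. It is only a sketch: page_size_inches is a hypothetical helper, it looks at the first w:pgSz element in word/document.xml, and it relies on w:pgSz values being given in twentieths of a point (1440 per inch). Under that convention, the 11900 by 16840 values shown above come to roughly 8.26 by 11.69 inches (A4), whereas US Letter would be 12240 by 15840.

```python
import io
import re
import zipfile

TWIPS_PER_INCH = 1440  # w:pgSz uses twentieths of a point: 72 * 20 per inch


def page_size_inches(docx_bytes: bytes):
    """Return (width, height) in inches from the first w:pgSz, or None if absent."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    tag = re.search(r"<w:pgSz\b[^>]*>", xml)
    if tag is None:
        return None  # no explicit page size in document.xml
    width = re.search(r'w:w="(\d+)"', tag.group(0))
    height = re.search(r'w:h="(\d+)"', tag.group(0))
    if width is None or height is None:
        return None
    return (int(width.group(1)) / TWIPS_PER_INCH,
            int(height.group(1)) / TWIPS_PER_INCH)
```

Running it on the output of both commands shows whether the w:pgSz element survived: a None result for the custom-writer output would match the missing w:sectPr block reported here.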
Weird. I currently don't have time to debug this, but it would be nice to get to the bottom of this. Does the reference doc get applied at all?
Never mind, it's not urgent. The reference doc was applied in both conversions. You can see them in the folder above.
@tarleb Can you kindly help to debug the page size issue above?
New issue from #9817.
In my field, we tend to cite Chinese sources in articles, but they make up a relatively small part of the entire document. English journals therefore expect the typesetting to follow English rather than Chinese conventions, particularly for quotation marks. In this context, could Pandoc provide an option to disable East Asian font hints?