jgm / pandoc

Universal markup converter
https://pandoc.org

Support to disable East Asian font hints in docx output #9910

Open TomBener opened 1 week ago

TomBener commented 1 week ago

New issue from #9817.

In my field, we often cite Chinese sources in articles, but they make up only a small part of the document. English journals therefore expect the typesetting to follow English rather than Chinese conventions, particularly for quotation marks. In this context, could Pandoc provide an option to disable East Asian font hints?

jgm commented 1 week ago

Do you mean disable them globally or in a fine-grained way (e.g. don't put a font hint inside this specially marked span) ?

TomBener commented 1 week ago

Disabling East Asian font hints globally would be fine, as in previous versions (e.g., Pandoc 3.2).

jgm commented 1 week ago

I'm confused, because you requested this feature in the first place, but when I implemented it you immediately asked for a way to disable it. Is it actually a useful feature?

TomBener commented 1 week ago

I understand your confusion. This is indeed an annoying case, especially for non-CJK users.

  1. When writing an article primarily in Chinese (or Japanese, Korean), there will almost always be some ASCII characters, so I want the East Asian characters to be enclosed with specific font attributes, as currently implemented.
  2. When writing an article primarily in English that includes a few CJK characters, I don’t want East Asian font hints around the CJK text, so that punctuation (such as quotation marks) stays consistent throughout the document.

Do you mean disable them globally or in a fine-grained way (e.g. don't put a font hint inside this specially marked span) ?

Specifying the language manually would be feasible, but it is hard to do so for bibliographies.

jgm commented 1 week ago

Would it make sense to add the font hints only when the specified language (e.g., metadata lang, perhaps overridable at the Div or Span level) is a CJK language?

TomBener commented 1 week ago

Would it make sense to add the font hints only when the specified language (e.g., metadata lang, perhaps overridable at the Div or Span level) is a CJK language?

Sorry, I don't think this is a good idea, as lang may affect other settings in ways that are unexpected in some cases. For example, when setting lang: zh-CN, CSL will use localization that I don't want, as I have reported in an old issue.

So, I think the current implementation of adding East Asian font hints is good and needs no change. Perhaps I could write a Lua filter to remove them when writing English articles if necessary.

TomBener commented 1 week ago

I tried to write a Lua filter as follows:

function traverse(elem)
    if elem.t == "RawBlock" or elem.t == "RawInline" then
        if elem.format == "openxml" then
            elem.text = elem.text:gsub('<w:rFonts w:hint="eastAsia" />', '')
        end
    end

    return elem
end

return {
    { RawBlock = traverse },
    { RawInline = traverse }
}

But it didn't work. Could you please help to diagnose it or give some guidance?

jgm commented 6 days ago

A lua filter can't remove these because they are added in the writer. Lua filters only affect the AST (which is the input to the writer).

TomBener commented 6 days ago

Thanks for your guidance. Are there any alternative ways?

jgm commented 6 days ago

Nothing will work but postprocessing the docx. (It wouldn't be that hard to find and remove the offending elements from the context in the container.)

Again, I'm open to providing this flexibility in pandoc, but I need to figure out what the best way to do it would be.

jgm commented 6 days ago

Sorry, I don't think this is a good idea, as lang may affect other settings in ways that are unexpected in some cases. For example, when setting lang: zh-CN, CSL will use localization that I don't want, as I have reported in an old issue (https://github.com/jgm/pandoc/issues/7022#issuecomment-1238093008).

You needn't set the document-wide lang. We could have the feature be sensitive to a lang on a div, for example. So you could put Chinese content inside

::: {lang=zh}
...
:::

and the Word writer could be trained to add the font hints inside that context (unless overridden by an interior span or div with lang=en).

TomBener commented 6 days ago

Thanks. I believe the step for post-processing the docx is feasible.

Regarding the language attribute, I think there is no need to change the current implementation, as East Asian text should always be enclosed with eastAsia font hints, no matter what the document language is. The need I describe here is unusual.

tarleb commented 6 days ago

Quick suggestion for post-processing: A binary custom Lua writer, i.e., a custom writer that defines a ByteStringWriter function instead of a Writer function, can be used to do the post-processing in pandoc itself. The pandoc.zip module can be used to unpack and re-pack the output of pandoc.write, and the file entries of the archive can be modified via normal string processing.

jgm commented 5 days ago

the East Asian Languages should always be enclosed with eastAsia font hints, no matter what the document language is.

The difficulty is determining whether quotation marks surrounding a Chinese phrase should themselves be considered East Asian or not. As you've noted, that depends on the context. Hence my suggestion to make this sensitive to language tagging.

TomBener commented 5 days ago

Quick suggestion for post-processing: A binary custom Lua writer, i.e., a custom writer that defines a ByteStringWriter function instead of a Writer function, can be used to do the post-processing in pandoc itself. The pandoc.zip module can be used to unpack and re-pack the output of pandoc.write, and the file entries of the archive can be modified via normal string processing.

That sounds like a promising method, but I don't fully understand it. Could you please provide more details? For example, if I'd like to remove <w:rPr><w:rFonts w:hint="eastAsia" /></w:rPr> from document.xml in the unzipped docx, how would I use this method? Thanks!

TomBener commented 5 days ago

The difficulty is determining whether quotation marks surrounding a Chinese phrase should themselves be considered East Asian or not. As you've noted, that depends on the context. Hence my suggestion to make this sensitive to language tagging.

All of these issues arise because Simplified Chinese and English use the same quotation marks (Traditional Chinese does not). I think Pandoc doesn't need to try to handle this tricky issue any further.

A Japanese designer has submitted a proposal to add standardized variation sequences for four quotation marks. I hope it can be adopted as soon as possible:

This document is a proposal for adding eight standardized variation sequences (SVSes) for the following four quotation marks that use VS1 (aka U+FE00) and VS2 (aka U+FE01) to distinguish between the forms whose usage varies according to well-established Western versus East Asian conventions:

U+2018 ‘ LEFT SINGLE QUOTATION MARK
U+2019 ’ RIGHT SINGLE QUOTATION MARK
U+201C “ LEFT DOUBLE QUOTATION MARK
U+201D ” RIGHT DOUBLE QUOTATION MARK

tarleb commented 3 days ago

Could you please provide more details?

Sure, here we go:

--- file: docx-no-eahints.lua
-- Copyright: © 2024 Albert Krewinkel
-- License: MIT

local mediabag = require 'pandoc.mediabag'
local path = require 'pandoc.path'
local zip = require 'pandoc.zip'

function ByteStringWriter(doc, opts)
  local docx = pandoc.write(mediabag.fill(doc), 'docx', opts)
  local archive = zip.Archive(docx)
  for i, entry in ipairs(archive.entries) do
    if path.filename(entry.path) == 'document.xml' then
      local pattern = '<w:rPr><w:rFonts w:hint="eastAsia" /></w:rPr>'
      local newcontent = entry:contents():gsub(pattern, '')
      archive.entries[i] = zip.Entry(entry.path, newcontent)
    end
  end
  return archive:bytestring()
end

Use with

pandoc --to=docx-no-eahints.lua -o my-outfile.docx …

It's not really well-tested, but should work. Or, at the very least, should give a better idea of what I meant, and how this could work.
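The exact-string pattern above only matches when the hint is the sole run property and is serialized with exactly that spacing. As a sketch of a slightly more tolerant variant, the substitution could be split into two steps. The helper name strip_east_asian_hints is hypothetical, and the sketch assumes each hint is written as a standalone <w:rFonts w:hint="eastAsia" /> element with no other attributes:

```lua
-- Hypothetical helper, not part of pandoc: removes East Asian font hints
-- from a document.xml string. Assumes each hint is a standalone
-- <w:rFonts w:hint="eastAsia" /> element with no other attributes.
function strip_east_asian_hints(xml)
  -- Step 1: drop the hint element; %s* tolerates optional whitespace
  -- before the self-closing slash.
  local stripped, count = xml:gsub('<w:rFonts w:hint="eastAsia"%s*/>', '')
  -- Step 2: drop run-property wrappers that are now empty.
  stripped = stripped:gsub('<w:rPr>%s*</w:rPr>', '')
  -- Return the cleaned XML and the number of hints removed.
  return stripped, count
end
```

Inside ByteStringWriter, this could stand in for the gsub call, e.g. by assigning only the first result with local newcontent = strip_east_asian_hints(entry:contents()). The second return value is the number of hints removed, which may help when checking that the writer did anything at all.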

TomBener commented 3 days ago

Thanks @tarleb, it works. But I encountered an issue: the page size changed from A4 to US Letter after applying the custom Lua writer. The following XML, which comes from the reference docx passed via --reference-doc, was removed from document.xml:

<w:sectPr w:rsidR="00D3414C">
  <w:pgSz w:h="16840" w:w="11900" />
  <w:pgMar w:bottom="1440" w:footer="720" w:gutter="0" w:header="720" w:left="1440" w:right="1440" w:top="1440" />
  <w:cols w:space="720" />
  <w:docGrid w:linePitch="360" />
</w:sectPr>

BTW, is it possible to use this custom Lua writer with Quarto?

tarleb commented 3 days ago

Thanks @tarleb, it works. But I encountered an issue: the page size changed from A4 to US Letter after applying the custom Lua writer. The following XML, which comes from the reference docx passed via --reference-doc, was removed from document.xml:

Using --reference-doc with the custom writer should still be possible.

BTW, is it possible to use this custom Lua writer with Quarto?

I don't know, sorry.

TomBener commented 3 days ago

I've uploaded a folder with files for testing: lua-custom-writer-test.zip

With the same source input file test.md and custom.docx as the reference-doc, running the command:

pandoc test.md -o test.docx --reference-doc custom.docx

generated test.docx with the A4 page size. But running the command:

pandoc test.md -o test.docx --reference-doc custom.docx -t docx-no-eahints.lua

generated test.docx with the US Letter page size. By unzipping test.docx, I was able to confirm that the latter conversion removed the East Asian font hints, but it also unexpectedly removed the following XML that defines the page size:

<w:sectPr w:rsidR="005F2E0E" w:rsidSect="002F2276">
    <w:pgSz w:h="16840" w:w="11900" />
    <w:pgMar w:bottom="1440" w:footer="720" w:gutter="0" w:header="720" w:left="1440" w:right="1440" w:top="1440" />
    <w:cols w:space="720" />
    <w:docGrid w:linePitch="326" />
</w:sectPr>

This behavior seems weird, and I have no idea what the problem is. Could you please help diagnose the issue, @tarleb?

tarleb commented 3 days ago

Weird. I currently don't have time to debug this, but it would be nice to get to the bottom of this. Does the reference doc get applied at all?
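One way to narrow this down might be to compare the page-size element in the document.xml of both outputs. A minimal sketch, assuming the XML has already been read into a string; the helper name page_size is hypothetical. The w:pgSz values are in twentieths of a point, so A4 is roughly 11906 × 16838 and US Letter is 12240 × 15840:

```lua
-- Hypothetical diagnostic helper: returns the page width and height
-- (in twentieths of a point) from the first w:pgSz element,
-- or nil if the XML contains no such element.
function page_size(xml)
  local tag = xml:match('<w:pgSz[^>]*>')
  if not tag then return nil end
  -- Attributes are matched separately because their order may vary.
  local width = tonumber(tag:match('w:w="(%d+)"'))
  local height = tonumber(tag:match('w:h="(%d+)"'))
  return width, height
end
```

If this returns nil for the custom-writer output but 11900 and 16840 for the plain docx output, that would suggest the section properties are lost before or during the pandoc.write call, since the gsub pattern in the writer cannot match a w:sectPr element.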

TomBener commented 3 days ago

Weird. I currently don't have time to debug this, but it would be nice to get to the bottom of this. Does the reference doc get applied at all?

Never mind, it's not urgent. The reference doc was applied in both conversions. You can see them in the folder above.