jgm / pandoc

Universal markup converter
https://pandoc.org

Typography: localized quotation marks #10013

Open avidseeker opened 4 months ago

avidseeker commented 4 months ago

Explain the problem.

echo "العربية ---لغة العرب--- هي \"إحدى\" اللغات 'القديمة' وبالتالي..." \
    | pandoc -M lang=ar -f markdown -t rst -s

Result:

العربية —لغة العرب— هي “إحدى” اللغات ‘القديمة’ وبالتالي…

Expected: to have quotation marks as specified in this table.

العربية —لغة العرب— هي «إحدى» اللغات ‹القديمة› وبالتالي…

Pandoc version? v3.1.13

Extra details: I found this in the "Other relevant metadata fields" section:

A few other metadata fields affect bibliography formatting.

lang The lang field will affect how the style is localized, for example in the translation of labels, the use of quotation marks, and ...

It might be convenient to extend this so that it also covers typesetting done by the smart extension.

Edit: The markdown-it JavaScript parser has a similar typographer option, configured with:

  // Enable some language-neutral replacement + quotes beautification
  // For the full list of replacements, see https://github.com/markdown-it/markdown-it/blob/master/lib/rules_core/replacements.mjs
  typographer:  false,

  // Double + single quotes replacement pairs, when typographer enabled,
  // and smartquotes on. Could be either a String or an Array.
  //
  // For example, you can use '«»„“' for Russian, '„“‚‘' for German,
  // and ['«\xA0', '\xA0»', '‹\xA0', '\xA0›'] for French (including nbsp).
  quotes: '“”‘’',
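
As a rough illustration of the kind of replacement being requested, here is a minimal JavaScript sketch. `QUOTES_BY_LANG` and `localizeQuotes` are hypothetical names, not part of pandoc or markdown-it. The quote strings follow the markdown-it convention shown above, and the Arabic entry is taken from the expected output in this issue. The sketch only handles balanced straight quotes and does not touch dashes or ellipses.

```javascript
// Hypothetical lookup table: quote characters keyed by language tag.
// Each string is open-double, close-double, open-single, close-single,
// the same order as the markdown-it `quotes` option.
const QUOTES_BY_LANG = {
  en: '“”‘’',
  ar: '«»‹›', // assumption: taken from the expected output in this issue
  ru: '«»„“',
  de: '„“‚‘',
};

// Hypothetical helper, not an API of pandoc or markdown-it.
// Replaces balanced straight quotes; apostrophes inside words are left alone.
function localizeQuotes(text, lang) {
  const marks = QUOTES_BY_LANG[lang] || QUOTES_BY_LANG.en;
  const [openD, closeD, openS, closeS] = [...marks];
  return text
    // "..." pairs: any run of non-quote characters between two double quotes.
    .replace(/"([^"]*)"/g, (_, inner) => openD + inner + closeD)
    // '...' pairs: only when the opening quote follows start/whitespace/"("
    // and the closing quote is followed by end/whitespace/punctuation.
    .replace(/(^|[\s(])'([^']*)'(?=$|[\s.,;:!?)])/g,
      (_, before, inner) => before + openS + inner + closeS);
}

console.log(localizeQuotes('هي "إحدى" اللغات \'القديمة\'', 'ar'));
// → هي «إحدى» اللغات ‹القديمة›
```

Keying the table on the same `lang` value that is already passed as metadata would mean the source can keep plain `"` and `'` while the output gets the quotes appropriate for the language.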

tarleb commented 4 months ago

You can use the pandoc-quotes.lua filter for this.

jgm commented 4 months ago

cf #2620 #8283

jgm commented 4 months ago

To me it seems a bit strange that you'd want to use a foreign quote style in your source document (i.e. " instead of » in Arabic). I would assume that when you're using an Arabic keyboard layout, it's easy to type » directly, so why not do that?

avidseeker commented 4 months ago

The reason is the keyboard layout. Common Arabic keyboard layouts don't have a direct way of inputting » (although Xorg has one). Basically, since straight quotes are what keyboard layouts and websites actually provide, they should be regarded as semantic quotes (in the sense of HTML tags).

There has been a discussion about this in Arabic Wikipedia, and I think they had bots to replace " with ». Wikipedia editors would insert ", and bots take care of the typesetting.

bpj commented 4 months ago

@jgm It is pretty common, not to say universal, for keyboard layouts to lack the proper typographic quote marks for the language. Even Xorg hides them away in AltGr positions that most people aren't aware of. Several of its layouts also have a bug where the positions that should contain single angle quotes (which admittedly are rarely used) produce the less-than and greater-than symbols, which is totally redundant since there is already another key for them. That bug sits and waits for someone to take on the rather formidable task of fixing it in hundreds of layouts. My itch certainly isn't strong enough to do it. I have edited my .compose file to fix the problem.

avidseeker commented 3 months ago

The same might apply to numerals. A common set is the Eastern Arabic (Arabic-Indic) numerals ١٢٣٤٥٦٧٨٩٠, but there are others. See: this Wikipedia page.

LaTeX Babel has an option that changes how regular Arabic numerals (1234) are typeset:

\babelfont{rm}[ItalicFont=FreeSerif, Numbers=Arabic]{FreeSerif}

See lua-arabic.tex for an example document.
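
Outside LaTeX, the digit substitution described above could be sketched as a plain character mapping. This is a minimal JavaScript sketch under the assumption that ASCII digits should map one-to-one onto the Arabic-Indic digits U+0660 to U+0669; `toArabicIndicDigits` is a hypothetical name, not an existing pandoc or markdown-it function.

```javascript
// Hypothetical helper: map ASCII digits 0-9 to Arabic-Indic digits.
// Assumption: a one-to-one mapping onto U+0660..U+0669 is what is wanted;
// other numeral sets would need a different base code point.
function toArabicIndicDigits(text) {
  return text.replace(/[0-9]/g, (d) => String.fromCharCode(0x0660 + Number(d)));
}

console.log(toArabicIndicDigits('1234567890'));
// → ١٢٣٤٥٦٧٨٩٠
```

A real implementation would also have to skip digits in code, URLs, and similar contexts, which this sketch does not attempt.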