jgm / pandoc

Universal markup converter
https://pandoc.org

Recognition of LTR (Left to Right) Word(s) in an *RTL* document #5558

Open nima87 opened 5 years ago

nima87 commented 5 years ago

#5545

I think I explained my request (not really a problem with Pandoc) very clearly, but I will repeat it here; if you find any part ambiguous, please ask me to elaborate further. Attachment: fmpandoc.docx. GitHub doesn't support .tex files, so I couldn't upload the Pandoc-converted .tex file. Suppose you convert this file to LaTeX with Pandoc by:

pandoc -s fmpandoc.docx --wrap=none -t latex -o fmpandoc.tex

For the sake of simplicity, you can safely replace the generated preamble with these lines before \begin{document}:

\documentclass{article}
\usepackage{xepersian}% with xepersian the Bidi package is loaded.
\settextfont{Tahoma}% assuming you have a Unicode Tahoma font that contains Persian glyphs.

When compiling this file with xelatex, the English group is rendered in reverse; that is, (The Wild Flower Key. Frederick Warne \& Co. p. 310.) is rendered as (.310 p. Co. & Warne Frederick Key. Flower Wild The). If you put it inside the \lr{...} command, the order of the English sentence is rendered correctly. What I had in mind was not distinguishing both LTR and RTL words, only marking the LTR words. I asked whether it is possible to put LTR words inside an \lr{} command using Pandoc. I appreciate your efforts in creating and developing Pandoc; it is really great. Thank you. Best.
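For illustration, the manual fix in the generated .tex would look like this (a fragment only, assuming the xepersian preamble above):

```latex
% Illustrative fragment: wrapping the English run in \lr{...} restores
% left-to-right order inside the surrounding RTL paragraph.
نعنا (\lr{The Wild Flower Key. Frederick Warne \& Co. p. 310.}) گیاهی است
```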

jgm commented 5 years ago

It's unfortunate in this case that Word marks rtl but not ltr.

Here's a step toward a solution. Convert the Word file to markdown (even with the released version, since the rtl spans are useless here), and explicitly add spans marking the English passages, like so:

نعنا[^1] ([The Wild Flower Key. Frederick Warne & Co. p. 310.]{lang=en}) گیاهی است
علفی با ریشه هوایی و ساقه‌های مستقیم و چهارگوش و زیرزمینی. ساقه و
برگ‌های خوش‌بوی آن خوراکی و دارویی است و گاهی گل‌های رنگین دارد.

Now convert the resulting markdown file to a PDF, using

pandoc fmpandoc.md -o fmpandoc.pdf --pdf-engine=xelatex -Mlang=fa -Mmainfont=Tahoma

You should see proper alignment for the English and Persian phrases. (This approach uses pandoc's default, polyglossia.) It's maybe not quite what you want, and it requires manual intervention in an intermediate markdown file, but it's a start.

What we need is a way to automatically mark up the English bits as english (or alternatively as dir=ltr).

This could probably be done using a lua filter, but it's a bit complex since you have to put the span over multiple consecutive elements.
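A rough sketch of such a filter (untested illustration only; it assumes a recent pandoc with whole-`Inlines` filter support, and crudely treats any pure-ASCII run of `Str`/`Space` elements as English):

```lua
-- mark-english.lua (hypothetical name): wrap maximal runs of ASCII
-- Str/Space inlines in a single Span with lang=en.

local function is_ascii(s)
  -- true if the string contains no bytes outside the ASCII range
  return s:find("[\128-\255]") == nil
end

function Inlines(inlines)
  local out = pandoc.Inlines{}
  local run = pandoc.Inlines{}

  local function flush()
    if #run == 0 then return end
    local has_text = false
    for _, el in ipairs(run) do
      if el.t == "Str" then has_text = true end
    end
    if has_text then
      -- note: a trailing Space may end up inside the span;
      -- a fuller version would trim it off first
      out:insert(pandoc.Span(run, pandoc.Attr("", {}, {lang = "en"})))
    else
      out:extend(run)
    end
    run = pandoc.Inlines{}
  end

  for _, el in ipairs(inlines) do
    if el.t == "Str" and is_ascii(el.text) then
      run:insert(el)
    elseif el.t == "Space" and #run > 0 then
      run:insert(el)
    else
      flush()
      out:insert(el)
    end
  end
  flush()
  return out
end
```

It would be applied with something like `pandoc fmpandoc.docx --lua-filter=mark-english.lua ...`; punctuation attached to the English runs is the obvious weak spot of the ASCII heuristic.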

nima87 commented 5 years ago

@jgm Thank you. So you mean I go into the markdown file and wrap each instance of an English word or group of words in [...]{lang=en}. It's essentially what I have to do in TeX: put them inside \lr{...}. My current Emacs solution is semi-automatic. What I do with my tex file is use the Emacs query-replace-regexp:

M-x query-replace-regexp \(\ *\)\([[:ascii:]]*[a-zA-Z]\) RET \1\\lr{\2}

It incrementally finds and replaces the instances of English words. In some cases the boundaries of an English group are not recognized correctly, so I have to stop the query-replace-regexp, put the phrase in the argument of the \lr{} command manually, re-enter M-C-% [the key binding for query-replace-regexp] at that point, and continue the replacement until the next improper instance. Keep in mind that my document is 600 A4 pages, with several instances of English words on every single page. It would save a great deal of time if I could find an automatic solution in Pandoc, though even without this feature Pandoc is great for me. Just imagine what my task would consist of without Pandoc: reading line by line, finding the italic, bold, and underlined words (both English and Persian), and typing \textit{...}, \textbf{...}, \underline{...} by hand, let alone figures, tables, and much more. So Pandoc really helps. Have a nice time.

mb21 commented 5 years ago

If I understand correctly, the underlying problem is that Word doesn't have a representation for RTL documents (as LaTeX, HTML, and pandoc-markdown do). In Word, all documents are LTR documents, and some (or all) parts of the text are marked up as rtl.

But if you know it's actually a rtl-document, you could use a lua-filter to transform pandoc's internal document AST to what the LaTeX writer expects, as @jgm mentioned. With the current pandoc nightly-build, the filter would:

  1. wrap every piece of text that's not already in a span in a new span with dir=ltr
  2. remove every span with dir=rtl
  3. set the document's metadata to dir: rtl

It's not the most straightforward filter, but it shouldn't be impossible either.
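A partial sketch of steps 2 and 3 (untested illustration; it assumes the docx reader emits Spans carrying a dir=rtl attribute, as in the nightly build mentioned above; step 1, grouping unmarked runs into dir=ltr spans, requires collecting consecutive inline elements and is omitted here):

```lua
-- Step 2: once the document default is rtl, explicit rtl spans are
-- redundant, so unwrap them (returning nil leaves other spans alone).
function Span(el)
  if el.attributes["dir"] == "rtl" then
    return el.content
  end
end

-- Step 3: set the document's base direction in the metadata.
function Meta(meta)
  meta.dir = "rtl"
  return meta
end
```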

nima87 commented 5 years ago

@mb21 I'm not a programmer, but as far as I know neither markdown nor LaTeX has annotations for RTL. In the case of LaTeX, if you load a package called bidi, you can mark RTL text and print it in the output. I think what you suggest is practical, but there are complications, such as punctuation. I would have to find a way to recognize a block of text as a single object. Punctuation marks, for example periods, colons, and semicolons, always stick to the preceding character and are separated from the next character by a space. Maybe specifying a set of ASCII codes plus punctuation character codes would work.

jgm commented 5 years ago

@jkr could I get your thoughts on the feature I added in ad9770fe86d0f7d9e8ccfe09eada1e7a60ef3d25 ? I see you earlier added some rtl support for the docx writer. I want to make sure this is compatible and makes sense.

jkr commented 5 years ago

The code itself looks good. I do remember, though, that there were some subtleties that kept me from implementing it (or stalled out my motivation):

https://github.com/jgm/pandoc/issues/3147

It seems likely that your implementation will take care of the majority of cases. What we want to handle (taking English and Arabic as example languages) is something like:

  1. produced by an English-locale Word, all in English (we already do this).
  2. produced by an English-locale Word, with a quote in Arabic (your changes would do this).
  3. produced by an English-locale Word, all in Arabic (I think your changes would do this).
  4. produced by an Arabic-locale Word, all in Arabic (?)
  5. produced by an Arabic-locale Word, with a quote in English (?)
  6. produced by an Arabic-locale Word, all in English (?)

The locale is mainly important here because of how the default bidi and rtl settings come into play.

Offhand, I'm not sure whether your changes would cover these bases. Unfortunately it's a bit of a hectic week, so I might not get to look at it more closely for a few days. But this sounds like a job for TDD anyway. My brother-in-law is a Hebrew philologist working in the UK, and I think he works on both English and Israeli computers, so I might be able to get the above collection of docs from him, if that would help.

jgm commented 5 years ago

Yes, it would be helpful to have some real-world test documents. What we'd ideally like is to detect the "default" setting of the ltr attribute for the document. Then we could set Just LTR instead of Nothing for the unmarked bits in a document whose default is rtl. But I didn't see anything obvious in the document linked above that says "the default for this is rtl."