Open nima87 opened 5 years ago
It's unfortunate in this case that Word marks rtl but not ltr.
Here's a step toward a solution. Convert the Word document to markdown (even with the released version, since the rtl spans are useless here), and explicitly add spans marking the English passages, like so:
نعنا[^1] ([The Wild Flower Key. Frederick Warne & Co. p. 310.]{lang=en}) گیاهی است
علفی با ریشه هوایی و ساقههای مستقیم و چهارگوش و زیرزمینی. ساقه و
برگهای خوشبوی آن خوراکی و دارویی است و گاهی گلهای رنگین دارد.
Now convert the resulting markdown file to a PDF, using
pandoc fmpandoc.md -o fmpandoc.pdf --pdf-engine=xelatex -Mlang=fa -Mmainfont=Tahoma
You should see proper alignment for the English and Persian phrases. (This approach uses pandoc's default polyglossia.) Maybe not quite what you want, and requires manual intervention via an intermediary markdown file, but a start.
What we need is a way to automatically mark up the English bits as English (or alternatively as dir=ltr).
This could probably be done using a lua filter, but it's a bit complex since you have to put the span over multiple consecutive elements.
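As a rough illustration of the "automatically mark up the English bits" idea, here is a minimal Python sketch that preprocesses the intermediate markdown text instead of using a Lua filter. The function name `wrap_latin_runs` and the regular expression are assumptions made for this example, not anything pandoc provides. It only looks at plain prose: it does not skip code blocks, URLs, or attributes that are already present, so running it twice on the same file would damage existing `{lang=en}` attributes.

```python
import re

# A run starts at an ASCII letter, may continue over ASCII letters, digits,
# spaces and a small set of punctuation, and must end on a letter, digit or
# period. Brackets and parentheses are excluded on purpose, so footnote
# references such as [^1] and surrounding parentheses stay outside the span.
LATIN_RUN = re.compile(r"[A-Za-z](?:[A-Za-z0-9 .,;:&'/-]*[A-Za-z0-9.])?")


def wrap_latin_runs(text, lang="en"):
    """Wrap each Latin-script run in a pandoc bracketed span: [run]{lang=en}."""
    return LATIN_RUN.sub(
        lambda m: "[" + m.group(0) + "]{lang=" + lang + "}", text
    )


line = "نعنا[^1] (The Wild Flower Key. Frederick Warne & Co. p. 310.) گیاهی است"
print(wrap_latin_runs(line))
```

Applied to the first line of the sample above, this reproduces the hand-written span. A sentence-final period that follows an English word ends up inside the span, which may or may not be what you want for a Persian sentence.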
@jgm Thank you. So you mean I should go into the markdown file and wrap each English word or group of words in ([...]{lang=en}). That is actually what I have to do in TeX: put them in \lr{...}. My current solution with Emacs is semi-automatic. I run query-replace-regexp on my TeX file:
M-x query-replace-regexp \(\ *\)\([[:ascii:]]*[a-zA-Z]\) RET \1\\lr{\2}
It incrementally finds and replaces the instances of English words. In some cases the boundaries of an English group are not recognized correctly, so I have to stop query-replace-regexp, put the phrase in the argument of the \lr{} command manually, then press M-C-% (the key binding for query-replace-regexp) at that point and continue the replacement until the next improper instance. Keep in mind that my document is 600 A4 pages and every single page has several instances of English words. An automatic solution in Pandoc would have saved a great deal of time, but even without this feature Pandoc is great for me. Just imagine what my task would have been without Pandoc: read line by line, find every italic, bold, or underlined word, both English and Persian, and type \textit{...}, \textbf{...}, \underline{...}, let alone figures, tables, and much more. So Pandoc really helps.
Have a nice time.
If I understand correctly, the underlying problem is that Word doesn't have a representation for rtl-documents (like LaTeX, HTML and pandoc-markdown have). In Word, all documents are ltr-documents, and some (or all) parts of the text are marked up as rtl.
But if you know it's actually an rtl-document, you could use a lua-filter to transform pandoc's internal document AST to what the LaTeX writer expects, as @jgm mentioned. With the current pandoc nightly-build, the filter would wrap the unmarked text in spans with dir=ltr, unwrap the spans marked dir=rtl, and set dir: rtl in the document metadata.
It's not the most straight-forward filter, but shouldn't be impossible either.
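The transformation described above can be sketched against pandoc's JSON AST with nothing but the Python standard library. This is a simplified illustration under stated assumptions, not a finished filter: it assumes the docx reader emits rtl runs as Span elements carrying a `dir=rtl` attribute, it only touches top-level Para and Plain blocks (lists, block quotes, headers and tables are left alone), and the names `fix_inlines`, `transform` and `main` are made up for this example.

```python
import json
import sys


def is_rtl_span(inline):
    """True for a Span whose attributes contain dir=rtl."""
    if inline.get("t") != "Span":
        return False
    attr = inline["c"][0]  # [identifier, classes, key-value pairs]
    return ["dir", "rtl"] in attr[2]


def ltr_span(inlines):
    """Build a Span with dir=ltr around the given inlines."""
    return {"t": "Span", "c": [["", [], [["dir", "ltr"]]], inlines]}


def fix_inlines(inlines):
    """Unwrap dir=rtl spans and wrap each run of unmarked inlines in dir=ltr."""
    out, pending = [], []

    def flush():
        if not pending:
            return
        if all(i["t"] in ("Space", "SoftBreak") for i in pending):
            out.extend(pending)  # whitespace only: no span needed
        else:
            out.append(ltr_span(list(pending)))
        pending.clear()

    for inline in inlines:
        if is_rtl_span(inline):
            flush()
            out.extend(inline["c"][1])  # keep the contents, drop the span
        else:
            pending.append(inline)
    flush()
    return out


def transform(doc):
    """Apply fix_inlines to top-level paragraphs and set dir: rtl in metadata."""
    for block in doc["blocks"]:
        if block["t"] in ("Para", "Plain"):
            block["c"] = fix_inlines(block["c"])
    doc["meta"]["dir"] = {"t": "MetaInlines", "c": [{"t": "Str", "c": "rtl"}]}
    return doc


def main():
    # Call main() at the bottom of the file to use this as a JSON filter.
    json.dump(transform(json.load(sys.stdin)), sys.stdout)
```

With `main()` called at the end of a script (say a hypothetical `rtl_fix.py`), it could sit in a pipeline such as `pandoc input.docx -t json | python rtl_fix.py | pandoc -f json -o output.pdf --pdf-engine=xelatex`. Note that whitespace adjacent to an unmarked run ends up inside the dir=ltr span; a real filter would probably want to trim it.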
@mb21 I'm not a programmer, but as far as I know neither markdown nor LaTeX has annotations for rtl. In LaTeX, if you load a package called bidi you can mark rtl text and have it printed correctly in the output. I think what you suggest is practical, but there are complications such as punctuation. I would need a way to recognize a block of text as a single object. Punctuation marks such as periods, colons, and semicolons always stick to the preceding character and are separated from the next character by a space. Maybe the filter could be given a set of ASCII character codes plus the punctuation character codes.
@jkr could I get your thoughts on the feature I added in ad9770fe86d0f7d9e8ccfe09eada1e7a60ef3d25 ? I see you earlier added some rtl support for the docx writer. I want to make sure this is compatible and makes sense.
The code itself looks good. I do remember, though, that there were some subtleties that kept me from implementing it (or stalled out my motivation):
https://github.com/jgm/pandoc/issues/3147
It seems likely that your implementation will take care of the majority. What we want to handle (taking English and Arabic as example languages) is something
The locale is mainly important here, because of how the default bidi and rtl settings would pop up.
Offhand, I'm not sure if your changes would cover these bases. Unfortunately it's a bit of a hectic week, so I might not get to look at it more closely for a few days. But this sounds like a job for TDD anyway. My brother-in-law is a Hebrew philologist working in the UK, and I think he works on both English and Israeli computers, so I might be able to get the above collection of docs from him, though, if that would help.
Yes, it would be helpful to have some real-world test documents.
What we'd ideally like is to detect the "default" setting of the ltr attribute for the document. Then we could set Just LTR instead of Nothing for the unmarked bits in a document whose default is rtl. But I didn't see anything obvious in the document linked above that says "the default for this is rtl."
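When the file itself does not declare a default, one possible fallback is to guess the base direction from the text. The sketch below is a heuristic substituted for illustration, not what pandoc's docx reader does: it counts characters with strong bidirectional classes using Python's `unicodedata` module and returns the majority direction. The function name `guess_base_direction` is hypothetical, and returning `None` for text without strong characters loosely mirrors the `Nothing` case mentioned above.

```python
import unicodedata


def guess_base_direction(text):
    """Return "rtl", "ltr", or None, based on counts of strong bidi characters."""
    rtl = ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew, Arabic, Persian letters
            rtl += 1
        elif bidi_class == "L":        # Latin and other left-to-right letters
            ltr += 1
    if rtl == 0 and ltr == 0:
        return None                    # digits, spaces, punctuation only
    return "rtl" if rtl > ltr else "ltr"
```

A majority vote can be wrong for a short rtl document with long English quotations, so a user-supplied setting (for example `-M dir=rtl`) should probably take precedence over any guess.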
I think I explained my request (which is not really a problem with Pandoc) very clearly, but I'll repeat it here; if you find any part ambiguous, please ask me to elaborate. fmpandoc.docx GitHub doesn't support tex files, so I couldn't upload the tex file converted by Pandoc. Suppose you convert this file to tex with Pandoc by:
For the sake of simplicity you can safely remove your tex preamble and add these lines before \begin{document}:
When compiling this file with xelatex, the English group is rendered in reverse order, that is, (The Wild Flower Key. Frederick Warne \& Co. p. 310.) is rendered as (.310 p. Co. & Warne Frederick Key. Flower Wild The). If you put it inside the \lr{...} command, the English sentence is rendered in the correct order. What I had in mind was not distinguishing LTR and RTL words, only marking the LTR words. I asked whether it was possible to put ltr words inside an \lr{} command using Pandoc. I would like to thank you for your efforts in creating and developing Pandoc; it is really great. Thank you. Best.
\LTRfootnote{} and if it contains both rtl and ltr it is \RTLfootnote{}.
bidi pkg, you wouldn't need xepersian. You will have to define a Persian font family and put your RTL words in the \RL{...} command or your paragraph in \begin{RTL}...\end{RTL} (case-sensitive), of course with the inclusion of your Persian font command in both. In case you are interested, I could upload an MWE.