jgm / pandoc

Universal markup converter
https://pandoc.org

Write filter to support right-to-left direction in Persian text. #2191

Closed khajavi closed 8 years ago

khajavi commented 9 years ago

I need to convert the Persian text like this:

# عنوان اول
این متن فارسی باید راست به چپ نشان داده شود.

This is the English paragraph, so it's direction in html should be left-to-right.

To HTML like this:

<h1 dir="rtl">عنوان اول</h1>
<p dir="rtl">این متن فارسی باید راست به چپ نشان داده شود.</p>
<p>This is the English paragraph, so it's direction in html should be left-to-right.</p>

Could anyone help me write a proper Pandoc filter in Haskell to solve this problem?

jgm commented 9 years ago

One option would be to check each block element for Persian letters (Text.Pandoc.Walk.query could be used). If they are present, the block could be converted to

Div ("",[],[("dir","rtl")]) [b]

where b is the original block.

This would give you output in HTML like

<div dir="rtl">
<h1>عنوان اول</h1>
</div>
<div dir="rtl">
<p>این متن فارسی باید راست به چپ نشان داده شود.</p>
</div>
<p>This is the English paragraph, so it's direction in html should be left-to-right.</p>

I don't know if this would work in browsers. If not, you could probably add some CSS so that an h1 or p contained in a div with dir="rtl" also gets right-to-left direction.
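
The wrapping step described above can be sketched as a JSON filter. This is a minimal illustration in Python using only the standard library, not the Haskell filter that was asked for. It assumes the JSON layout of recent pandoc versions (a top-level object with a "blocks" list), and the names has_rtl, collect_text, wrap_rtl and transform are made up for this sketch.

```python
import json
import sys
import unicodedata

# Unicode bidi classes of right-to-left letters (Hebrew-like and Arabic-like).
RTL_CLASSES = {"R", "AL"}


def has_rtl(text):
    """Return True if any character has a right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def collect_text(node):
    """Concatenate every Str payload found anywhere under a JSON AST node."""
    if isinstance(node, dict):
        if node.get("t") == "Str":
            return node["c"]
        return collect_text(node.get("c"))
    if isinstance(node, list):
        return "".join(collect_text(child) for child in node)
    return ""  # numbers, attribute strings, None


def wrap_rtl(block):
    """Wrap a block in Div ("", [], [("dir", "rtl")]) if it has RTL letters."""
    if has_rtl(collect_text(block)):
        return {"t": "Div", "c": [["", [], [["dir", "rtl"]]], [block]]}
    return block


def transform(doc):
    """Apply wrap_rtl to each top-level block of a pandoc JSON document."""
    doc["blocks"] = [wrap_rtl(block) for block in doc["blocks"]]
    return doc


def main():
    """Entry point for use as a filter: JSON on stdin, JSON on stdout."""
    json.dump(transform(json.load(sys.stdin)), sys.stdout)
```

To try it as a filter, one would call main() at the bottom of the script and pass the script to pandoc with --filter. Only top-level blocks are inspected, so nested blocks (for example inside block quotes) would need a recursive walk.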

mb21 commented 9 years ago

If you don't want to write a filter as jgm recommended, you can always mark it up manually:

# عنوان اول {dir=rtl}

<div dir=rtl>این متن فارسی باید راست به چپ نشان داده شود.</div>

This is the English paragraph, so it's direction in html should be left-to-right.

You might also be interested in the RTL discussion on talk.commonmark.org.

njbart commented 9 years ago

I think dealing with languages and directionality should become a functionality of pandoc itself rather than being delegated to filters.

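My suggestion would be to primarily rely on language tags in pandoc markdown:

  • the existing lang: fa-IR in the document’s metadata for declaring the main language of the document.
  • <div lang="…">…</div> for longer and
  • <span lang="…">…</span> for shorter sections in a language different from the main language.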

Most of this is already available in pandoc:

pandoc -s -t html << EOT

# عنوان اول

.این متن فارسی باید راست به چپ نشان داده شود

<span lang=en-US>This is the English paragraph, so its direction in html should be left-to-right.</span>

.این متن فارسی باید راست به چپ نشان داده شود
---
lang: en-US, fa-IR
...

EOT

generates

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US, fa-IR" xml:lang="fa-IR">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title></title>
  <style type="text/css">code{white-space: pre;}</style>
</head>
<body>
<h1 id="عنوان-اول">عنوان اول</h1>
<p>.این متن فارسی باید راست به چپ نشان داده شود</p>
<p><span lang="en-US">This is the English paragraph, so its direction in html should be left-to-right.</span></p>
<p>.این متن فارسی باید راست به چپ نشان داده شود</p>
</body>
</html>

… which doesn’t look too bad as is, except that lang="en-US, fa-IR" should be replaced by lang="fa-IR" (just one main language per document), and that the full stop appears to the right of the Farsi sentences rather than to their left, in both Firefox and Safari (ideas on this, anyone?).

Unless declared explicitly, pandoc could then infer directionality from these language tags, and write, e.g.,

…
<html xmlns="http://www.w3.org/1999/xhtml" lang="fa-IR" xml:lang="fa-IR" dir="rtl">
…
<p><span lang="en-US" dir="ltr">This is the English paragraph, so its direction in html should be left-to-right.</span></p>
…

If xml:lang tags are needed, they could be added during this step, too.

For latex output, pandoc would just have to map lang: en-US, fa-IR to

  \setmainlanguage{farsi}
  \setotherlanguages{english}

and <div lang="fa-IR">…</div> to \begin{farsi}…\end{farsi}, and <span lang="fa-IR">…</span> to \textfarsi{…} (no directionality tags needed for latex).
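
The LaTeX mapping described above could be sketched as a small lookup table. This is an illustration only, written in Python rather than pandoc's Haskell. POLYGLOSSIA_NAMES, polyglossia_name and latex_language_setup are hypothetical names, and the table covers only the two languages mentioned in this thread.

```python
# Hypothetical lookup from primary language subtag to the polyglossia name
# used in this thread; a real table would need many more entries.
POLYGLOSSIA_NAMES = {"en": "english", "fa": "farsi"}


def polyglossia_name(tag):
    """Map a tag such as 'fa-IR' to a polyglossia language name."""
    primary = tag.split("-")[0].lower()
    return POLYGLOSSIA_NAMES[primary]  # raises KeyError for unmapped languages


def latex_language_setup(main, others):
    """Build the preamble lines for one main and several other languages."""
    lines = ["\\setmainlanguage{%s}" % polyglossia_name(main)]
    if others:
        names = ",".join(polyglossia_name(tag) for tag in others)
        lines.append("\\setotherlanguages{%s}" % names)
    return "\n".join(lines)


print(latex_language_setup("fa-IR", ["en-US"]))
```

Raising an error for an unmapped tag is a deliberate choice in this sketch; silently dropping the language would hide the gap in the table.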

ousia commented 9 years ago

@nickbart1980, wasn’t otherlang supposed to be included for LaTeX?

As far as I can understand, language direction may be specified in CSS:

:lang(fa-IR) {
   direction: rtl;
}

njbart commented 9 years ago

Yes, for LaTeX a comma-separated list in the metadata variable lang is parsed into mainlang (last item) and otherlang (all others), but the values, e.g., en-US, fa-IR are not mapped yet to what polyglossia (and babel) expect, e.g. english, farsi. That's one thing that would be great to have fixed.

However, mainlang and otherlang are not available in any other formats than LaTeX (or else we could simply use mainlang in the html template). A fix for this would be great, too.

As to CSS, I’m not quite sure. Adding your snippet to my HTML document above looks ok in a browser (again, with the exception of the full stops).

On the other hand, https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/dir recommends “As the directionality of the text is semantically related to its content and not to its presentation, it is recommended that web developers use this attribute [dir] instead of the related CSS properties when possible. That way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated.”

ousia commented 9 years ago

> Yes, for LaTeX a comma-separated list in the metadata variable lang is parsed into mainlang (last item) and otherlang (all others), but the values, e.g., en-US, fa-IR are not mapped yet to what polyglossia (and babel) expect, e.g. english, farsi. That's one thing that would be great to have fixed.

There is an issue (#1614) exactly on this topic. It may make sense to add comments there (so developers see the real demand for this fix).

> However, mainlang and otherlang are not available in any other formats than LaTeX (or else we could simply use mainlang in the html template). A fix for this would be great, too.

Where is the fix needed? I must confess that I still don’t get it (we have already discussed this at #2174).

How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?

> As to CSS, I’m not quite sure. Adding your snippet to my HTML document above looks ok in a browser (again, with the exception of the full stops).

I wonder whether this would work also with full stops:

:lang(fa-IR) {
   direction: rtl;
   unicode-bidi: bidi-override;
}

> On the other hand, https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/dir recommends “As the directionality of the text is semantically related to its content and not to its presentation, it is recommended that web developers use this attribute [dir] instead of the related CSS properties when possible. That way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated.”

The reasoning behind this recommendation would lead one to avoid as many CSS properties as possible: “[t]hat way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated”.

I don’t see why the direction should also be included in HTML (besides the language markup), if a given language can only have one direction.

njbart commented 9 years ago

> How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?

That’s not so great since you would always have to tweak the source file depending on the target format. Parsing lang into mainlang and otherlang (or, alternatively, discarding all items in lang except the last for target formats that cannot ever use otherlang for any purpose) makes more sense.

> I wonder whether this would work also with full stops:
>
> :lang(fa-IR) {
>    direction: rtl;
>    unicode-bidi: bidi-override;
> }

Unfortunately, no.

khajavi commented 9 years ago

John, this solution is good for converting Markdown to HTML, but what is the general solution? It would be better for Pandoc to handle right-to-left text as a built-in feature. So I think my question (about writing a filter) was not a good one.


khajavi commented 9 years ago

Writing lang tags explicitly in technical documents is cumbersome: in technical documents whose main language is Persian, we often need to write in English. It would be better if Pandoc checked whether each paragraph is written in Persian or in English and created the proper tag accordingly.


ousia commented 9 years ago

> > How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?
>
> That’s not so great since you would always have to tweak the source file depending on the target format.

@nickbart1980, I don’t think so. Let’s consider the following sample:

---
lang: en
otherlang: grc, la
...

<span lang="grc">χαλεπὰ τὰ καλά</span> was the ancient Greek saying to
state that beauty is difficult to attain.

Occam’s razor reads: <span lang="la">«entia non sunt multiplicanda sine
necessitate»</span>

If you have to tweak the source depending on your target, this isn’t due to the language information in the metadata. It has to do with the lack of translation among different language identification values (#1614), the non-existent special syntax for language attributes (#895) and the missing syntax for raw division and raw inline elements (#168).

> Parsing lang into mainlang and otherlang (or, alternatively, discarding all items in lang except the last for target formats that cannot ever use otherlang for any purpose) makes more sense.

To the best of my knowledge, pandoc has four variables (metadata fields) to include language information in the metadata:

Applying Occam’s razor to these variables, I think it would read: “do not create any language variable unless strictly required”.

I agree that lang is required to specify the primary language in the document. And otherlang is required by polyglossia and babel in LaTeX.

But I think that adapting lang to the way the exception (LaTeX) works is the wrong path, because it is easier to add all secondary languages in a variable created especially for LaTeX (otherlang).

My final question is: what is wrong (or what needs to be fixed) with using lang for the main language (as it is [or would be, once fixed] required for HTML, ePub, ConTeXt, OpenDocument and .docx) and reserving otherlang for LaTeX?

ousia commented 9 years ago

> Writing lang tags explicitly in technical documents is cumbersome: in technical documents whose main language is Persian, we often need to write in English. It would be better if Pandoc checked whether each paragraph is written in Persian or in English and created the proper tag accordingly.

@khajavi, I don’t think I understand your proposal.

But first of all, why do you need language markup? If you only require it for text direction, I wonder whether this could be achieved without language or direction tagging. It is only a guess, but isn’t the Unicode bidirectional algorithm supposed to deal with this?

If you need markup for hyphenation or another language-dependent feature, then you need to mark up languages.

khajavi commented 9 years ago

@ousia

I need language markup for text direction in outputs like HTML and LaTeX (mainly HTML). Without language markup, how can I do that? With the Unicode bidi algorithm? Could you explain more?

My proposal is that pandoc be able to detect the language of the text (here English or Persian) and then mark the paragraph direction as ltr or rtl.
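
One simple way to approximate this proposal is a first-strong-character heuristic, similar in spirit to how HTML's dir="auto" picks a base direction. The sketch below is an assumption-laden illustration in Python, not something pandoc does; base_direction is a hypothetical helper name. Note that it detects the direction of the script, not the language.

```python
import unicodedata


def base_direction(text, default="ltr"):
    """Return 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-like or Arabic-like letter
            return "rtl"
        if bidi == "L":  # left-to-right letter
            return "ltr"
    # Only digits, punctuation or spaces: fall back to the caller's default.
    return default


for paragraph in ["این متن فارسی است.", "This is the English paragraph."]:
    print(base_direction(paragraph))
```

Punctuation and digits are skipped because they have no strong direction of their own, so a paragraph that starts with a full stop is still classified by its first letter.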


jgm commented 9 years ago

+++ Pablo Rodríguez [Jun 01 15 09:12 ]:

> > Yes, for LaTeX a comma-separated list in the metadata variable lang is parsed into mainlang (last item) and otherlang (all others), but the values, e.g., en-US, fa-IR are not mapped yet to what polyglossia (and babel) expect, e.g. english, farsi. That's one thing that would be great to have fixed.
>
> There is an issue (#1614) exactly on this topic. It may make sense to add comments there (so developers see the real demand for this fix).

I think it's a good idea.

> > However, mainlang and otherlang are not available in any other formats than LaTeX (or else we could simply use mainlang in the html template). A fix for this would be great, too.
>
> Where is the fix needed? I must confess that I still don’t get it (we have already discussed this at #2174).
>
> How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?

This makes sense to me.

njbart commented 9 years ago

> > How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?
>
> This makes sense to me.

What it boils down to is, do we want

---
lang: fr-FR, en-US, fa-IR
...

where lang is parsed into mainlang (or just lang; containing fa-IR) and otherlang (containing fr-FR, en-US); or do we want

---
lang: fa-IR
otherlang: fr-FR, en-US
...

Both will work nicely with all formats (as soon as the latex writer maps fr-FR to french, etc.). Since it's shorter, I have a slight preference for the first option.
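
To make the comparison concrete, here is a small Python sketch showing that both metadata layouts can be normalized to the same pair of main language and other languages. It is purely illustrative: split_languages is a hypothetical helper, and the metadata is modelled as a plain dict of strings rather than pandoc's real metadata type.

```python
def split_languages(meta):
    """Return (main, others) from a metadata dict using either layout."""
    langs = [t.strip() for t in meta.get("lang", "").split(",") if t.strip()]
    others = [t.strip() for t in meta.get("otherlang", "").split(",") if t.strip()]
    if not langs:
        return None, others
    # First option: the last item of a multi-valued lang is the main language.
    # Second option: lang holds one item, so langs[:-1] is simply empty.
    return langs[-1], langs[:-1] + others


print(split_languages({"lang": "fr-FR, en-US, fa-IR"}))
print(split_languages({"lang": "fa-IR", "otherlang": "fr-FR, en-US"}))
```

Both calls yield fa-IR as the main language with fr-FR and en-US as the others, so the choice between the two layouts is about what the source file looks like, not about what a writer could derive from it.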

ousia commented 9 years ago

@nickbart1980, many thanks for your reply.

I’m afraid that the first proposal doesn’t behave as you expect in pandoc-1.14.0.1.

---
lang: grc, it, fr, en, de, es
...

multiple languages

This gives the following HTML element:

<html xmlns="http://www.w3.org/1999/xhtml"
lang="grc, it, fr, en, de, es"
xml:lang="grc, it, fr, en, de, es">

In XML, lang or xml:lang should have only one value.

Of all the formats that support language markup, only LaTeX needs the list of languages used in the document. This shouldn’t be the default in the way pandoc metadata deals with languages. This is the reason the otherlang variable makes sense.

And this is the reason why there is nothing to fix here. lang should only be used with a single language value.

BTW, the proposal doesn’t work even with LaTeX (the final comma after the last language is wrong):

\documentclass[grc, it, fr, en, de, es,]{article}

If the LaTeX writer needs to be adapted to the way pandoc works, this should be done. But it is crazy to adapt pandoc to the way LaTeX works. (At least, one writer is easier to do than many writers.)

mb21 commented 9 years ago

Note that language and directionality are two independent properties and shouldn't be conflated:

> there is not always a one-to-one mapping between language and script, and therefore directionality. For example, Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts, and the language code az can be relevant for either.

> In some scripts, such as Arabic and Hebrew, displayed text is read predominantly from right to left, although within that flow, numbers and text from other scripts are displayed from left to right.

The pandoc document metadata should have lang, otherlang and dir properties (the global dir sets the base direction). Additionally, we need the writers to properly convert the dir attribute on at least spans and divs to locally change the directionality of some ranges of text.

mb21 commented 9 years ago

@ousia btw, no-NO and nb-NO should be “norsk”, not “nynorsk” AFAIK

ousia commented 9 years ago

> @ousia btw, no-NO and nb-NO should be “norsk”, not “nynorsk” AFAIK

Totally right (although the list belongs to #1614).

BTW, will the dir metadata field be created?

mb21 commented 9 years ago

As I said over at commonmark discuss, I think we should be fine with supporting spans and divs with dir attributes.

In ConTeXt, we can use \righttoleft{my span content}, \startalignment[righttoleft] my div content \stopalignment and \setupalign[righttoleft] for the base direction of the document.

When using the bidi package (which only works with XeLaTeX as far as I know), they are \RL, \setRL and \usepackage[RTLdocument]{bidi} respectively.

So what about pdfLaTeX and LuaLaTeX? I guess we can forget about the former, but it would be good if we could output the same commands for both LuaLaTeX and XeLaTeX. Maybe we can redefine them somehow in our LaTeX template, provided there is a general-purpose rtl/bidi package for LuaLaTeX (not only for Arabic or only for Farsi). Is there one? Otherwise, we'll just have to tell people to use either XeLaTeX or ConTeXt. Maybe @khaledhosny can shed some light on these questions, please? :)

ousia commented 9 years ago

@mb21, as commented in #1614, do you really think that dir has to be included in the document?

If each language has one and only one direction (and the number of languages is finite), I guess pandoc should assign direction to the language internally.

Consider a dissertation on Arabic literature written in English (or any Western language). It may easily have over a thousand passages in Arabic.

Which do you think is easier to type: [Arabic text]{:ar} or [Arabic text]{dir="rtl" lang="ar"}? Which method do you think may lead to more typing mistakes?

With ConTeXt, I typeset a book in Spanish that had about a thousand passages in ancient Greek. And I really was relieved by the fact that I didn’t have to tag any of these texts. (Just in case you wonder, \setuplanguage[es][patterns={es, agr}].)

mb21 commented 9 years ago

As I wrote above, language and scripts are two independent properties and shouldn't be conflated, e.g. Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts.

But I think it's a good idea to introduce [Arabic text]{:ar} (or a similar simplistic syntax) as a shorthand for (and converted already by the Markdown reader to) [Arabic text]{dir="rtl" lang="ar"}. But I'd say that's a separate issue—indeed it's #895.

ousia commented 9 years ago

> As I wrote above, language and scripts are two independent properties and shouldn't be conflated, e.g. Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts.

@mb21, I think there are different issues involved here:

There is a question about languages that may use different scripts that I don’t understand.

Language markup is relevant for applying resources to the tagged text, such as hyphenation dictionaries. How would you apply the right hyphenation dictionary for a language that may use more than one script if the language tag itself doesn’t indicate which script is in use? Directionality doesn’t help much here.

This is why I think that dir shouldn’t be included in the document.

> But I think it's a good idea to introduce [Arabic text]{:ar} [...] But I'd say that's a separate issue—indeed it's #895.

I know they are different issues, but they are also related.

I wanted to discuss the issue of a simplified or special language attribute so that it could be implemented at the same time as this issue (the original issue has been open for almost 26 months).

mb21 commented 9 years ago

> The link you provided is relevant for (X)HTML markup, but I don’t think it is mandatory for any text markup dealing with languages.

True, but I think the (X)HTML folks have put a lot of thought into their docs and HTML remains one of the primary output targets of pandoc. Compared to LaTeX and ConTeXt their approach is much less of a mess and based on ISO standards. That's why I propose to base pandoc's model on the HTML model.

But yeah, I guess pandoc could extract a script tag from the BCP 47 string, yet this would require us to come up with (and maintain) a long list of language-to-script and script-to-direction mappings. I'm sure it's doable and if @jgm is in favour and someone gets around to implementing it, why not? Meanwhile, mirroring the HTML model provides a working model, relatively simply.
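
A rough idea of what such mappings might look like, as a Python sketch. The two tables are deliberately tiny and their entries are assumptions chosen for illustration; SCRIPT_DIRECTION, DEFAULT_SCRIPT, script_of and direction_of are hypothetical names, not anything pandoc provides.

```python
# Illustrative, incomplete tables; the entries are assumptions for this sketch.
SCRIPT_DIRECTION = {"Arab": "rtl", "Hebr": "rtl", "Latn": "ltr", "Cyrl": "ltr"}
DEFAULT_SCRIPT = {"fa": "Arab", "ar": "Arab", "he": "Hebr", "en": "Latn"}


def script_of(tag):
    """Return the explicit script subtag of a tag, else a default, else None."""
    subtags = tag.split("-")
    for sub in subtags[1:]:
        # In BCP 47 the script subtag is four letters, e.g. 'Arab' in 'az-Arab'.
        if len(sub) == 4 and sub.isalpha():
            return sub.title()
    return DEFAULT_SCRIPT.get(subtags[0].lower())


def direction_of(tag):
    """Return 'rtl', 'ltr', or None when the tag gives no basis for a guess."""
    return SCRIPT_DIRECTION.get(script_of(tag))


for tag in ["fa-IR", "az-Arab", "az-Latn", "az"]:
    print(tag, direction_of(tag))
```

Azerbaijani is left out of DEFAULT_SCRIPT on purpose: a bare az returns None, which is exactly the case where an explicit dir attribute would still be needed.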

mb21 commented 8 years ago

To clarify, now you can write:

---
dir: rtl
---

# عنوان اول

این متن فارسی باید راست به چپ نشان داده شود.

<div dir="ltr">
This is an English paragraph, so its direction in html should be left-to-right.
</div>

As soon as native syntax for div (#168) and span (e.g. [my text]{dir=ltr}) become available, you'll be able to use those instead.