jgm / pandoc

Universal markup converter
https://pandoc.org

lang attribute fits for latex but not for html lang-attribute-value #1614

Closed maybegeek closed 9 years ago

maybegeek commented 10 years ago

Hi there,

for documents in German I use

the first being for babel, the second for polyglossia (I mainly use xelatex). If I convert to HTML with pandoc, the lang attribute (ngerman) gets used, but ngerman is not a valid value for the HTML attribute (nor would german be).

HTML needs something like de or de-DE.

The documentation says "lang: language code for HTML or LaTeX documents".

So the documentation is completely right: you can use it for one or the other, just not for both at the same time.

Perhaps this needs clarification in the docs and/or a different approach. Meanwhile, you can always override the YAML attributes in your central file with command-line switches.

all the best, christoph

ousia commented 10 years ago

The documentation says "lang: language code for HTML or LaTeX documents".

So the documentation is completely right: you can use it for one or the other, just not for both at the same time.

@maybegeek: In my opinion, if this isn’t a bug in pandoc, it should be improved.

The document language attribute is also relevant (at least) for:

Not abstracting the lang value to fit all possible output formats that make use of it, in my opinion, is a missing implementation.

ousia commented 10 years ago

@maybegeek, I’m afraid I found a bug.

This source markdown document:

---
title: Titel
language: de-DE
...

# Kapitel

Mein Text

is converted by pandoc into the following standalone html document:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title>Mein Titel</title>
  <style type="text/css">code{white-space: pre;}</style>
</head>
<body>
<div id="header">
<h1 class="title">Mein Titel</h1>
</div>
<h1 id="titel">Titel</h1>
</body>
</html>

Where is the language attribute in the html document? I am missing the xml:lang="de-DE" attribute on either the <html> or the <body> element.

I use pandoc 1.12.3.3. I wonder whether this has been fixed in a later version.

maybegeek commented 10 years ago

Hi Pablo,

not so fast :)

language != lang


---
title: Titel
lang: de-DE
...

# Kapitel

Mein Text

is working: pandoc -s --to html5 input.md -o output.htm

Best regards from Regensburg (Bavaria)

christoph

jgm commented 10 years ago

See the documentation (README) on variables used in templates. (Or just look at the HTML template itself.)

lang (not language) is what you use for the language code in HTML or LaTeX documents. language is used for EPUBs specifically (we used the same names as used in Dublin Core metadata). You can set both of them, of course.

That's not to say that there isn't an issue here (the original poster's): you need different settings for lang in LaTeX and HTML, so you can't set this in the metadata if you need both formats. Some kind of automatic conversion would be convenient.

ousia commented 10 years ago

@jgm, I think that a way to solve this would be to have a file that converts language from its HTML value to the LaTeX value.

It would be something similar to:

en -> english
en-US -> USenglish
en-UK -> UKenglish

Of course, I don’t know what the best format for pandoc is.

Could you provide the right format for the minimal sample above?

So, I could provide the file with the full list of languages supported by LaTeX.

Many thanks for your help.
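
As a rough illustration only, the `tag -> name` lines above could be read into a lookup table as follows. This is a Python sketch under assumptions: the function name `parse_language_map` is hypothetical, and it says nothing about the format pandoc itself would prefer.

```python
def parse_language_map(text):
    """Parse lines of the form 'tag -> latexname' into a dict (hypothetical helper)."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "->" not in line:
            continue  # skip blank or malformed lines
        tag, name = (part.strip() for part in line.split("->", 1))
        mapping[tag] = name
    return mapping


# The minimal sample from this comment, copied verbatim.
SAMPLE = """
en -> english
en-US -> USenglish
en-UK -> UKenglish
"""

language_map = parse_language_map(SAMPLE)
print(language_map["en-US"])
```

Splitting on the first `->` only keeps the parser tolerant of extra whitespace around the arrow.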

ousia commented 9 years ago

@jgm, I attach a list that contains the equivalences between ISO-639 language codes used in HTML and LaTeX language codes.

I avoided three-letter language codes as much as I could. But there are languages that have no two-letter code (such as ancient Greek).

af -> afrikaans
af-ZA -> afrikaans
ar -> arabic
bg -> bulgarian
bg-BG -> bulgarian
br -> breton
ca -> catalan
ca-ES -> catalan
cy -> welsh
cy-UK -> welsh
cz -> czech
cz-CZ -> czech
da -> danish
da-DK -> danish
de -> ngerman
de-1901 -> german
de-AT -> naustrian
de-AT-1901 -> austrian
de-DE -> ngerman
dsb -> lowersorbian
el -> greek
el-poly -> greek.polutoniko
en -> english
en-AU -> australian
en-CA -> canadian
en-NZ -> newzealand
en-UK -> british
en-US -> american
eo -> esperanto
es -> spanish
es-ES -> spanish
et -> estonian
et-EE -> estonian
eu -> basque
eu-ES -> basque
fa -> farsi
fa-IR -> farsi
fi -> finnish
fi-FI -> finnish
fr -> french
fr-CA -> canadien
fr-FR -> french
fra-aca -> acadian
fur -> friulan
ga -> irish
ga-IE -> irish
gd -> scottish
gd-UK -> scottish
gl -> galician
gl-ES -> galician
grc -> greek.ancient
he -> hebrew
he-IL -> hebrew
hi -> hindi
hi-IN -> hindi
hr -> croatian
hr-HR -> croatian
hsb -> uppersorbian
hu -> magyar
hu-HU -> magyar
id -> indonesian
id-IN -> indonesian
ie -> interlingua
is -> icelandic
is-IS -> icelandic
it -> italian
it-IT -> italian
jp -> japanese
jp-JP -> japanese
la -> latin
lt -> lithuanian
lt-LT -> lithuanian
lv -> latvian
lv-LV -> latvian
mn -> mongolian
mn-MN -> mongolian
nb -> norsk
nb-NO -> norsk
nl -> dutch
nl-NL -> dutch
nn -> nynorsk
nn-NO -> nynorsk
no -> norsk
no-NO -> norsk
pl -> polish
pl-PL -> polish
pt -> portuguese
pt-BR -> brazilian
pt-PT -> portuguese
rm -> romansh
rm-CH -> romansh
ro -> romanian
ro-RO -> romanian
ru -> russian
ru-RU -> russian
se -> samin
se-FI -> samin
sk -> slovak
sk-SK -> slovak
sl -> slovene
sl-SL -> slovene
sr -> serbian
sv -> swedish
sv-SE -> swedish
th -> thai
th-TH -> thai
tk -> turkmen
tr -> turkish
tr-TR -> turkish
uk -> ukrainian
uk-UA -> ukrainian
vi -> vietnamese
vi-VN -> vietnamese

mpickering commented 9 years ago

Thank you for compiling this list Pedro.

ousia commented 9 years ago

@mpickering, would you be interested in the corresponding list for ConTeXt?

This would be required for issue (#1667).

HughP commented 9 years ago

@ousia In thinking about inclusivity of languages, do your language lists include ISO 639-3 additions or are you limiting your list to ISO 639-1 listings? BCP47 suggests to use the shortest ISO 639 code for a language (I take that to mean that ISO 639-3 is used when there is no corresponding ISO 639-1 code for said language), while the Dublin Core standard points to ISO 639-3.

ousia commented 9 years ago

@HughP, I took the shortest code from the ISO 639-1 listing. But I had to use ISO 639-3 for languages not defined in ISO 639-1 (such as grc and similar ones).

The language list is limited to the languages LaTeX can handle. If anyone wants to add ISO 639-3 codes for the already defined ISO 639-1 codes, that would be fine. But I think this addition would make sense after the first list is implemented in pandoc.
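
Since the list only covers what LaTeX can handle, a lookup would probably need a fallback for tags that are not listed. The following Python sketch shows one possible approach, assuming that dropping trailing subtags is acceptable; `LATEX_NAMES` holds only a few entries copied from the list, and `latex_language` is a hypothetical name, not an existing pandoc function.

```python
# A few entries copied from the list in this thread; a full table would be
# loaded the same way. The lookup is case-sensitive in this sketch.
LATEX_NAMES = {
    "de": "ngerman",
    "de-1901": "german",
    "de-AT": "naustrian",
    "de-AT-1901": "austrian",
    "en": "english",
    "en-US": "american",
    "pt": "portuguese",
    "pt-BR": "brazilian",
}


def latex_language(tag, table=LATEX_NAMES):
    """Return the LaTeX name for tag, dropping trailing subtags until one matches."""
    parts = tag.split("-")
    while parts:
        name = table.get("-".join(parts))
        if name is not None:
            return name
        parts.pop()  # e.g. "de-CH" falls back to "de"
    return None  # the language is not covered by the table
```

With this fallback, a tag such as `de-CH` that has no entry of its own would resolve through `de`, while an unknown language yields `None` so the caller can decide what to do.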

mb21 commented 9 years ago

There's agreement that lang should contain ISO 639 format that is then translated to LaTeX.

But should lang be a list of languages or only one? I think I agree with @ousia that it would be better for lang to contain only one language and serve as a synonym for mainlang. @jgm?

Authors could then use otherlang explicitly to specify a list of other languages. Finally, should otherlang also be in ISO 639 format, even though it's currently only supported by LaTeX which is exactly the format that doesn't use ISO codes?

jgm commented 9 years ago

One possible approach would allow lang to be either a single value or a list.

If a single value, it fills 'lang' and 'otherlang' is empty. If a list, the first item becomes 'lang' and the rest 'otherlang'.
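
A minimal Python sketch of that rule, for illustration only (the function name `split_languages` is hypothetical and this is not pandoc code):

```python
def split_languages(lang):
    """Return (lang, otherlang) from a single value or a list of values."""
    if lang is None:
        return None, []
    if isinstance(lang, str):
        # Single value: it fills lang, and otherlang stays empty.
        return lang, []
    items = list(lang)
    if not items:
        return None, []
    # List: the first item becomes lang, the rest otherlang.
    return items[0], items[1:]
```

For example, `split_languages(["en", "ar", "de"])` gives `("en", ["ar", "de"])`, and `split_languages("de-DE")` gives `("de-DE", [])`.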

mb21 commented 9 years ago

Yeah, I just never understood why the last and not the first language in the list is the mainlang, so why not make it explicit? Also, I suspect it wouldn't be backwards-compatible with existing documents anyway, since those would use LaTeX format for multiple languages, not ISO 639.

ousia commented 9 years ago

@jgm and @mb21,

there is a pending pull request in jgm/pandoc-templates#101, that implements some issues already discussed on the mailing list:

There are two issues pending. I have contacted both babel and polyglossia developers. They are interested in loading languages by ISO 639 codes. In fact, these codes with language-region structure (such as en-GB) seem to be BCP-47 codes, since ISO 639 only refers to languages themselves (I realized that yesterday [@HughP, I owe you an apology]).

But it may take a while before this is implemented. I hope I can include the language synonyms in the LaTeX templates as soon as possible (so that there is no need to wait for the implementation in the packages themselves).

The pull request jgm/pandoc-templates#101 is waiting for review and I hope it may be merged.
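
To make the language-region structure mentioned above concrete, here is a simplified Python sketch that separates the language subtag from a region subtag. It is an assumption-laden illustration: `language_and_region` is a hypothetical name, and the code ignores extension and private-use subtags, so it is not a full BCP 47 parser.

```python
def language_and_region(tag):
    """Split a tag such as 'en-GB' into (language, region); region may be None."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    region = None
    for subtag in subtags[1:]:
        two_letters = len(subtag) == 2 and subtag.isalpha()
        three_digits = len(subtag) == 3 and subtag.isdigit()
        if two_letters or three_digits:
            # Regions are two letters or three digits; scripts (four letters)
            # and variants such as '1901' are skipped by this check.
            region = subtag.upper()
            break
    return language, region
```

Only the first subtag is an ISO 639 code here; the region part comes from a different standard, which is why tags like `en-GB` are better described as BCP 47 tags.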

jgm commented 9 years ago

I would argue that it's best to stick with lang rather than language. If we stick with lang, then we don't need to change existing xml/html templates, which already use lang, and people don't need to change their custom templates or workflows.

Although it's true that we use complete words for other fields, there might even be a reason for using lang instead of language: it is a kind of signal that this field takes technical values like en-US rather than English.

ousia commented 9 years ago

@jgm and @mb21,

jgm commented 9 years ago

+++ Pablo Rodríguez [Aug 20 15 12:32 ]:

@jgm and @mb21,

  • language is more intuitive for non-technical users.
  • I think it is better to have a single language metadata field, not two.
  • language is required for ePub document language, so:
    • language: de-DE would be required for ePub documents.

We could easily switch to using lang here. (We could also set lang to the value of language behind the scenes, if only language is used, to avoid breaking existing documents.)

Note also that lang does appear in the EPUB templates.

  • But if we want the same source to generate other formats, lang: de-DE would also be required. Sorry, but this doesn’t seem reasonable to me.
  • The point here is to reject other values than ISO 639 or BCP-47 formats for languages.

    The issue is not the metadata field name, but the values that it accepts.

Yes, but we need to settle on a single field name.

I think your argument boils down to liking the less technical sounding language. Mine boils down to not wanting to break existing documents.
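
The behind-the-scenes fallback suggested above could look roughly like this. It is a Python sketch on a plain dictionary, assuming metadata is available as key/value pairs; `with_lang_fallback` is a hypothetical name and does not describe how pandoc handles metadata internally.

```python
def with_lang_fallback(metadata):
    """Return a copy of metadata where lang is filled from language if lang is unset."""
    result = dict(metadata)  # leave the caller's metadata untouched
    if not result.get("lang") and result.get("language"):
        result["lang"] = result["language"]
    return result
```

An explicit `lang` always wins, so existing documents that already set it would behave as before, while documents that only set `language` would still get a `lang` value.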

mb21 commented 9 years ago

Right, we should merge the ePUB language and the lang variable since both are BCP47 now.

I don't have a strong opinion on lang vs language, but tend to agree with @jgm: lang is backwards compatible, there's the xml:lang and HTML lang attributes (both BCP47), and eventually we'll have documents like the following:

---
lang: en
otherlangs: [ar]
---

The title in Arabic is [عنوان اول]{dir=rtl lang=ar}.

ousia commented 9 years ago

  • language is required for ePub document language, so:
    • language: de-DE would be required for ePub documents.

We could easily switch to using lang here. (We could also set lang to the value of language behind the scenes, if only language is used, to avoid breaking existing documents.)

Fine for me when either lang or language can be used with the same results.

I think your argument boils down to liking the less technical sounding language. Mine boils down to not wanting to break existing documents.

Of course, I don’t want to break existing compatibility.

My argument isn’t about sounding more or less technical. It is about not mixing markups.

lang is (X)HTML markup. For most users, learning basic text markup may be extremely hard. Mixing markups would make it harder. And for users that aren’t fluent in English, it is even harder.

I agree that a good compromise is to be able to use either lang or language. So there is no broken compatibility.

ousia commented 9 years ago

I don't have a strong opinion on lang vs language, but tend to agree with @jgm: lang is backwards compatible, there's the xml:lang and HTML lang attributes (both BCP47), and eventually we'll have documents like the following:

---
lang: en
otherlangs: [ar]
---

The title in Arabic is [عنوان اول](dir=rtl lang=ar).

@mb21, could you please consider a different language handling?

Your proposal is:

The title in Arabic is [عنوان اول](dir=rtl lang=ar).

But this is already accepted syntax in Markdown:

`a = b`{.variable-assignment} may be Python code

Sorry, but braces should be preferred to parentheses for attribute assignment.

And specific to languages, I see two main issues:

With both issues, your sample would read:

The title in Arabic is [عنوان اول](:ar).

I think that with this proposal (explained in #895), the user has less to type (and there are fewer possibilities to make mistakes).

mb21 commented 9 years ago

@ousia I'm sorry, I meant to write the following (must have been tired, it's corrected above now):

The title in Arabic is [عنوان اول]{dir=rtl lang=ar}.

While `a = b`{.myClass} translates to <code class="myClass">a = b</code>, it has long since been proposed to use [foo]{.myClass} to mean <span class="myClass">foo</span>, yet this hasn't been implemented in the Markdown Reader yet.
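
To illustrate the proposed mapping from [foo]{.myClass} to a span, here is a toy Python sketch. It is only an assumption about how such a conversion might behave, not pandoc's reader: the names `SPAN`, `render_attrs` and `spans_to_html` are hypothetical, attribute values containing spaces are not supported, and nested brackets are not handled.

```python
import html
import re

# Matches [text]{attributes}; deliberately naive (no nesting, no escapes).
SPAN = re.compile(r"\[([^\]]*)\]\{([^}]*)\}")


def render_attrs(attr_text):
    """Turn '.cls #id key=value' tokens into an HTML attribute string."""
    classes, ident, pairs = [], None, []
    for token in attr_text.split():
        if token.startswith("."):
            classes.append(token[1:])
        elif token.startswith("#"):
            ident = token[1:]
        elif "=" in token:
            key, value = token.split("=", 1)
            pairs.append((key, value.strip('"')))
    out = []
    if ident:
        out.append('id="{}"'.format(html.escape(ident)))
    if classes:
        out.append('class="{}"'.format(html.escape(" ".join(classes))))
    for key, value in pairs:
        out.append('{}="{}"'.format(key, html.escape(value)))
    return " ".join(out)


def spans_to_html(text):
    """Replace every [text]{attrs} occurrence with an HTML span."""
    def replace(match):
        attrs = render_attrs(match.group(2))
        open_tag = "<span {}>".format(attrs) if attrs else "<span>"
        return "{}{}</span>".format(open_tag, html.escape(match.group(1)))
    return SPAN.sub(replace, text)
```

Under these assumptions, `spans_to_html("[foo]{.myClass}")` produces `<span class="myClass">foo</span>`, and key/value pairs such as `dir=rtl lang=ar` become ordinary HTML attributes on the span.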

As for your other points, I've answered at #2191.

mb21 commented 9 years ago

But back to the current hold-up: I think we should have language only as a legacy fallback for the ePUB metadata. Having lang and language as complete synonyms only confuses people. More opinions?

ousia commented 9 years ago

@ousia I'm sorry, I meant to write the following (must have been tired, it's corrected above now):

The title in Arabic is [عنوان اول]{dir=rtl lang=ar}.

@mb21, sorry for my misunderstanding. I thought there was a new syntax copied from CommonMark.

ousia commented 9 years ago

But back to the current hold-up: I think we should have language only as a legacy fallback for the ePUB metadata. Having lang and language as complete synonyms only confuses people. More opinions?

@mb21, I think the right option would be to make both names full synonyms, explaining that lang is only kept to avoid breaking backwards compatibility and advising users that the new field language should be used for new documents and templates.