jgm / pandoc

Universal markup converter
https://pandoc.org

LaTeX reader: improve parsing of otherlanguage environment #9202

Closed jgm closed 9 months ago

jgm commented 9 months ago
\begin{otherlanguage}{english}
Here's a div in English. Code is ignored: \texttt{baoeuthasoe}. So are
\href{http://example.com/notaword}{URLs}.
\end{otherlanguage}

is being parsed as

[ Div
    ( "" , [ "otherlanguage" ] , [] )
 [ Para
        [ Span ( "" , [] , [] ) [ Str "english" ]
        , SoftBreak
...

Instead, pandoc should recognize {english} as an argument to the environment and populate the lang attribute (not with english but with en).
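The requested behaviour can be sketched roughly as follows, in Python rather than pandoc's Haskell. The `BABEL_TO_BCP47` table and the `otherlanguage_lang` helper are hypothetical names used only for illustration, and the table covers just a few languages; this is not pandoc's actual reader code.

```python
import re

# Illustrative subset only (an assumption); a real table would cover
# every Babel/Polyglossia language name.
BABEL_TO_BCP47 = {
    "english": "en",
    "french": "fr",
    "spanish": "es",
    "italian": "it",
}

# Matches \begin{otherlanguage}[options]{lang} and the starred variant,
# capturing the mandatory {lang} argument.
OTHERLANGUAGE = re.compile(
    r"\\begin\{otherlanguage\*?\}(?:\[[^\]]*\])?\{([^}]*)\}"
)


def otherlanguage_lang(latex):
    """Return the BCP-47 tag of the first otherlanguage environment, or None."""
    match = OTHERLANGUAGE.search(latex)
    if match is None:
        return None
    # Unknown language names also yield None rather than a guessed tag.
    return BABEL_TO_BCP47.get(match.group(1).strip().lower())
```

With this sketch, `otherlanguage_lang(r"\begin{otherlanguage}{english}")` gives `"en"`, which is the value that would go into the lang attribute of the Div instead of a stray Span containing "english".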

pauloney commented 9 months ago

John, it is not just the command

\begin{otherlanguage}{english}

there are quite a few more ways to choose the language in Babel and Polyglossia. I can list them all in here for you.

It is also not just english --> en: a number of languages have specific names in Babel/Polyglossia, while Aspell uses BCP-47 language tags. I can work out that conversion table for you as well. It is not "readily" available in Polyglossia, as I mentioned, but it can be deduced from the package files. I just want to find an automated way to do it, so that it keeps working with future Polyglossia distributions.

pauloney commented 9 months ago

Here are what I believe are all the possible ways to set and use a language in LaTeX:

Babel:

Setting:

\documentclass ‣ \documentclass[⟨lang⟩]{article}
hyperref ‣ \usepackage[pdflang=es-MX]{hyperref}
\DocumentMetadata ‣ \DocumentMetadata{lang=es-MX}
\PassOptionsToPackage ‣ \PassOptionsToPackage{main=english}{babel}
\usepackage[⟨lang⟩]{babel}
\usepackage[english,russian,french]{babel} % default lang is the last one.
\usepackage[main=english,russian,french]{babel} % main selection key use.
\usepackage[georgian, provide=*]{babel}
\babelprovide[import]{thai}
\babelprovide[import,main]{arabic} % main is arabic
\babeltags ‣ \babeltags{de = german}

Using:

\selectlanguage ‣ \selectlanguage[⟨options⟩]{⟨lang⟩}
\foreignlanguage ‣ \foreignlanguage[⟨options⟩]{⟨lang⟩}{⟨…⟩}
otherlanguage (env.) ‣ \begin{otherlanguage}[⟨options⟩]{⟨lang⟩} … \end{otherlanguage}
otherlanguage* (env.) ‣ \begin{otherlanguage*}[⟨options⟩]{⟨lang⟩} … \end{otherlanguage*}
\text⟨lang⟩ ‣ \text⟨lang⟩{...} % If \babeltags is set.
⟨lang⟩ (env.) ‣ \begin{⟨lang⟩} ... \end{⟨lang⟩} % If \babeltags is set.

Polyglossia:

Setting:

\setdefaultlanguage ‣ \setdefaultlanguage[⟨options⟩]{⟨lang⟩}
\setmainlanguage ‣ \setmainlanguage[⟨options⟩]{⟨lang⟩}
\resetdefaultlanguage ‣ \resetdefaultlanguage[⟨options⟩]{⟨lang⟩}
\setlanguagealias ‣ \setlanguagealias[⟨options⟩]{⟨language⟩}{⟨alias⟩}
\setlanguagealias* ‣ \setlanguagealias*[⟨options⟩]{⟨language⟩}{⟨alias⟩}

Using:

\text⟨lang⟩ ‣ \text⟨lang⟩[⟨options⟩]{...}
\textlang ‣ \textlang[⟨options⟩]{⟨lang⟩}{...} 
⟨lang⟩ (env.) ‣ \begin{⟨lang⟩}[⟨options⟩] ... \end{⟨lang⟩}
⟨alias⟩ (env.) ‣ \begin{lang}{⟨alias⟩} ... \end{lang} % If \setlanguagealias is set.

jpcirrus commented 9 months ago

Adding to @pauloney's above comment. Babel, together with other packages, also recognizes languages set in the options to \documentclass, with the last listed language being the main language. Since babel 3.49 (2020-10-03) these can then be used with \usepackage[package,options,provide*=*]{babel}, which works with \babelprovide{} and automatically sets the options import and main (section 1.13 of babel manual).
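The "last listed language is the main one, unless main= is given" rule described in this thread could be sketched as below. This is a minimal Python illustration, not babel's or pandoc's implementation; `main_language` and the `KNOWN` set are hypothetical names, and the set of recognised language names is an assumption that a real tool would take from its full language table.

```python
# Hypothetical set of recognised language option names (assumption).
KNOWN = {"english", "russian", "french", "georgian"}


def main_language(options, known_languages):
    """Pick the main language from a list of babel-style options.

    An explicit main=<lang> wins; otherwise the last option that is a
    known language name is used. Returns None if no language is found.
    """
    listed = []
    for option in options:
        key, _, value = (part.strip() for part in option.partition("="))
        if key == "main" and value:
            return value
        # Bare options such as "french"; key=value pairs like
        # "provide=*" are not language names and are skipped.
        if not value and key in known_languages:
            listed.append(key)
    return listed[-1] if listed else None
```

For example, `main_language(["english", "russian", "french"], KNOWN)` selects `"french"`, while adding `main=english` to the list selects `"english"`. Passing the known names in as a parameter keeps the rule itself separate from whichever language table is used.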

pauloney commented 9 months ago

Thanks @jpcirrus! I reviewed my list after your comments.

pauloney commented 9 months ago

Here is the spreadsheet containing:

  1. Languages supported by Babel
  2. Languages supported by Polyglossia
  3. The BCP-47 code of each one.
  4. If it is supported by Aspell
  5. If it is supported by Hunspell

I added Hunspell because it is a better speller and there is way more development there now, and the set of supported languages is slightly different. Having an option to use either (or both) would be really nice.

The Babel list has just the names of the languages; the Polyglossia one is more detailed because of the variants. Most of them are not important for the choice of language (one can, for the most part, spell-check an es-MX file with an es-ES dictionary), but some really matter: for example, both Aspell and Hunspell have pt-PT and pt-BR dictionaries.

BCP-47 is certainly the best way to pass a parameter from LaTeX to Pandoc to Aspell, so that is included as well.
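The dictionary choice described above (exact tag if available, otherwise another regional variant of the same language) might look roughly like this. It is a Python sketch under stated assumptions: `pick_dictionary` is a hypothetical helper, and the list of available dictionaries is supplied by the caller rather than queried from Aspell or Hunspell.

```python
def pick_dictionary(tag, available):
    """Choose a spelling dictionary for a BCP-47 tag.

    Prefers an exact (case-insensitive) match, then falls back to any
    dictionary sharing the primary language subtag, e.g. es-ES for
    es-MX. Returns None if nothing fits.
    """
    wanted = tag.lower()
    by_lower = {name.lower(): name for name in available}
    if wanted in by_lower:
        return by_lower[wanted]
    primary = wanted.split("-")[0]
    # Sorted so the fallback choice is deterministic.
    for name in sorted(available):
        if name.lower().split("-")[0] == primary:
            return name
    return None
```

With `["es-ES", "pt-PT", "pt-BR"]` available, an es-MX document falls back to `"es-ES"`, while pt-BR gets its own dictionary. A real tool might want to disable the fallback for pairs such as pt-PT and pt-BR, where the regional difference matters.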

Supported_Languages.txt

Supported_Languages.ods

jgm commented 9 months ago

Here is the code we use to do these conversions: https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/LaTeX/Lang.hs#L14-L237 If you notice omissions, perhaps do a PR so we can update?

pauloney commented 9 months ago

John, this is great! I am not able to follow up on all the details of the code because of my limited Haskell skills, but the logic down in the languages looks all right.

Is there a way I can do some quick tests, command line or small files? I want to check if things are indeed correct and complete -- in making the list I found at least two wrong BCP-47 tags in Aspell.
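One possible way to run such quick checks, assuming pandoc is installed and on the PATH, is to feed a small LaTeX snippet to pandoc and inspect the native AST it produces. The sketch below wraps that in Python; `latex_to_native` and `SAMPLE` are hypothetical names, and only pandoc's standard `--from` and `--to` options are used.

```python
import shutil
import subprocess

# Small test input; any of the constructs listed in this thread
# could be substituted here.
SAMPLE = r"""\begin{otherlanguage}{english}
Here's a div in English.
\end{otherlanguage}
"""


def latex_to_native(latex):
    """Run pandoc on a LaTeX snippet and return its native AST as text."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        ["pandoc", "--from", "latex", "--to", "native"],
        input=latex,          # snippet is passed on standard input
        capture_output=True,
        text=True,
        check=True,           # raise if pandoc exits with an error
    )
    return result.stdout
```

Calling `print(latex_to_native(SAMPLE))` shows how the snippet is parsed, so the lang attribute on the resulting Div or Span can be compared against the expected BCP-47 tag.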