hyphenating Pali and Sanskrit

Here is our plan for handling the hyphenation of Pali and Sanskrit.

Please ignore everything I've said previously about Pali and Sanskrit, e.g. langpli etc.! We still use that method for Chinese, which is not supported by polyglossia. But Chinese doesn't have hyphenation, so it's a different problem space.

use polyglossia

We load the polyglossia package. This is designed as a modern system for handling multilingual Unicode documents.

Then, after all packages are loaded, we define the languages:

\setdefaultlanguage[]{english}
\setotherlanguage[script=Latin]{sanskrit}

Since there is no explicit Pali support, and since Pali and Sanskrit are very similar, we use the Sanskrit hyphenation patterns. They seem to work just fine!

Then, in the text itself, we simply do:

\textsanskrit{Nikāya}

application

It's a bit tricky to ensure we apply it everywhere it's needed. The following should cover most cases.

Summary of Contents

[x] Entries should have the form:

\item[MN 1: The Root of All Things — \textit{\textsanskrit{Mūlapariyāyasutta}}]

marked-up terms

[x] Where terms are marked up with the HTML <i lang='pli'> or <i lang='san'>, do the following:

 \textit{\textsanskrit{upajjhāya}}

Note that such terms are meant to be displayed in italics, so the above markup is required. But normally, for names, etc., just use \textsanskrit{foo}.
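
As a rough illustration of the conversion above, here is a minimal Python sketch. It assumes the source HTML is available as a string and that the tags carry no attributes other than lang; the function name italic_terms_to_latex is made up for this example and is not part of our actual tooling.

```python
import re

# Matches <i lang='pli'>...</i> or <i lang='san'>...</i>, with single or
# double quotes around the language code. Assumes no other attributes.
ITALIC_TERM = re.compile(r"""<i\s+lang=(['"])(?:pli|san)\1>(.*?)</i>""", re.DOTALL)


def italic_terms_to_latex(html: str) -> str:
    """Rewrite marked-up Pali/Sanskrit terms as \\textit{\\textsanskrit{...}}."""
    # A function is used as the replacement so the backslashes in the LaTeX
    # commands are not interpreted as regex escapes.
    return ITALIC_TERM.sub(
        lambda m: r"\textit{\textsanskrit{" + m.group(2) + "}}", html
    )


if __name__ == "__main__":
    print(italic_terms_to_latex("He asked his <i lang='pli'>upajjhāya</i>."))
```

Other <i> elements (those without lang='pli' or lang='san') are left untouched, so they can be handled separately.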

un-marked-up terms

There are quite a few unmarked Pali words, especially proper names (of people, places, or texts). To mark all of these is quite tricky. However, since most Pali words, especially long ones, contain diacritical marks, we can catch most of them with a simple regex.

[x] Use a regex something like this:

\b(?=\w*[āīūṭḍṁṅñṇḷśṣṛ])\w+\b

This will get most of them. Then, unless someone has a better idea, we can simply make a hard-coded list of proper names that have no diacriticals. I made a list; it is included in the comment.
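
To make the idea concrete, here is a small Python sketch that combines the diacritics regex with a hard-coded name list. The names in PLAIN_NAMES are placeholders for illustration only (the real list is the one in the comment), wrap_pali is a hypothetical helper name, and the NFC normalization step is an assumption on my part: it only matters if the text might contain decomposed characters, which \w would not match.

```python
import re
import unicodedata

# The regex from above: any word containing at least one listed diacritic.
DIACRITIC_WORD = r"\b(?=\w*[āīūṭḍṁṅñṇḷśṣṛ])\w+\b"

# Placeholder examples only; substitute the real hard-coded list.
PLAIN_NAMES = {"Gotama", "Kosala"}


def wrap_pali(text: str, plain_names=PLAIN_NAMES) -> str:
    """Wrap diacritic-bearing words and listed plain names in \\textsanskrit{}."""
    # Assumption: normalize to NFC so letters like "ā" are single code points.
    text = unicodedata.normalize("NFC", text)
    pattern = DIACRITIC_WORD
    if plain_names:
        # Longest names first, so a longer name wins over a shorter prefix.
        names = sorted(plain_names, key=len, reverse=True)
        alternatives = "|".join(re.escape(n) for n in names)
        pattern += rf"|\b(?:{alternatives})\b"
    return re.sub(pattern, lambda m: r"\textsanskrit{" + m.group(0) + "}", text)


if __name__ == "__main__":
    print(wrap_pali("The Dīgha Nikāya was taught by Gotama in Kosala."))
```

Note that the character class as written only lists lowercase letters, so a word whose only diacritic is a capital would have to go in the name list or the class would need extending. This sketch also does not skip text that is already wrapped.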

:warning: The regex must be set up to capture only what is needed, so it must exclude:

anything that is already defined as Sanskrit (i.e. the cases above)
anything that should not be defined as Sanskrit (i.e. the cases below)
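
One possible way to handle the first exclusion, sketched in Python: match existing \textsanskrit{...} groups first and pass them through unchanged, so only bare words get wrapped. This is an assumption about how the script could work, not a description of what we currently run; wrap_unmarked is a hypothetical name, and the sketch assumes the wrapped argument contains no nested braces. The second exclusion depends on how headings and the like are marked up, so it is not covered here.

```python
import re

TOKEN = re.compile(
    r"(\\textsanskrit\{[^{}]*\})"        # group 1: already wrapped, keep as is
    r"|\b(?=\w*[āīūṭḍṁṅñṇḷśṣṛ])\w+\b"    # otherwise: a bare word with a diacritic
)


def wrap_unmarked(text: str) -> str:
    """Wrap bare diacritic-bearing words, leaving existing wrappers alone."""

    def repl(match: re.Match) -> str:
        if match.group(1):
            # Already inside \textsanskrit{...}: return it untouched.
            return match.group(1)
        return r"\textsanskrit{" + match.group(0) + "}"

    return TOKEN.sub(repl, text)


if __name__ == "__main__":
    print(wrap_unmarked(r"\textit{\textsanskrit{upajjhāya}} in the Nikāya"))
```

Because wrapped terms are skipped, running the function twice over the same text should give the same result as running it once.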

things that don't have to be done

The following don't seem to be a problem:

main sutta headings
page headers
ToC

It probably wouldn't hurt to add \textsanskrit, but it serves no purpose and might create side effects. Let's leave them for now.

suttacentral / publications