suttacentral / publications

SuttaCentral books: make HTML, EPUB, PDF
Creative Commons Zero v1.0 Universal

hyphenating pali and sanskrit #100

Closed by sujato 2 years ago

sujato commented 2 years ago

To handle the hyphenation of Pali and Sanskrit, here is our plan.

Please ignore everything I've said previously about Pali and Sanskrit, e.g. langpli etc.! We still use that method for Chinese, which polyglossia does not support. But Chinese text has no hyphenation, so that is a different problem space.

use polyglossia

We load the polyglossia package. This is designed as a modern system for handling multilingual Unicode documents.

Then, after all packages are loaded, we define the languages:

\setdefaultlanguage[]{english}
\setotherlanguage[script=Latin]{sanskrit}

Since there is no explicit Pali support, and since Pali and Sanskrit are very similar, we use the Sanskrit hyphenation patterns. They seem to work just fine!

Then, in the text itself, we simply do:

\textsanskrit{Nikāya}

application

It's a bit tricky to ensure we apply it everywhere it's needed. The following should cover most cases.

Summary of Contents

\item[MN 1: The Root of All Things — \textit{\textsanskrit{Mūlapariyāyasutta}}]

marked-up terms

\textit{\textsanskrit{upajjhāya}}

Note that such terms are meant to be displayed in italics, so the above markup is required. But normally, for names, etc., just use \textsanskrit{foo}.

un-marked-up terms

There are quite a few unmarked Pali words, especially proper names (of people, places, or texts). To mark all of these is quite tricky. However, since most Pali words, especially long ones, contain diacritical marks, we can catch most of them with a simple regex.

\b(?=\w*[āīūṭḍṁṅñṇḷśṣṛ])\w+\b

This will get most of them. Then, unless someone has a better idea, we can simply make a hard-coded list of proper names that have no diacriticals. I made a list; it is attached in the comment below.
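As a rough illustration of the two steps above, here is a minimal Python sketch. It assumes plain-text input that contains no existing \textsanskrit markup. The function name mark_pali is hypothetical, and the four names are illustrative stand-ins, not the contents of the attached list. The re.IGNORECASE flag is an addition that is not part of the regex above; without it, a word whose only diacritical is a capital letter (such as Ānanda) would be missed.

```python
import re

# The regex from the issue: a whole word containing at least one diacritical.
# re.IGNORECASE is an assumption added here so capital diacriticals also count.
DIACRITIC_WORD = re.compile(r"\b(?=\w*[āīūṭḍṁṅñṇḷśṣṛ])\w+\b", re.IGNORECASE)

# Illustrative stand-in for the hard-coded list of names without diacriticals.
UNDIACRITICAL_NAMES = ["Gotama", "Kosala", "Magadha", "Sakka"]
NAME_WORD = re.compile(
    r"\b(?:" + "|".join(re.escape(n) for n in UNDIACRITICAL_NAMES) + r")\b"
)


def wrap(match):
    # Return the matched word wrapped in the polyglossia language command.
    return r"\textsanskrit{" + match.group(0) + "}"


def mark_pali(text):
    # Pass 1: words with diacriticals. Pass 2: listed names without them.
    text = DIACRITIC_WORD.sub(wrap, text)
    return NAME_WORD.sub(wrap, text)


print(mark_pali("Ānanda went to Sāvatthī in Kosala."))
# \textsanskrit{Ānanda} went to \textsanskrit{Sāvatthī} in \textsanskrit{Kosala}.
```

Running the name pass second is safe here because a listed name cannot match inside a longer word that the first pass already wrapped: the \b anchors require a word boundary on both sides.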

Warning: The regex must be set up to only capture what is needed. So it must exclude:
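One way to keep the match narrow can be sketched in Python. The sketch assumes, without this issue confirming it, that terms already wrapped in \textsanskrit{...} are one case to skip, and that such groups contain no nested braces. The function name mark_unmarked is hypothetical, and re.IGNORECASE is again an added assumption. The pattern matches an existing group first and returns it unchanged, so only bare words get wrapped.

```python
import re

# Alternative 1 (captured): an existing \textsanskrit{...} group, left untouched.
# Alternative 2: the diacritical-word regex from the issue.
PATTERN = re.compile(
    r"(\\textsanskrit\{[^{}]*\})|\b(?=\w*[āīūṭḍṁṅñṇḷśṣṛ])\w+\b",
    re.IGNORECASE,
)


def mark_unmarked(text):
    def repl(match):
        if match.group(1):
            # Already marked up: return it exactly as found.
            return match.group(1)
        return r"\textsanskrit{" + match.group(0) + "}"

    return PATTERN.sub(repl, text)


sample = r"\textit{\textsanskrit{upajjhāya}} and Nikāya"
print(mark_unmarked(sample))
# \textit{\textsanskrit{upajjhāya}} and \textsanskrit{Nikāya}
```

Because marked-up groups pass through unchanged, running the function twice gives the same result as running it once, which makes it safer to re-run over files that are only partly converted.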

things that don't have to be done

The following don't seem to be a problem:

It probably wouldn't hurt to add \textsanskrit to these, but it serves no purpose and might create side-effects. Let's leave them for now.

sujato commented 2 years ago

undiacritical.zip