INM meta/iast conversion

funderburkjim commented 6 years ago

This issue is for comments regarding the conversion of the Cologne digitization inm.txt of the work Index to the Names in the Mahabharata.

funderburkjim commented 6 years ago

IAST issues

The text always uses the Latin alphabet with diacritics for Sanskrit words. Generally, the conventions of the text agree with modern IAST, with these differences:

We use 'ṃ' for anusvara, instead of the author's 'ṁ'.
We use 'ṣ' for the cerebral sibilant, instead of the author's 'sh'.
We use 'ś' for the palatal sibilant, instead of the author's 'ç'.

The conversion of 'sh' to 'ṣ' is incomplete. This conversion must be restricted to Sanskrit words, to avoid undesired conversions in English words such as 'should'. For the 'sh' conversion, the following assumptions were used:

raw headwords [ beginning of first entry line {@X@}¦] are Sanskrit.
words in italics are Sanskrit.
words which have a diacritic (i.e. letter-number in original AS coding) are Sanskrit.

In these cases, it is safe to change 'sh' to 'ṣ'.

I'm sure some 'sh' conversions in Sanskrit words were missed (such as words or abbreviations which are not in italics and don't have a diacritic).
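For illustration only, here is a minimal sketch of how the three assumptions above could be applied to one line of text. It is not the conversion script actually used. Assumptions that are mine: the text is already Unicode (so "has a diacritic" is tested as "contains a non-ASCII letter" rather than AS letter-number codes), italics are marked as `{%...%}`, capital 'Sh' becomes 'Ṣ', and the function names are hypothetical.

```python
import re

# Assumed markup: {@X@}¦ opens the first entry line; {%...%} is taken to be italics.
HEADWORD = re.compile(r'^\{@(.*?)@\}¦')
ITALIC = re.compile(r'\{%(.*?)%\}')
WORD = re.compile(r'[^\W\d_]+')  # a run of (Unicode) letters


def sh_to_s(text):
    """Unconditionally replace 'sh' by 'ṣ' (and 'Sh' by 'Ṣ')."""
    return text.replace('Sh', 'Ṣ').replace('sh', 'ṣ')


def convert_line(line):
    """Change 'sh' to 'ṣ' only where the word is presumed to be Sanskrit."""
    # 1. Raw headword at the beginning of the first entry line.
    line = HEADWORD.sub(lambda m: '{@' + sh_to_s(m.group(1)) + '@}¦', line, count=1)
    # 2. Words in italics.
    line = ITALIC.sub(lambda m: '{%' + sh_to_s(m.group(1)) + '%}', line)

    # 3. Any word that contains a diacritic (approximated as a non-ASCII letter).
    def fix_word(m):
        word = m.group(0)
        return sh_to_s(word) if any(ord(c) > 127 for c in word) else word

    return WORD.sub(fix_word, line)


if __name__ == '__main__':
    print(convert_line('{@Dhrshta@}¦ he should see {%Vishnu%}, Bhīshma and Purusha.'))
```

In this made-up sample line, 'should' is left alone, and so is 'Purusha', which is the kind of missed case mentioned above: not a headword, not italic, and without a diacritic.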

There are a few (40) cases where a vowel (with or without macron) also has a breve diacritic.
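One possible way to locate such cases, assuming the converted text stores the breve either as a precomposed letter or as the Unicode combining breve (U+0306), is to decompose the text and look at the combining marks attached to each base letter. This is only a sketch; the function name is hypothetical and the actual encoding in inm.txt may differ.

```python
import unicodedata

COMBINING_MACRON = '\u0304'
COMBINING_BREVE = '\u0306'


def breve_letters(text):
    """Return (base_letter, has_macron) for every letter carrying a breve."""
    decomposed = unicodedata.normalize('NFD', text)
    found = []
    i = 0
    while i < len(decomposed):
        base = decomposed[i]
        # Collect the combining marks that follow this base character.
        j = i + 1
        while j < len(decomposed) and unicodedata.combining(decomposed[j]):
            j += 1
        marks = decomposed[i + 1:j]
        if COMBINING_BREVE in marks:
            found.append((base, COMBINING_MACRON in marks))
        i = j
    return found
```

Running this over every line and counting the results would give a figure to compare against the 40 cases mentioned above.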

funderburkjim commented 6 years ago

Sections of the text

The digitization includes not only the main section of entries, but apparently all of the text. There are the following sections:

; TITLE
; FOREWARD
; PREFACE
; ABBREVIATIONS
; CONCORDANCE (33 pages)
; ENTRIES  about 13000 headwords
; ADDITIONS AND CORRECTIONS  (18 pages)
; POSTSCRIPT

Since all of the non-entry sections are digitized (part of inm.txt), it would be feasible to include them in the Front matter section.

funderburkjim commented 6 years ago

Suggested Enhancement: abbreviations

There are digitized sections on abbreviations in the preface. These could provide the basis for <ab> markup that would facilitate tooltips for users.

funderburkjim commented 6 years ago

Possible additional headwords

There are at least two possible sources of additional headwords.

`<div n="HI">`

This markup appears 22 times within entries. For instance, under the headword DanadA:

<L>3353<pc>240-1<k1>DanadA<k2>DanadA
{@Dhanadā,@}¦ a mātṛ. § 615{%u%} (Skanda): IX, {@46<lang n="greek"></lang>,@} 2631.
<div n="HI">{@Dhanadeśvara, Dhanādhigoptṛ, Dhanādhipa,@}
<div n="lb">{@Dhanādhipati@}¦ = Kubera, q.v.
<LEND>

It appears that this is a typographically abbreviated form of four headwords. If these were recoded somehow as separate entries, then about 80-100 additional headwords would be added.
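As a rough sketch of what such a recoding might start from, the following collects the bold names of each `<div n="HI">` group, reading from the HI line up to the first '¦'. It assumes every HI group has the shape of the example above (bold `{@...@}` segments, comma-separated names, terminated by '¦'); the function name is hypothetical and the other 21 instances have not been checked against it.

```python
import re

BOLD = re.compile(r'\{@(.*?)@\}')


def hi_headwords(entry_lines):
    """Return one list of names per <div n="HI"> group in an entry."""
    groups = []
    collecting = False
    buf = []
    for line in entry_lines:
        if line.startswith('<div n="HI">'):
            collecting = True
            buf = []
        if collecting:
            # Only text before the broken bar belongs to the headword list.
            head, bar, _ = line.partition('¦')
            buf.append(head)
            if bar:
                names = []
                for segment in BOLD.findall(' '.join(buf)):
                    names.extend(n.strip() for n in segment.split(',') if n.strip())
                groups.append(names)
                collecting = False
    return groups


if __name__ == '__main__':
    entry = [
        '<L>3353<pc>240-1<k1>DanadA<k2>DanadA',
        '{@Dhanadā,@}¦ a mātṛ. § 615{%u%} (Skanda): IX, {@46<lang n="greek"></lang>,@} 2631.',
        '<div n="HI">{@Dhanadeśvara, Dhanādhigoptṛ, Dhanādhipa,@}',
        '<div n="lb">{@Dhanādhipati@}¦ = Kubera, q.v.',
        '<LEND>',
    ]
    print(hi_headwords(entry))
```

Each extracted name would still need a key in the k1/k2 coding and a copy of the shared text ("= Kubera, q.v.") before it could stand as a separate entry.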

Additions and corrections to Index

In the additions and corrections section, the first, shorter part pertains to the Concordance, and the second, longer part pertains to the index (i.e. to what we have coded as headwords). The formatting of this second part would make it possible to add as new headwords all the entries, whether additions or corrections. There are about 950 such entry-like sections.

Example of correction to Index

aBiBU original entry:

aBiBU entry correction

Example of addition to index

aBiprAya -- does not appear as a headword in the main index, but does appear in the additions and corrections:

funderburkjim commented 6 years ago

Markup peculiarities

`<div n="X">`

This markup can have X as

lb: typical line break. The digitization follows the line breaks of the printed text.
HI: see note above.
P: a line break with indentation.
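A quick tally of the X values could confirm that no others occur. This is a minimal sketch, assuming the attribute is always written exactly as `<div n="X">`; the function name is hypothetical.

```python
import re
from collections import Counter

DIV = re.compile(r'<div n="([^"]+)">')


def div_counts(lines):
    """Count occurrences of each X in <div n="X"> over the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(DIV.findall(line))
    return counts
```

For example, `div_counts(open('inm.txt', encoding='utf-8'))` would tally the whole file, assuming it is UTF-8 encoded.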

`<F>`

Indicates footnotes. There are about 30 instances. Recoded in the style adopted with KRM (#200).

`<sup>`

This is used for superscript text. General functions are:

Homonym indicator within headwords; also references to particular homonyms of headwords
Footnote marker (non-numeric)

`<lang n="greek"></lang>`

Many (9600) instances. However, at least some of these are one or two letters used for some kind of indexing, rather than Greek words; here are two examples from the first page.