sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

BOP meta-line/iast conversion #202

Closed funderburkjim closed 6 years ago

funderburkjim commented 6 years ago

This issue is devoted to the conversion of bop.txt, the Cologne digitization of Bopp's Glossarium Sanscritum, a Sanskrit-Latin dictionary.

funderburkjim commented 6 years ago

The conversions have now been completed.

funderburkjim commented 6 years ago

Additional headwords

Six missed headwords were discovered and recoded with decimal L-numbers.

funderburkjim commented 6 years ago

IAST conversion

Generally Sanskrit words are presented in Devanagari within the text; this includes headwords.

However, some words appearing in the Latin alphabet have letters with diacritics.
Some of these words are related to Sanskrit words, such as Vêdorum. No attempt has been made to impose a 'modern IAST' spelling on such words; e.g., in Vêdorum we leave the circumflex diacritic as printed.

There are also numerous words in other languages in what appear to be etymological comments. For instance, there are 135 instances of 'russ.', presumably indicating related or cognate Russian words.

The spelling of these words has also been coded with Unicode characters that aim to approximate the diacritics of the printed text.

funderburkjim commented 6 years ago

Markup

The digitization recognizes the line breaks of the text. New lines of text are generally marked as <div n="lb">.

The original digitization also identified the prefixes occurring within roots, and these lines have been marked as <div n="pfx">. For example, under root gam:

[image]

There are about 1400 instances of Greek text; the Greek is uncoded and is marked as <lang n="greek"></lang>.
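
The counts quoted in this comment can be checked with a few regular expressions. This is only a minimal sketch: the tag strings are the ones named above, but the sample text is invented for illustration and the function name `count_markup` is hypothetical, not part of any Cologne tooling.

```python
import re
from collections import Counter

# Markup strings named in this comment.
PATTERNS = {
    "lb": re.compile(r'<div n="lb">'),
    "pfx": re.compile(r'<div n="pfx">'),
    "greek": re.compile(r'<lang n="greek"></lang>'),
}

def count_markup(text):
    """Return how often each markup pattern occurs in text."""
    return Counter({name: len(p.findall(text)) for name, p in PATTERNS.items()})

# Invented sample; real input would be the contents of bop.txt.
sample = (
    '<div n="lb">lorem <lang n="greek"></lang> ipsum\n'
    '<div n="pfx">lorem ipsum\n'
    '<div n="lb">dolor <lang n="greek"></lang>\n'
)
counts = count_markup(sample)
print(counts["lb"], counts["pfx"], counts["greek"])
```

Run against the full file, `counts["greek"]` should land near the figure of about 1400 given above, if the empty-element form is used consistently.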

There are about 50 footnotes in the entries. The original digitization has been rearranged using the same strategy as used for footnotes in krm. Here is the display for the footnote under headword akza: [image]

funderburkjim commented 6 years ago

Enhancement suggestion: line breaks

As with other dictionaries coded line-by-line, the resulting digitization might be more useful if the hyphenated words were presented in unhyphenated form. This could be done without information loss by using the <lbinfo n="N"/> markup idea used in Burnouf and other dictionaries.
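
A rough sketch of how such a dehyphenation could work without losing information. Several things here are assumptions rather than the convention actually used in Burnouf: that each continuation line begins with <div n="lb">, that a hyphenated line ends in a plain "-", and that N records the number of characters moved up from the next line. The sample lines are invented.

```python
import re

LB = '<div n="lb">'
# Leading word fragment on a continuation line (letters only; an assumption).
FRAGMENT = re.compile(r"[^\W\d_]+")
LBINFO = re.compile(r'<lbinfo n="(\d+)"/>$')

def dehyphenate(lines):
    """Join words hyphenated across line breaks, recording the split in <lbinfo/>."""
    out = list(lines)
    for i in range(len(out) - 1):
        cur, nxt = out[i], out[i + 1]
        if not (cur.endswith("-") and nxt.startswith(LB)):
            continue
        m = FRAGMENT.match(nxt, len(LB))
        if m is None:
            continue
        frag = m.group(0)
        # n = number of characters moved up from the next line (hypothetical convention).
        out[i] = f'{cur[:-1]}{frag}<lbinfo n="{len(frag)}"/>'
        out[i + 1] = LB + nxt[m.end():]
    return out

def rehyphenate(lines):
    """Undo dehyphenate(), restoring the printed line division."""
    out = list(lines)
    for i in range(len(out) - 1):
        m = LBINFO.search(out[i])
        if m is None:
            continue
        n = int(m.group(1))
        joined = out[i][:m.start()]
        rest = out[i + 1][len(LB):]
        out[i] = joined[:-n] + "-"
        out[i + 1] = LB + joined[-n:] + rest
    return out

# Invented sample lines, not taken from bop.txt.
sample = ['ire, se move-', '<div n="lb">re adire.']
joined = dehyphenate(sample)
print(joined)
```

The round trip `rehyphenate(dehyphenate(sample)) == sample` is what "without information loss" amounts to in this sketch. One caveat: a hyphen that belongs to a genuinely hyphenated compound would also be dropped in the joined form, so such cases would need review.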

funderburkjim commented 6 years ago

Enhancement suggestion: abbreviations

The preface of the printed text contains two pages of abbreviations. These pages are also part of the bop.txt digitization. This list could be used as a guide to applying the <ls>X</ls> and <ab>X</ab> markup (for literary sources and general abbreviations).
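
As an illustration of the general-abbreviation half of this idea, here is a minimal sketch that wraps known abbreviations in <ab>X</ab>. The function name `tag_abbreviations` is hypothetical; 'russ.' is taken from the comment above, while 'lith.' and the sample sentence are placeholders, not entries checked against the printed abbreviation list.

```python
import re

def tag_abbreviations(text, abbreviations):
    """Wrap each known abbreviation in <ab>...</ab>, leaving existing <ab> spans alone."""
    if not abbreviations:
        return text
    # Longest first, so a longer abbreviation wins over one that is its prefix.
    alternatives = "|".join(
        re.escape(a) for a in sorted(abbreviations, key=len, reverse=True)
    )
    # Group 1 catches spans that are already tagged; they are returned unchanged.
    pattern = re.compile(rf"(<ab>.*?</ab>)|(?<!\w)(?:{alternatives})")

    def repl(m):
        return m.group(0) if m.group(1) else f"<ab>{m.group(0)}</ab>"

    return pattern.sub(repl, text)

# Placeholder sentence; X, Y, Z stand in for cited words.
sample = "Cf. russ. X et lith. Y; <ab>russ.</ab> Z"
tagged = tag_abbreviations(sample, {"russ.", "lith."})
print(tagged)
```

Literary sources could be handled the same way with <ls>, though source references usually carry numbers after the abbreviation and would need a more careful pattern.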

funderburkjim commented 6 years ago

Enhancement suggestion: prefixed forms headwords

Additional headwords could be generated for the prefixes associated with root entries. The regularity of the coding following the already present <div n="pfx"> markup would solve the primary problem of identification. Of course, the problem of sandhi between the prefix and the root would also need to be solved.

funderburkjim commented 6 years ago

Enhancement suggestion: corrections and additions

There are six pages of ADDENDA ET EMENDANDA.

The bop.txt digitization also contains these additions and corrections. They could be applied to the digitization entries. We have not thus far developed a satisfactory markup scheme for such a task; the approach used in a similar task for MW should be examined as a guideline whenever this task is undertaken for BOP.

funderburkjim commented 6 years ago

These are all the comments that come to mind regarding the BOP conversion.

gasyoun commented 6 years ago

Give Greek to the Greeks between us, Jim!