sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

GRA meta-line/iast conversion #199

Closed funderburkjim closed 6 years ago

funderburkjim commented 6 years ago

This issue is for notes relating to changes in the digitization gra.txt (Grassmann's Wörterbuch zum Rig Veda).

The general format is changed to be consistent with the so-called meta-line format.
Also, non-ascii characters in the digitization are represented using Unicode. In addition, several other changes are made so that coding conventions of this dictionary are more similar to those in other converted dictionaries.

Some further details of the conversion will be mentioned in subsequent comments.

funderburkjim commented 6 years ago

meta-line conversion

This process is straightforward. We construct a 'meta' line for each previous headword entry. For instance the 4th entry is changed from

<P>{@a4n3ça,@} m., das als Antheil erlangte (s. 1. aç), daher 1) {%Antheil;%} 2) {%Erbtheil;%} 3)
 {%Partei;%} 4) {%der viele Antheile besitzt%} oder {%zu vergeben hat%} und daher 5) Name eines 
der Aditisöhne.

<P1>-as 1) 548,12. 5) 192,4; 218,1; 396,5.

<P1>-m 1) 210,5. 2) 279,4. 3) 102,4.

<P1>-a1ya 3) 112,1.

<P1>-a1 [d]. 4) 440,5; 932,9.

<P1>-a1s 1) 857,3.

to

<L>4<pc>0001<k1>aMSa<k2>a4n3ça,
{@a4n3ça,@}¦ m., das als Antheil erlangte (s. 1. aç), daher 1) {%Antheil;%} 2) {%Erbtheil;%} 3)
 {%Partei;%} 4) {%der viele Antheile besitzt%} oder {%zu vergeben hat%} und daher 5) Name eines 
der Aditisöhne.

<P1>-as 1) 548,12. 5) 192,4; 218,1; 396,5.

<P1>-m 1) 210,5. 2) 279,4. 3) 102,4.

<P1>-a1ya 3) 112,1.

<P1>-a1 [d]. 4) 440,5; 932,9.

<P1>-a1s 1) 857,3.
<LEND>

Note that the meta-line associated with the entry explicitly states that this is the 4th entry (<L>4) and provides the page number and the SLP1 form of the headword <k1>aMSa.

The first entry line is slightly altered (removal of the <P> marker, and insertion of a broken vertical bar).

It should be obvious that the original form can be retrieved from the meta-line form, and we check this invertibility desideratum by a programmatic step.
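
As a rough illustration of that round-trip check, here is a minimal Python sketch. The function names `to_metaline` and `from_metaline` are hypothetical and are not the actual conversion programs; the SLP1 key1 is passed in by hand because its derivation is a separate step.

```python
def to_metaline(entry_lines, lnum, page, k1):
    """Wrap one old-format entry (list of lines) in meta-line form.

    Assumes the first line starts with '<P>{@headword@}'.
    """
    first = entry_lines[0]
    assert first.startswith("<P>{@")
    body = first[len("<P>"):]                  # drop the <P> marker
    k2 = body[len("{@"):body.index("@}")]      # raw headword, still in AS coding
    body = body.replace("@}", "@}¦", 1)        # insert the broken bar
    meta = f"<L>{lnum}<pc>{page}<k1>{k1}<k2>{k2}"
    return [meta, body] + entry_lines[1:] + ["<LEND>"]


def from_metaline(lines):
    """Recover the old-format entry from its meta-line form."""
    body = lines[1].replace("¦", "", 1)        # remove the broken bar
    return ["<P>" + body] + lines[2:-1]        # drop meta-line and <LEND>


old = ["<P>{@a4n3ça,@} m., das als Antheil erlangte", "", "<P1>-as 1) 548,12."]
new = to_metaline(old, 4, "0001", "aMSa")
assert from_metaline(new) == old               # invertibility holds for this entry
```

Running the same comparison over every entry is one way to confirm that nothing was lost in the conversion.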

funderburkjim commented 6 years ago

key2 to SLP1 form

As seen in the above example the 'key2' parameter within the meta-line needs to be further modified from its AS (letter-number) coding <k2>a4n3ça,. After conversion the meta-line is

<L>4<pc>0001<k1>aMSa<k2>a/MSa

The trailing comma is also removed, as is other punctuation (such as parentheses and brackets) in key2 at this stage. Some of these removals probably have significance, which could be examined at a later time and coded in some explicit way. The presence of the broken bar in the first line of the entry provides a hook to facilitate identification of these features in the raw headword form of the text.

There are some interesting details in this conversion from AS to SLP1, which are discussed in the context of the conversion from AS to IAST in the body of the entries.

some alternate headwords

In 240 metalines, there still remain space or comma characters. Some instances:

<L>51<pc>0006<k1>akzi<k2>a/kzi, akzi/
<L>69<pc>0007<k1>agastya<k2>aga/stya, aga/stia
<L>1396<pc>0160<k1>ah<k2>ah, aMh<h>1

Some of these (such as aMh in the third example) could be treated as alternate headwords.
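
For anyone who wants to list these metalines, here is a small Python sketch. The regular expression is inferred only from the examples in this issue (fields `<L>`, `<pc>`, `<k1>`, `<k2>` and an optional `<h>`), and the function names are made up for illustration.

```python
import re

# Field layout inferred from the sample metalines above; <h> is optional.
META_RE = re.compile(
    r"^<L>(?P<L>[^<]+)<pc>(?P<pc>[^<]+)<k1>(?P<k1>[^<]+)"
    r"<k2>(?P<k2>[^<]+)(?:<h>(?P<h>[^<]+))?$"
)


def parse_metaline(line):
    """Return the metaline fields as a dict, or None if the line is not a metaline."""
    m = META_RE.match(line)
    return m.groupdict() if m else None


def alternate_forms(line):
    """Split key2 on commas; more than one item suggests alternate headwords."""
    fields = parse_metaline(line)
    if fields is None:
        return []
    return [part.strip() for part in fields["k2"].split(",") if part.strip()]


print(alternate_forms("<L>1396<pc>0160<k1>ah<k2>ah, aMh<h>1"))
```

A key2 containing a space but no comma would need separate handling; this sketch only splits on commas.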

funderburkjim commented 6 years ago

AS-IAST conversion

This part of the conversion for Grassmann is the most challenging. There is no Devanagari within this text, and there is no distinction in the typeface between Sanskrit and non-Sanskrit words. All Sanskrit words are presented using the Latin alphabet with diacritics. The main difficulty is deciding how the author's particular brand of diacritics should be interpreted into modern IAST spelling conventions.

The author does provide one useful table in this regard, on page 1:

image

The highlighted letters are variants from modern IAST.

Here is the conversion used for these variants:

Grassmann   Modern IAST
r-macron
ḷi
ē           ai
ō           au
n-macron
ç           ś

funderburkjim commented 6 years ago

Accents

In addition to the diacritics shown in the table, the author also uses diacritics for accents. I have found no discussion of this in the text. So the following conclusions are inferential, sometimes based upon a few comparisons with the accents shown in PWG and MW for similar words.

For short vowels, the acute accent is used almost exclusively (probably for udAtta accent). In fact, based on the coding of the digitization, there are 17 or fewer Sanskrit words coded with a grave accent. From examination of the printed text for a few of these, I think they may be miscodings (or correct coding of poor printing) of circumflexes.

Vowels with circumflex diacritics are abundant. From a very small sample of comparisons to MW/PWG, I concluded that the circumflex represents an acute accent for a long vowel.

There are also a small number of occurrences of the semi-vowels 'v' and 'y' with acute accents; for instance the v in headword 'svar' is presented with an acute accent. My hunch is that these should be considered errors in the author's presentation, but since I have no definite principle for converting these to modern standards, I have left the v-accent and y-accent unchanged.

This table shows all the accent variations, and their conversion. The modern forms for accented 'ai' and accented 'au' are unclear to me -- so I've used 'a + i-acute' and 'a+u-acute'.

Grassmann      Converted IAST form   Comment
á              á
â              ā́                     a-macron + combining acute
í              í
î              ī́                     i-macron + combining acute
ú              ú
û              ū́                     u-macron + combining acute
ŕ              ṛ́                     r-dotbelow + combining acute
r-circumflex   ṝ́                     r-dotbelow-macron + combining acute
é              é
ê              aí                    a + i-acute
ó              ó
ô              aú                    a + u-acute
v-acute        v́                     v + combining acute (left unchanged)
ý              ý

incompleteness of Unicode code points

As per the above table, we must resort to Unicode combining diacritics for some cases for which there is no separate Unicode code point. This is unfortunate, in my view, since these combining characters are difficult to work with.
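
To make the use of combining characters concrete, here is a minimal Python sketch of a few of the mappings from the two tables above. It covers only rows whose source and target are stated in this issue; the name `convert_grassmann` is hypothetical and this is not the actual conversion program.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# Only rows that are fully specified in the tables above.
GRASSMANN_TO_IAST = {
    "\u00e7": "\u015b",           # ç -> ś
    "\u0113": "ai",               # ē -> ai
    "\u014d": "au",               # ō -> au
    "\u00e2": "\u0101" + ACUTE,   # â -> a-macron + combining acute
    "\u00ee": "\u012b" + ACUTE,   # î -> i-macron + combining acute
    "\u00fb": "\u016b" + ACUTE,   # û -> u-macron + combining acute
    "\u00ea": "a\u00ed",          # ê -> a + i-acute
    "\u00f4": "a\u00fa",          # ô -> a + u-acute
}


def convert_grassmann(text):
    """Map Grassmann's letters to the converted IAST forms, one character at a time."""
    # NFC first, so that 'a' + combining circumflex is seen as the single letter â.
    text = unicodedata.normalize("NFC", text)
    return "".join(GRASSMANN_TO_IAST.get(ch, ch) for ch in text)


converted = convert_grassmann("\u00e2")
print([unicodedata.name(ch) for ch in converted])
```

The accented long vowels come out as two code points each (base letter plus U+0301), which is why string length and simple character-by-character comparisons need care after conversion.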

funderburkjim commented 6 years ago

Diacritics in words of other languages

There are 227 instances of Greek text in the digitization. Such words appear with the Greek alphabet in the printed text, but remain uncoded in the digitization. In the final form of the digitization they are represented as <lang n="greek"></lang>.

There are occasional instances of words in other languages; the ones I've noticed are:

Some of these words also use letters with diacritics. In addition to the extended ascii letters which have a role in Sanskrit IAST, there also appear:

°  (\u00b0)     1 := DEGREE SIGN
ã  (\u00e3)     5 := LATIN SMALL LETTER A WITH TILDE
þ  (\u00fe)     1 := LATIN SMALL LETTER THORN
ě  (\u011b)    22 := LATIN SMALL LETTER E WITH CARON
ň  (\u0148)     2 := LATIN SMALL LETTER N WITH CARON
ǎ  (\u01ce)     4 := LATIN SMALL LETTER A WITH CARON
ǐ  (\u01d0)     6 := LATIN SMALL LETTER I WITH CARON
ǔ  (\u01d4)    13 := LATIN SMALL LETTER U WITH CARON
ẓ  (\u1e93)     1 := LATIN SMALL LETTER Z WITH DOT BELOW
ạ  (\u1ea1)     3 := LATIN SMALL LETTER A WITH DOT BELOW
ŷ (\u0177)      1 := LATIN SMALL LETTER Y WITH CIRCUMFLEX

Also ‿ (\u203f) 1086 := UNDERTIE is used in Sanskrit words between vowels in hiatus.
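
A listing like the one above can be produced with a few lines of Python. This is a sketch using the standard `unicodedata` module; the function name `nonascii_report` and the exact column layout are illustrative rather than the script actually used.

```python
from collections import Counter
import unicodedata


def nonascii_report(text):
    """Return one line per distinct non-ASCII character: char, code point, count, name."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    lines = []
    for ch, n in sorted(counts.items()):
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {n:5d} := {name}")
    return lines


for line in nonascii_report("þ a‿i a‿u"):
    print(line)
```

Note that combining diacritics are counted as characters in their own right, separately from the base letter they attach to.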

Some of these additional letters are used in the words of other languages. The number of cases is small enough to examine by hand, when anyone has the interest in investigating. We could also introduce <lang> markup for these words in other languages.

funderburkjim commented 6 years ago

Other changes in the conversion

Several other changes were made to bring the markup of gra.txt into line with the conventions used in other of the Cologne digitizations.

<div n="X">

Some of the logical divisions (those with typographical distinctiveness) within entries were coded by Thomas, and have been converted using the 'div' tag.

image

funderburkjim commented 6 years ago

<lbinfo n="N">

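As with other dictionaries such as those by Burnouf and Cappeller, Thomas used a vertical bar within a word to indicate that the word was a hyphenated word beginning on one line and ending on the next line. These instances have been recoded by joining the two parts and adding an <lbinfo n="N"/> tag after the word, where N is the number of characters that followed the vertical bar.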

Example:

Prǎpo|sitionen
           8 characters
Prǎpositionen <lbinfo n="8"/>
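
The recoding in this example can be sketched in Python as follows. The pattern assumes the vertical bar only ever appears between two runs of word characters, and the name `recode_line_breaks` is made up for illustration.

```python
import re

# A vertical bar between two runs of word characters marks the line break.
HYPHEN_BREAK = re.compile(r"(\w+)\|(\w+)")


def recode_line_breaks(text):
    """Join each split word and record the length of the part after the break."""
    def repl(match):
        head, tail = match.group(1), match.group(2)
        return f'{head}{tail} <lbinfo n="{len(tail)}"/>'
    return HYPHEN_BREAK.sub(repl, text)


print(recode_line_breaks("Prǎpo|sitionen"))
```

Here `len` counts Unicode code points, so a letter written with a combining diacritic would count as two characters.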

-- converted to emdash

{µXµ} -> X

This was used to indicate words where the inter-letter spacing was wide. In contrast to PW and/or PWG, where such letter-spacing has semantic significance, I judged that the 20 or so instances of this coding in the Grassmann dictionary were without significance, and thus removed this coding.

Remove §

Quite a few (1000) of the section markings of the digitization were preceded by one or two § characters: §§<P>, §<P>, §§<P1>. I could determine no distinctive meaning for these, and hence removed this § character.
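
Both removals (the {µXµ} coding and the § characters) are simple pattern substitutions. A minimal Python sketch, with hypothetical function names and a made-up sample word, assuming § only needs removing directly before a <P> or <P1> marker:

```python
import re


def remove_spacing_markup(text):
    """{µXµ} -> X: drop the wide letter-spacing markers, keep the word."""
    return re.sub(r"\{µ(.*?)µ\}", r"\1", text)


def remove_section_signs(text):
    """Drop one or two § characters that directly precede <P> or <P1>."""
    return re.sub(r"§{1,2}(?=<P1?>)", "", text)


print(remove_spacing_markup("ein {µWortµ} hier"))
print(remove_section_signs("§§<P>abc"))
```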

… ellipsis character retained

In some other digitizations, the ellipsis character … was introduced by Thomas as a sort of markup not present in the text. However, in gra.txt, this character seems to correspond to a certain 'squiggle' character in the print; hence the … character is retained.
For instance, under headword agnidagDa

<L>85<pc>0009<k1>agnidagDa<k2>agni-dagDa/
{@agni-dagdhá,@}¦ a., {%von Feuer verbrannt%} (dah); daher 1) von den verbrannten Leichen, 2) von den vom Blitzstrahl getroffenen; siehe án-agnidagdha.
<div n="P1">-ás 1) 841,14 (yé … yé ánagnidagdhās).       <<<< ELLIPSIS
<div n="P1">-ānām 2) (Ton auf í) 929,15 (im pariśiṣṭa zu 929).
<LEND>

image

funderburkjim commented 6 years ago

6 missing headwords inserted

Headwords akrIQat, dU (hom 1 and hom 2), mar (2nd form), varc, and SaS were not properly coded as headwords, but now have been added (with new decimal L-numbers) in this conversion.

funderburkjim commented 6 years ago

That's all folks!

With my current near-kARa vision, I can get some of these conversions out of the way.

gasyoun commented 6 years ago

added (with new decimal L-numbers

Hurray!

could be treated as alternate headwords

Agree.

This is unfortunate, in my view, since these combining characters are difficult to work with.

Yes, it's hard and dirty. AS has its logic, but not all of it is self-evident. Some of the codepoints will always remain deficient. So be it. Unicode is better than AS, even if not as logical and strict.

The number of cases is small enough to examine by hand, when anyone has the interest in investigating.

Let me give it a try.

funderburkjim commented 6 years ago

Request translation

In preparing another summary of the Grassmann IAST conventions, I came upon page 3 of the foreword. I think some of this is relevant to his coding, including of accents. Could someone with German proficiency provide a translation or paraphrase?

funderburkjim commented 6 years ago

Translation begun

While this translation is of some use, someone with German/English knowledge needs to improve upon the rather awkward english1 translation, using the pdf as the source of German.

gasyoun commented 6 years ago

Could someone with German proficiency provide a translation or paraphrase?

Let's first ask the submitters of errors from Germany - please email them. If none, I will. Was thinking about this page and yes, it is related.

SergeA commented 6 years ago

There are also a small number of occurrences of the semi-vowels 'v' and 'y' with acute accents; for instance the v in headword 'svar' is presented with an acute accent. My hunch is that these should be considered errors in the author's presentation, but since I have no definite principle for converting these to modern standards, I have left the v-accent and y-accent unchanged.

Sv́ar is given as an alternative for súar; it is not an error. The original udatta accent was on U, but by sandhi U changes to V. This actually gives a svarita accent on the following A, commonly marked with a grave accent: svàr.

funderburkjim commented 6 years ago

ŷ Lithuanian?

There is just one instance of ŷ, under headword taMs. From the surrounding text, I conjecture that taṃsŷti is a rendering of a word in the Lithuanian language. Can someone confirm?

Die Grundbedeutung ist aus dem Sanskrit nicht mit Sicherheit zu entwickeln, wol aber aus den
verwandten Sprachen. Im Litauischen ist teṃsti (pr. teṃsiu) „recken, ziehen”, taṃsŷti <lbinfo n="4"/>
(pr. taṃsaū́) „zerren, recken”, im Altpreussischen <lbinfo n="12"/> tiains-twei (2. p. Iv. tens-eiti)„wozuanreizen (zum Zorn, zum Glauben)”, im Gothischen at-pins-an „herbeiziehen
 (<lang n="greek"></lang>)”, im Althochdeutschen dinsan (pr. dans) „ziehen”, im Neuhochdeutschen 
gedunsen„angeschwollen”. Es ist hiernach taṃs aus tan (dehnen) durch Erweiterung 
hervorgegangen<lbinfo n="8"/> und „recken, zerren” als die Grundbedeutung 
<lbinfo n="9"/>anzusehen. 

gasyoun commented 6 years ago

There is just one instance of ŷ, under headword taMs.

Im Litauischen ist teṃsti (pr. teṃsiu) „recken, ziehen”, taṃsŷti (pr. taṃsaū́) „zerren, recken” At http://www.sanskrit-lexicon.uni-koeln.de/scans/GRAScan/2014/web/webtc/servepdf.php?page=0509

tans

I see the ŷ. Have to check etymological dictionary.