Closed funderburkjim closed 6 years ago
This process is straightforward. We construct a 'meta' line for each previous headword entry. For instance the 4th entry is changed from
<P>{@a4n3ça,@} m., das als Antheil erlangte (s. 1. aç), daher 1) {%Antheil;%} 2) {%Erbtheil;%} 3)
{%Partei;%} 4) {%der viele Antheile besitzt%} oder {%zu vergeben hat%} und daher 5) Name eines
der Aditisöhne.
<P1>-as 1) 548,12. 5) 192,4; 218,1; 396,5.
<P1>-m 1) 210,5. 2) 279,4. 3) 102,4.
<P1>-a1ya 3) 112,1.
<P1>-a1 [d]. 4) 440,5; 932,9.
<P1>-a1s 1) 857,3.
to
<L>4<pc>0001<k1>aMSa<k2>a4n3ça,
{@a4n3ça,@}¦ m., das als Antheil erlangte (s. 1. aç), daher 1) {%Antheil;%} 2) {%Erbtheil;%} 3)
{%Partei;%} 4) {%der viele Antheile besitzt%} oder {%zu vergeben hat%} und daher 5) Name eines
der Aditisöhne.
<P1>-as 1) 548,12. 5) 192,4; 218,1; 396,5.
<P1>-m 1) 210,5. 2) 279,4. 3) 102,4.
<P1>-a1ya 3) 112,1.
<P1>-a1 [d]. 4) 440,5; 932,9.
<P1>-a1s 1) 857,3.
<LEND>
Note that the meta-line associated with the entry explicitly states that this is the 4th entry (`<L>4`) and provides the page number and the SLP1 form of the headword (`<k1>aMSa`).
The first entry line is slightly altered (removal of the `<P>` marker, and insertion of a broken vertical bar).
It should be obvious that the original form can be retrieved from the meta-line form, and we check this invertibility desideratum by a programmatic step.
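The round trip can be sketched roughly as follows. This is only an illustration, not the actual conversion script: the function names are hypothetical, and the L-number, page, and SLP1 key1 are passed in by hand on the assumption that they are computed elsewhere.

```python
import re

def to_metaline_form(entry_lines, lnum, page, k1):
    """Convert one old-style entry (a list of lines) to meta-line form.

    lnum, page and k1 are supplied by the caller (assumption); key2 is
    copied unchanged from the raw headword between {@ and @}.
    """
    m = re.match(r'<P>\{@(.*?)@\}(.*)$', entry_lines[0])
    if m is None:
        raise ValueError('first line is not a <P>{@...@} headword line')
    k2, rest = m.group(1), m.group(2)
    meta = f'<L>{lnum}<pc>{page}<k1>{k1}<k2>{k2}'
    first = '{@' + k2 + '@}' + '¦' + rest   # drop <P>, insert broken bar
    return [meta, first] + entry_lines[1:] + ['<LEND>']

def from_metaline_form(lines):
    """Recover the old-style entry: drop meta-line and <LEND>, undo line 1."""
    body = lines[1:-1]
    return ['<P>' + body[0].replace('¦', '', 1)] + body[1:]

old = [
    '<P>{@a4n3ça,@} m., das als Antheil erlangte (s. 1. aç), daher 1) {%Antheil;%}',
    '<P1>-as 1) 548,12. 5) 192,4; 218,1; 396,5.',
]
new = to_metaline_form(old, 4, '0001', 'aMSa')
assert from_metaline_form(new) == old   # the invertibility check
```

The final assertion is the whole point: if the inverse function reproduces the original lines for every entry, no information was lost in the restructuring.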
As seen in the above example the 'key2' parameter within the meta-line needs to be further modified from its AS (letter-number) coding `<k2>a4n3ça,`. After conversion the meta-line is
<L>4<pc>0001<k1>aMSa<k2>a/MSa
The trailing comma is also removed. Other punctuation (such as parentheses and brackets) is removed from key2 at this stage as well. Some of these removals probably have significance, which could be examined at a later time and coded in some explicit way. The presence of the broken bar in the first line of the entry provides a hook to facilitate identification of these features in the raw headword form of the text.
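A minimal sketch of the punctuation cleanup on key2 might look like this. It assumes the AS to SLP1 transcoding has already been applied, and the exact set of characters removed (trailing comma, parentheses, square brackets) is an assumption based on the description above; `clean_key2` is a hypothetical name.

```python
import re

def clean_key2(k2):
    """Strip a trailing comma and drop parentheses/brackets from key2."""
    k2 = k2.strip()
    k2 = re.sub(r',$', '', k2)            # only a comma at the very end
    return re.sub(r'[()\[\]]', '', k2)    # parentheses and square brackets

print(clean_key2('a/MSa,'))          # a/MSa
print(clean_key2('a/kzi, akzi/'))    # a/kzi, akzi/  (internal comma is kept)
```

Because only a trailing comma is stripped, keys that list two forms keep their internal comma and space.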
There are some interesting details in this conversion from AS to SLP1, which are discussed in the context of the conversion from AS to IAST in the body of the entries.
In 240 metalines, there still remain space or comma characters. Some instances:
<L>51<pc>0006<k1>akzi<k2>a/kzi, akzi/
<L>69<pc>0007<k1>agastya<k2>aga/stya, aga/stia
<L>1396<pc>0160<k1>ah<k2>ah, aMh<h>1
Some of these (such as aMh in the third example) could be treated as alternate headwords.
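To list such meta-lines and see the candidate alternate headwords, something along these lines could be used. The regular expression is inferred from the three examples above (including the optional `<h>` field), and the function names are hypothetical.

```python
import re

META_RE = re.compile(
    r'^<L>(?P<L>[^<]*)<pc>(?P<pc>[^<]*)<k1>(?P<k1>[^<]*)'
    r'<k2>(?P<k2>[^<]*)(?:<h>(?P<h>[^<]*))?$'
)

def parse_metaline(line):
    """Split a meta-line into its fields; h is None when absent."""
    m = META_RE.match(line)
    if m is None:
        raise ValueError(f'not a meta-line: {line!r}')
    return m.groupdict()

def needs_review(line):
    """True if key2 still contains a space or a comma."""
    return re.search(r'[ ,]', parse_metaline(line)['k2']) is not None

def alternate_forms(line):
    """Split key2 on commas/spaces into candidate headword forms."""
    k2 = parse_metaline(line)['k2']
    return [part for part in re.split(r'[ ,]+', k2) if part]

print(alternate_forms('<L>1396<pc>0160<k1>ah<k2>ah, aMh<h>1'))  # ['ah', 'aMh']
```

Splitting on spaces is only a first approximation; whether each piece really is an alternate headword would still need to be judged case by case.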
This part of the conversion for Grassman is the most challenging. There is no Devanagari within this text, and there is no distinction in the typeface between Sanskrit and non-Sanskrit words. All Sanskrit words are presented using the Latin alphabet with diacritics. The main difficulty is deciding how the author's particular brand of diacritics should be interpreted into modern IAST spelling conventions.
The author does provide one useful table in this regard, on page 1:
The highlighted letters are variants from modern IAST.
Here is the conversion used for these variants:
Grassman | Modern IAST |
---|---|
ṙ | ṛ |
r-macron | ṝ |
ḷi | ḷ |
ē | ai |
ō | au |
ṅ | ṃ |
n-macron | ṅ |
ç | ś |
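One detail worth illustrating: the table maps ṅ to ṃ and n-macron to ṅ, so applying the replacements one after another in the wrong order would turn n-macron into ṃ. A single-pass substitution avoids this. The sketch below assumes the text is already in Unicode with Grassman's letters (not the AS coding of the raw digitization), and that r-macron and n-macron are stored as base letter plus U+0304; `grassmann_to_iast` is a hypothetical name.

```python
import re

VARIANTS = {
    '\u1e59': '\u1e5b',    # r with dot above        -> r with dot below
    'r\u0304': '\u1e5d',   # r + combining macron    -> r with dot below and macron
    '\u1e37i': '\u1e37',   # l with dot below + i    -> l with dot below
    '\u0113': 'ai',        # e with macron           -> ai
    '\u014d': 'au',        # o with macron           -> au
    '\u1e45': '\u1e43',    # n with dot above        -> m with dot below
    'n\u0304': '\u1e45',   # n + combining macron    -> n with dot above
    '\u00e7': '\u015b',    # c with cedilla          -> s with acute
}

# Longest keys first, so two-character keys win over any shorter overlap.
PATTERN = re.compile(
    '|'.join(re.escape(k) for k in sorted(VARIANTS, key=len, reverse=True))
)

def grassmann_to_iast(text):
    """Apply all variant replacements in one pass over the text."""
    return PATTERN.sub(lambda m: VARIANTS[m.group(0)], text)

print(grassmann_to_iast('a\u1e45\u00e7a'))   # aṃśa
```

Since each stretch of text is matched once, the ṅ produced from n-macron is never revisited and converted again.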
In addition to the diacritics shown in the table, the author also uses diacritics for accents. I have found no discussion of this in the text. So the following conclusions are inferential, sometimes based upon a few comparisons with the accents shown in PWG and MW for similar words.
For short vowels, the acute accent is used almost exclusively (probably for udAtta accent). In fact, based on the coding of the digitization, there are 17 or fewer Sanskrit words coded with a grave accent. From examination of the printed text for a few of these, I think they may be miscodings (or correct coding of poor printing) of circumflexes.
Vowels with circumflex diacritics are abundant. From a very small sample of comparisons to MW/PWG, I concluded that the circumflex represents an acute accent for a long vowel.
There are also a small number of occurrences of the semi-vowels 'v' and 'y' with acute accents; for instance the v in headword 'svar' is presented with an acute accent. My hunch is that these should be considered errors in the author's presentation, but since I have no definite principle for converting these to modern standards, I have left the v-accent and y-accent unchanged.
This table shows all the accent variations, and their conversion. The modern forms for accented 'ai' and accented 'au' are unclear to me -- so I've used 'a + i-acute' and 'a+u-acute'.
Grassman | Converted IAST form | Comment |
---|---|---|
á | á | |
â | ā́ | a-macron + combining acute |
í | í | |
î | ī́ | i-macron + combining acute |
ú | ú | |
û | ū́ | u-macron + combining acute |
ŕ | ṛ́ | r-dotbelow + combining acute |
r-circumflex | ṝ́ | r-dotbelow-macron + combining acute |
é | é | |
ê | aí | |
ó | ó | |
ô | aú | |
v-acute | v́ | v + combining acute |
ý | ý | |
As per the above table, we must resort to Unicode combining diacritics for some cases for which there is no separate Unicode code point. This is unfortunate, in my view, since these combining characters are difficult to work with.
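The accent table can be expressed as a small mapping, which also makes the combining-character issue concrete. This is a sketch under the assumption that the input is Unicode text in Grassman's printed conventions; the names are hypothetical, and forms not listed (á, í, ú, é, ó, ý, v + acute) pass through unchanged.

```python
import unicodedata

ACUTE = '\u0301'  # COMBINING ACUTE ACCENT

ACCENTS = {
    '\u00e2': '\u0101' + ACUTE,   # a-circumflex -> a-macron + combining acute
    '\u00ee': '\u012b' + ACUTE,   # i-circumflex -> i-macron + combining acute
    '\u00fb': '\u016b' + ACUTE,   # u-circumflex -> u-macron + combining acute
    '\u0155': '\u1e5b' + ACUTE,   # r-acute      -> r-dotbelow + combining acute
    '\u00ea': 'a\u00ed',          # e-circumflex -> a + i-acute
    '\u00f4': 'a\u00fa',          # o-circumflex -> a + u-acute
}
TABLE = str.maketrans(ACCENTS)

def convert_accents(text):
    """Convert Grassman's accented vowels to the forms in the table."""
    text = unicodedata.normalize('NFC', text)
    # r-circumflex has no precomposed code point: r + COMBINING CIRCUMFLEX
    text = text.replace('r\u0302', '\u1e5d' + ACUTE)
    return text.translate(TABLE)

converted = convert_accents('\u00e2')
print(len(converted))   # 2: a-macron plus a separate combining acute
# Even NFC normalization cannot merge the pair into one code point:
print(unicodedata.normalize('NFC', 'a\u0304\u0301') == '\u0101\u0301')   # True
```

One practical consequence is that string lengths and character-by-character comparisons no longer line up with what a reader sees, so normalizing to one form (NFC here) before any matching is a reasonable precaution.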
There are 227 instances of Greek text in the digitization. Such words appear with the Greek alphabet in the printed text, but remain uncoded in the digitization. In the final form of the digitization they are represented as `<lang n="greek"></lang>`.
There are occasional instances of words in other languages; the ones I've noticed are:
Some of these words also use letters with diacritics. In addition to the extended ascii letters which have a role in Sanskrit IAST, there also appear:
° (\u00b0) 1 := DEGREE SIGN
ã (\u00e3) 5 := LATIN SMALL LETTER A WITH TILDE
þ (\u00fe) 1 := LATIN SMALL LETTER THORN
ě (\u011b) 22 := LATIN SMALL LETTER E WITH CARON
ň (\u0148) 2 := LATIN SMALL LETTER N WITH CARON
ǎ (\u01ce) 4 := LATIN SMALL LETTER A WITH CARON
ǐ (\u01d0) 6 := LATIN SMALL LETTER I WITH CARON
ǔ (\u01d4) 13 := LATIN SMALL LETTER U WITH CARON
ẓ (\u1e93) 1 := LATIN SMALL LETTER Z WITH DOT BELOW
ạ (\u1ea1) 3 := LATIN SMALL LETTER A WITH DOT BELOW
ŷ (\u0177) 1 := LATIN SMALL LETTER Y WITH CIRCUMFLEX
Also, the undertie ‿ (\u203f), with 1086 occurrences, is used in Sanskrit words between vowels in hiatus.
Some of these additional letters are used in the words of other languages. The number of cases is small enough to examine by hand, when anyone has the interest in investigating. We could also introduce `<lang>` markup for these words in other languages.
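A character census in the format of the listing above can be produced with a few lines. This is a sketch, not necessarily the script that was used; `non_ascii_census` is a hypothetical name, and the commented usage assumes the file is UTF-8.

```python
from collections import Counter
import unicodedata

def non_ascii_census(text):
    """Return one line per non-ASCII character: char, code point, count, name."""
    counts = Counter(c for c in text if ord(c) > 127)
    lines = []
    for c in sorted(counts):
        name = unicodedata.name(c, 'UNKNOWN')
        lines.append(f'{c} (\\u{ord(c):04x}) {counts[c]} := {name}')
    return lines

# Usage (path is an assumption):
# with open('gra.txt', encoding='utf-8') as f:
#     print('\n'.join(non_ascii_census(f.read())))
for line in non_ascii_census('a\u00b0b\u203f\u203f'):
    print(line)
```

Filtering the result against the set of letters expected in IAST leaves exactly the short list of characters that need to be examined by hand.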
Several other changes were made to bring the markup of gra.txt into line with the conventions used in the other Cologne digitizations.
`<div n="X">`
Some of the logical divisions (those with typographical distinctiveness) within entries were coded by Thomas, and have been converted using the 'div' tag.
- `<div n="H">` section heading on separate line, centered over two columns
- `<div n="P1">` subsection within a two-column segment
- `<div n="P">` similar to P1. Not sure of the distinction.

`<lbinfo n="N">`
As with other dictionaries such as those by Burnouf and Cappeller, Thomas used a vertical bar within a word to indicate that the word was hyphenated, beginning on one line and ending on the next. These instances have been recoded with `<lbinfo n="N"/>`, whose attribute value gives the position of the original vertical bar as an offset from the end of the word.
Example: `Prǎpo|sitionen` has 8 characters after the bar, so it becomes `Prǎpositionen <lbinfo n="8"/>`.
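The recoding and its inverse can be sketched as below. The function names are hypothetical, and the sketch assumes the tag is placed directly after the rejoined word with one space, as in the example; it also assumes words use precomposed letters, since `\w` does not match combining marks.

```python
import re

def bar_to_lbinfo(text):
    """Rejoin words split by '|' and record the bar's offset from the word end."""
    def repl(m):
        head, tail = m.group(1), m.group(2)
        return f'{head}{tail} <lbinfo n="{len(tail)}"/>'
    return re.sub(r'(\w+)\|(\w+)', repl, text)

def lbinfo_to_bar(text):
    """Inverse: put the bar back n characters before the end of the word."""
    def repl(m):
        word, n = m.group(1), int(m.group(2))
        return word[:-n] + '|' + word[-n:]
    return re.sub(r'(\w+) <lbinfo n="(\d+)"/>', repl, text)

print(bar_to_lbinfo('Pr\u01cepo|sitionen'))   # Prǎpositionen <lbinfo n="8"/>
```

Storing the offset rather than deleting the bar keeps the original line break recoverable while leaving the word itself searchable.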
`--` converted to emdash (—)
This was used to indicate words where the inter-letter spacing was wide. In contrast to PW and/or PWG, where such letter-spacing has semantic significance, I judged that the 20 or so instances of this coding in the Grassman dictionary were without significance, and thus removed this coding.
Quite a few (1000) of the section markings of the digitization were preceded by one or two § characters, as in `§§<P>`, `§<P>`, `§§<P1>`. I could determine no distinctive meaning for these, and hence removed the § characters.
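A sketch of this removal, assuming it is applied while the section markings are still in their `<P>` / `<P1>` form and that only § characters immediately before such a marking are affected (`strip_section_signs` is a hypothetical name):

```python
import re

def strip_section_signs(line):
    """Remove one or two '§' directly preceding a <P> or <P1> marking."""
    return re.sub(r'§{1,2}(?=<P1?>)', '', line)

print(strip_section_signs('§§<P1>-as 1) 548,12.'))   # <P1>-as 1) 548,12.
```

The lookahead restricts the change to section markings, so a § used elsewhere in running text would be left alone.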
In some other digitizations, the ellipsis character … was introduced by Thomas as a sort of markup not present in the text. However, in gra.txt, this character seems to correspond to a certain 'squiggle' character in the print; hence the … character is retained.
For instance, under headword agnidagDa
<L>85<pc>0009<k1>agnidagDa<k2>agni-dagDa/
{@agni-dagdhá,@}¦ a., {%von Feuer verbrannt%} (dah); daher 1) von den verbrannten Leichen, 2) von den vom Blitzstrahl getroffenen; siehe án-agnidagdha.
<div n="P1">-ás 1) 841,14 (yé … yé ánagnidagdhās). <<<< ELLIPSIS
<div n="P1">-ānām 2) (Ton auf í) 929,15 (im pariśiṣṭa zu 929).
<LEND>
Headwords akrIQat, dU (hom 1 and hom 2), mar (2nd form), varc, and SaS were not properly coded as headwords, but have now been added (with new decimal L-numbers) in this conversion.
That's all folks!
With my current near-kARa vision, I can get some of these conversions out of the way.

> added (with new decimal L-numbers

Hurray!

> could be treated as alternate headwords

Agree.

> This is unfortunate, in my view, since these combining characters are difficult to work with.

Yes, it's hard and dirty. AS has its own logic behind it, but not all of it is self-evident. Some of the code points will always remain deficient. So be it. Unicode is better than AS, even if it is not as logical and strict.

> The number of cases is small enough to examine by hand, when anyone has the interest in investigating.

Let me give it a try.
In preparing another summary of the Grassman IAST conventions, I came upon page 3 of the foreword. I think some of this is relevant to his coding, including that of accents. Could someone with German proficiency provide a translation or paraphrase?
While this translation is of some use, someone with German/English knowledge needs to improve upon the rather awkward english1 translation, using the pdf as the source of German.
> Could someone with German proficiency provide a translation or paraphrase?

Let's first ask the submitters of errors from Germany - please email them. If none, I will. I was thinking about this page and yes, it is related.

> There are also a small number of occurrences of the semi-vowels 'v' and 'y' with acute accents; for instance the v in headword 'svar' is presented with an acute accent. My hunch is that these should be considered errors in the author's presentation, but since I have no definite principle for converting these to modern standards, I have left the v-accent and y-accent unchanged.

Sv́ar is given as an alternative for súar; it is not an error. The U originally carried the udatta accent, but by sandhi the U changes to V. This actually gives a svarita accent on the following A, commonly marked with a grave accent: svàr.
There is just one instance of ŷ, under headword taMs. From surrounding text, I conjecture taṃsŷti is a rendering of a word in Lithuanian language. Can someone confirm?
Die Grundbedeutung ist aus dem Sanskrit nicht mit Sicherheit zu entwickeln, wol aber aus den
verwandten Sprachen. Im Litauischen ist teṃsti (pr. teṃsiu) „recken, ziehen”, taṃsŷti <lbinfo n="4"/>
(pr. taṃsaū́) „zerren, recken”, im Altpreussischen <lbinfo n="12"/> tiains-twei (2. p. Iv. tens-eiti)„wozuanreizen (zum Zorn, zum Glauben)”, im Gothischen at-pins-an „herbeiziehen
(<lang n="greek"></lang>)”, im Althochdeutschen dinsan (pr. dans) „ziehen”, im Neuhochdeutschen
gedunsen„angeschwollen”. Es ist hiernach taṃs aus tan (dehnen) durch Erweiterung
hervorgegangen<lbinfo n="8"/> und „recken, zerren” als die Grundbedeutung
<lbinfo n="9"/>anzusehen.
> There is just one instance of ŷ, under headword taMs.

> Im Litauischen ist teṃsti (pr. teṃsiu) „recken, ziehen”, taṃsŷti (pr. taṃsaū́) „zerren, recken”

At http://www.sanskrit-lexicon.uni-koeln.de/scans/GRAScan/2014/web/webtc/servepdf.php?page=0509 I see the ŷ. Have to check an etymological dictionary.
This issue is for notes relating to changes in the digitization gra.txt (Grassman Wörterbuch zum Rig Veda).
The general format is changed to be consistent with the so-called meta-line format.
Also, non-ascii characters in the digitization are represented using Unicode. In addition, several other changes are made so that coding conventions of this dictionary are more similar to those in other converted dictionaries.
Some further details of the conversion will be mentioned in subsequent comments.