sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/
stc iast and meta-line conversions #182

Closed funderburkjim closed 7 years ago

funderburkjim commented 7 years ago

This issue is devoted to the conversion of the Stchoupak dictionary (stc.txt) to meta-line format. The conversion from the AS (letter-number) coding of letters with diacritics to Unicode (for both French words and Sanskrit words, the latter in IAST) will be done at the same time.

funderburkjim commented 7 years ago

Correct print error in headword

The printed text omits 'sam'.

old:
{@uc-chrita-@}¦ a. v. (monté) haut, élevé, dressé.
new:
{@sam-uc-chrita-@}¦ a. v. (monté) haut, élevé, dressé.

funderburkjim commented 7 years ago

lbinfo

As with Burnouf, original coding of stc.txt uses a vertical bar to denote a line-break in the middle of a word. This was changed as follows, also as in Burnouf:

X|Y -> XY <lbinfo n="k"/>     
where k = length of substring Y.
For example (see sum-uc-chvas above):   ha|leine -> haleine <lbinfo n="5"/>

This is a nice compromise that we might use elsewhere e.g., in ap90, to avoid the problems with parsing and interpreting words which have a line-break in the printed text.
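
The X|Y rule above can be sketched in a few lines of Python. This is only an illustration, not the script actually used for stc.txt: the function name `convert_linebreak_bars` is hypothetical, and it assumes the bar always sits between two runs of word characters and that the length k is counted on the text as it stands when the rule is applied.

```python
import re

# Assumption: a mid-word line break is a bar with word characters on both sides.
_BAR = re.compile(r"(\w+)\|(\w+)")

def convert_linebreak_bars(line: str) -> str:
    """Rewrite X|Y as XY <lbinfo n="k"/>, where k = len(Y)."""
    def repl(m: re.Match) -> str:
        x, y = m.group(1), m.group(2)
        return f'{x}{y} <lbinfo n="{len(y)}"/>'
    return _BAR.sub(repl, line)

print(convert_linebreak_bars("ha|leine"))
print(convert_linebreak_bars("de Hiouen|Thsang"))
```

Because k records how many trailing characters belonged to the second printed line, the original break position can be recovered later without splitting the word in the digitization.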

funderburkjim commented 7 years ago

Greek

There is only one instance in stc.txt of Greek text:

<L>13902<pc>442,1<k1>puzkarAvatI<k2>puzkarAvatI
{@puṣkarāvatī-@}¦ f. n. d'une ville (<lang n="greek"></lang> des anciens et Pousekielofati
de HiouenThsang <lbinfo n="6"/>) entre Indus et Swat, capitale des Gāndhāra.
<LEND>

Maybe @jmigliori or someone else can provide the Greek Unicode here.

funderburkjim commented 7 years ago

Homonyms and key2

In the previous version of stc.txt, there is no identification of the homonym numbers which appear in 1149 headwords. Also, in this previous version, the key2 contents have not been properly converted to SLP1.

Both of these problems were fixed in the meta-line version of stc.txt. For instance,

HOMONYM
OLD (stc.xml)
<H1><h><key1>akza</key1><key2>1akza-</key2></h>  .....

NEW (stc.txt)
<L>102<pc>3,2<k1>akza<k2>akza<h>1
{@1 akṣa-@}¦ m. dé (à jouer); n. d'une plante (Terminalia Bellerica).
NEW (stc.xml)
<H1><h><key1>akza</key1><key2>akza</key2><hom>1</hom></h><body><b>1 akṣa-</b>¦

KEY2
OLD (stc.xml)  -- note key2 not in slp1
<H1><h><key1>aYc</key1><key2>AN5C-</key2></h>

NEW (stc.xml)
<H1><h><key1>aYc</key1><key2>aYc</key2></h><body><b>AÑC-</b>¦

Note

Many of the dictionaries recently converted to meta-line format do NOT have homonyms. In working through the examples above, I noticed that this stc example exposed a bug in hwparse.py and hw.py. Our convention for the homonym field in the meta-line is to use <h>, but the code assumed the convention was <hom>. In the xml form, however, we do use <hom>. We will need to be vigilant to ensure consistency in this detail.
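
To make the <h> versus <hom> distinction concrete, here is a minimal Python sketch. It is not the code in hwparse.py or hw.py; the function names `parse_metaline` and `xml_head` are hypothetical, and it assumes that field values in a meta-line never contain `<`.

```python
import re

# Assumption: a meta-line is a run of <tag>value pairs, and values contain no '<'.
_FIELD = re.compile(r"<(\w+)>([^<]*)")

def parse_metaline(line: str) -> dict:
    """Split a meta-line such as <L>102<pc>3,2<k1>akza<k2>akza<h>1 into fields."""
    return dict(_FIELD.findall(line))

def xml_head(meta: dict) -> str:
    """Build the xml head; the meta-line field <h> becomes <hom> in the xml."""
    hom = f"<hom>{meta['h']}</hom>" if "h" in meta else ""
    return f"<H1><h><key1>{meta['k1']}</key1><key2>{meta['k2']}</key2>{hom}</h>"

meta = parse_metaline("<L>102<pc>3,2<k1>akza<k2>akza<h>1")
print(xml_head(meta))
```

Keeping the renaming in one place (here, `xml_head`) is one way to avoid the two conventions drifting apart across scripts.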

funderburkjim commented 7 years ago

This conversion is now finished: 🤞

funderburkjim commented 7 years ago

IAST

The AS (letter-number) coding was converted to Unicode.

Bold text and italic text were identified as Sanskrit; in particular, in such contexts the c-with-cedilla Ç was converted to the IAST s-with-acute Ś.

In all other text, the assumption was that the text is French, and thus Ç was left unchanged. There are several hundred cases where this choice could be considered wrong, notably Çiva, Çakti, etc. On the other hand, these could be regarded as instances where a Sanskrit word has in effect been adopted into French, in which case the Ç is appropriate. I mention this only in passing, since I can see reasonable arguments on both sides. Having some instances spelled Śiva (in bold or italic text) and others spelled Çiva (in plain text) does make body-text searches harder.
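
The context-dependent treatment of Ç can be sketched as follows. This is an illustration rather than the conversion program itself. It assumes that bold is marked `{@...@}` (as in the entries quoted in this thread) and that italic is marked `{%...%}`, that such spans do not nest or cross line boundaries, and that lowercase ç is treated the same way as Ç. The function name `convert_cedilla` is hypothetical.

```python
import re

# Assumption: bold is {@...@}, italic is {%...%}; spans are not nested.
_SANSKRIT_SPAN = re.compile(r"\{@.*?@\}|\{%.*?%\}")

# Assumption: lowercase c-cedilla is handled like the uppercase form.
_CEDILLA_TO_ACUTE = str.maketrans({"Ç": "Ś", "ç": "ś"})

def convert_cedilla(text: str) -> str:
    """Map Ç/ç to Ś/ś inside bold or italic spans only; leave other text as French."""
    return _SANSKRIT_SPAN.sub(
        lambda m: m.group(0).translate(_CEDILLA_TO_ACUTE), text
    )

print(convert_cedilla("{@Çiva-@}¦ n. de Çiva; {%çakti%} en français"))
```

With this approach the unmarked Çiva and the ç of "français" are untouched, which is exactly the source of the mixed Śiva/Çiva spellings noted above.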

jmigliori commented 7 years ago

Πευκελαῶτις

funderburkjim commented 7 years ago

@jmigliori
Thanks, Jonathan!