Closed funderburkjim closed 7 years ago
Print forgot 'sam'
old:
{@uc-chrita-@}¦ a. v. (monté) haut, élevé, dressé.
new:
{@sam-uc-chrita-@}¦ a. v. (monté) haut, élevé, dressé.
As with Burnouf, the original coding of stc.txt uses a vertical bar to denote a line break in the middle of a word. This was changed as follows, also as in Burnouf:
X|Y -> XY <lbinfo n="k"/>
where k = length of substring Y.
For example (see sum-uc-chvas above): ha|leine -> haleine <lbinfo n="5"/>
This is a nice compromise that we might use elsewhere, e.g. in ap90, to avoid the problems with parsing and interpreting words which have a line break in the printed text.
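For anyone wanting to try the same change in another dictionary, here is a minimal sketch of the X|Y rule described above. It is not the script actually used for stc.txt; the function name `join_linebreak` is made up for illustration, and it assumes that "|" occurs only as the mid-word line-break marker and that hyphens are not counted as part of X or Y.

```python
import re

# Word characters on each side of a vertical bar.
# Assumption: "|" appears only as the mid-word line-break marker.
_BREAK = re.compile(r"(\w+)\|(\w+)")


def join_linebreak(text: str) -> str:
    """Rewrite X|Y as XY <lbinfo n="k"/>, where k = len(Y)."""
    def repl(m: re.Match) -> str:
        x, y = m.group(1), m.group(2)
        # k records how many trailing characters followed the break in print.
        return f'{x}{y} <lbinfo n="{len(y)}"/>'
    return _BREAK.sub(repl, text)


print(join_linebreak("ha|leine"))
```

Because k is stored, the printed line break can be recovered later by counting k characters back from the end of the joined word.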
There is only one instance in stc.txt of Greek text:
<L>13902<pc>442,1<k1>puzkarAvatI<k2>puzkarAvatI
{@puṣkarāvatī-@}¦ f. n. d'une ville (<lang n="greek"></lang> des anciens et Pousekielofati
de HiouenThsang <lbinfo n="6"/>) entre Indus et Swat, capitale des Gāndhāra.
<LEND>
Maybe @jmigliori or someone else can provide the Greek Unicode here.
In the previous version of stc.txt, there is no identification of the homonym numbers which appear in 1149 headwords. Also, in this previous version, the key2 contents have not been properly converted to SLP1.
Changes were made in both of these cases in the meta-line version of stc.txt. For instance,
HOMONYM
OLD (stc.xml)
<H1><h><key1>akza</key1><key2>1akza-</key2></h> .....
NEW
<L>102<pc>3,2<k1>akza<k2>akza<h>1
{@1 akṣa-@}¦ m. dé (à jouer); n. d'une plante (Terminalia Bellerica).
(stc.xml)
<H1><h><key1>akza</key1><key2>akza</key2><hom>1</hom></h><body><b>1 akṣa-</b>¦
KEY2
OLD (stc.xml) -- note key2 not in slp1
<H1><h><key1>aYc</key1><key2>AN5C-</key2></h>
NEW
<H1><h><key1>aYc</key1><key2>aYc</key2></h><body><b>AÑC-</b>¦
Many recent dictionaries which have been converted to meta-line format do NOT have homonyms.
In working through the examples above, I noticed that in this stc example, there was a bug in hwparse.py and hw.py. Our convention for the homonym field in the meta-line is to use <h>, but the code assumed the convention was to use <hom>. However, in the xml form, we do use <hom>.
We will need to be vigilant to ensure consistency in this detail.
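To make the two conventions concrete, here is a small sketch that reads the homonym from <h> in a meta-line and writes it as <hom> in the xml head. This is not the code in hwparse.py or hw.py; the names `parse_metaline` and `xml_head` are hypothetical, and the sketch assumes the field values contain no "<" and need no XML escaping.

```python
import re

# Meta-line fields: <L>, <pc>, <k1>, <k2>, and the optional homonym <h>.
_FIELD = re.compile(r"<(L|pc|k1|k2|h)>([^<]*)")


def parse_metaline(line: str) -> dict:
    """Split a meta-line into a dict keyed by field tag."""
    return {tag: value for tag, value in _FIELD.findall(line)}


def xml_head(meta: dict) -> str:
    """Build the xml head; the meta-line <h> value becomes <hom> here."""
    hom = f"<hom>{meta['h']}</hom>" if "h" in meta else ""
    return (f"<H1><h><key1>{meta['k1']}</key1>"
            f"<key2>{meta['k2']}</key2>{hom}</h>")


meta = parse_metaline("<L>102<pc>3,2<k1>akza<k2>akza<h>1")
print(xml_head(meta))
```

Keeping the <h> to <hom> mapping in one place like this is one way to avoid the two conventions drifting apart again.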
This conversion is now finished: 🤞
The AS (letter-number) coding was converted to Unicode.
Bold text and italic text were identified as Sanskrit; in particular, in such contexts, the c-with-cedilla Ç was converted to IAST s-with-acute Ś.
In other text, the assumption made was that the text was French, and thus Ç was left unchanged.
There are several hundred cases where this choice could be considered wrong, notably Çiva, Çakti, etc.
On the other hand, it could be considered that these are instances where a Sanskrit word has in effect been adopted into the French language, and therefore the Ç is appropriate. I mention this only in passing, since it is a case where I can see reasonable arguments on both sides. Having some instances in italic or bold spelled Śiva and some in plain text spelled Çiva does make body text searches harder.
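A rough sketch of the context-dependent rule described above, limited to bold spans written as {@...@}. This is not the conversion code that was actually run: the function name `cedilla_to_iast` is made up, the lowercase mapping ç to ś is my assumption, and italic spans would need their own delimiter pattern, which is not shown here.

```python
import re

# Bold spans in stc.txt are written {@...@}.
_BOLD = re.compile(r"\{@(.*?)@\}")

# Assumption: lowercase c-with-cedilla maps to lowercase s-with-acute too.
_TO_IAST = str.maketrans({"Ç": "Ś", "ç": "ś"})


def cedilla_to_iast(text: str) -> str:
    """Convert c-with-cedilla to IAST s-with-acute inside bold spans only."""
    return _BOLD.sub(
        lambda m: "{@" + m.group(1).translate(_TO_IAST) + "@}",
        text,
    )


# Invented sample line, not taken from stc.txt.
print(cedilla_to_iast("{@Çiva-@}¦ m. n. de Çiva"))
```

Text outside the bold span is treated as French and left alone, which is exactly what produces the mixed Śiva / Çiva spellings.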
Πευκελαῶτις
@jmigliori
Thanks, Jonathan!
This issue is devoted to the conversion of the Stchoupak dictionary (stc.txt) to meta-line format. The conversion from AS (letter-number) coding of letters with diacritics to Unicode (for both French words and Sanskrit words, the latter in IAST) will also be done at this time.