sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

MW meta/iast conversion #216

Closed funderburkjim closed 5 years ago

funderburkjim commented 6 years ago

This issue is devoted to comments regarding the meta-line/IAST conversion of the Cologne digitization of the Monier-Williams Sanskrit-English Dictionary, 1899.

This conversion will present some unique challenges.

gasyoun commented 6 years ago

The hardest nut. Most software based on the XML ignores the markup, so we are free to experiment.

funderburkjim commented 6 years ago

IAST in MW

As with MW72, my objective is to convert the current AS (letter-number) coding to Unicode characters, and furthermore to bring the coding of Sanskrit words into conformance with current IAST conventions. As the comments in #215 show, this is a complicated subject. Thus a separate issue #218 is devoted to discussion of this subject.

funderburkjim commented 6 years ago

MW tag changes, part 1

Part of the conversion goal is to make the markup of MW more like the markup used in other Cologne digitizations. Here is what has been done along those lines thus far. I have copied tag descriptions from the mwtags.html document mentioned above.

| TAG | change | description |
| --- | --- | --- |
| <hc1>X</hc1> | removed | This is a vestigial element carried over from MONIER.ALL. |
| <hc3>X</hc3> | removed | A 3-digit code, assigned in MONIER.ALL as a classification of the form of records. These are not particularly reliable in the current version of the dictionary. |
| <key1>X</key1> | removed | Duplicate of the meta-line k1 field. Will reappear in new mw.xml. |
| <key2>X</key2> | change to X ¦ | Appears in the meta-line k2 field, but without xml. Everything in the original key2 appears on the first line of the entry preceding the ¦. The k2 field will reappear in new mw.xml as a <key2> field. |
| <MW>X</MW> | removed | A sequencing number in MONIER.ALL; has been superseded by the <L> entry for purposes of record identification. |
| <mat/> | removed | Coding of a particular '@' character in MONIER.ALL; meaning unclear. |
| <mul/> | removed | Coding of a particular '_' character in MONIER.ALL; meaning unclear. |
| <pc type="rev"> | <pc>, drop attribute | The "rev" is duplicative. All such have an <L revL="X">Y</L> element indicating a revision from the Appendix. |
| <mscverb/> | removed | Originally inserted to identify records which were verbs with multiple parts separated by the <msc/> element. |
| <indoitalic/> | removed | No documentation found for this. 3809 instances, all in <tail>. |
| <pc>PageX.Y</pc> | change to <pc>X,Y</pc> | No reason for the 'Page' string. |
| <L revL="(.*?)"> | change to <L><info n="rev"/> | No other dictionaries have an attribute on <L>. See note below. |
| <L supL="(.*?)"> | change to <L><info n="sup"/> | No other dictionaries have an attribute on <L>. See note below. |
| <tail> | removed | In the mw.txt being constructed, there is no content left. It is elsewhere, e.g. in the meta-line. Will reappear in construction of new mw.xml. |
| <H1>, etc. | moved to <e>1 in meta-line | This important MW line information has been moved to the meta-line, and removed as a tag. See the example of aMSu in the next comment. |
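
The "removed" rows and the two <pc> rows can be illustrated with a few regular expressions. This is only a minimal sketch, not the actual conversion program: the function name strip_vestigial is hypothetical, and <key1>, <key2>, <tail> and the <L> attributes are left alone here because their content has to be moved to the meta-line rather than simply dropped.

```python
import re

# Paired tags from the table that are dropped together with their content.
PAIRED_REMOVED = ("hc1", "hc3", "MW")
# Empty elements from the table that are simply deleted.
EMPTY_REMOVED = ("mat", "mul", "mscverb", "indoitalic")


def strip_vestigial(text: str) -> str:
    """Apply the simple 'removed' and <pc> changes to one old-style record."""
    for tag in PAIRED_REMOVED:
        text = re.sub(rf"<{tag}>.*?</{tag}>", "", text)
    for tag in EMPTY_REMOVED:
        text = text.replace(f"<{tag}/>", "")
    # <pc>PageX.Y</pc> -> <pc>X,Y</pc>
    text = re.sub(r"<pc>Page(\d+)\.(\d+)</pc>", r"<pc>\1,\2</pc>", text)
    # <pc type="rev"> -> <pc> (the attribute is duplicative)
    text = text.replace('<pc type="rev">', "<pc>")
    return text
```

For example, strip_vestigial('<mul/> <pc type="rev">1308,1</pc>') leaves ' <pc>1308,1</pc>'.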

funderburkjim commented 6 years ago

Note on <L revL> etc.

The ADDITIONS AND CORRECTIONS text appearing on pages 1308 ff. was originally coded separately. In this coding, these entries had separate record identifiers L in the range 300000+. At some point, it was decided to integrate these corrections into the body of the text. For example, the supplemental entry for aMSarUpiRI was inserted between aMSaBUta (L=27) and aMSavat (L=28), and given record number L=27.1. While this work was being debugged, it was felt to be useful to retain the L-numbers of the previous version. The solution we came up with was to add an attribute to the new L-number. In the case of aMSarUpiRI the form was <L supL="300010">27.1</L>, since 300010 was the old L-number.

Similarly, for entries from the ADDITIONS AND CORRECTIONS deemed to be a revision of an existing entry (rather than an addition), the 'revL' attribute was introduced. For instance, under headword 'aMSu' on page 1, col 2, we see in the text:

[image]

The correction section shows:

[image]

In such a case, the correction was folded into the previous record:

<H1A><h><hc3>100</hc3><key1>aMSu</key1><hc1>1</hc1><key2>aMSu/</key2></h>
<body> <lex type="inh">m.</lex> a ray , sunbeam </body><tail>
<pc>1,2</pc><pc type="rev">1308,1</pc> <L revL="300020">52</L></tail></H1A>

Here is how it is changed in the current conversion:

<L>52<pc>1,2<k1>aMSu<k2>aMSu/<e>1A
<s>aMSu/</s> ¦  <lex type="inh">m.</lex> a ray , sunbeam <pc>1308,1</pc> <info n="rev"/>
<LEND>

Note that we've coded the fact that this entry was adjusted based on the A/C section by the <info n="rev"/> tag, and also have retained the page-column of the A/C section <pc>1308,1</pc>. Also note that the main page-column of the entry is available in the meta-line as <pc>1,2.
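
The old-to-new rewrite of the aMSu record can be sketched in a few lines of Python. This is an illustration under assumptions, not the conversion program itself: the function name convert_record is hypothetical, homonym numbers (the <h> meta-line field) are ignored, key2 is assumed to contain no nested xml, and whitespace in the body is normalized to single spaces.

```python
import re


def convert_record(old: str) -> str:
    """Rewrite one old-style record as meta-line, body line and <LEND>."""
    flat = " ".join(old.split())                      # undo line wrapping
    e = re.match(r"<H([0-9][A-Z]?)>", flat).group(1)  # <H1A> -> "1A"
    k1 = re.search(r"<key1>(.*?)</key1>", flat).group(1)
    k2 = re.search(r"<key2>(.*?)</key2>", flat).group(1)
    body = re.search(r"<body>(.*?)</body>", flat).group(1).strip()
    tail = re.search(r"<tail>(.*?)</tail>", flat).group(1)
    pc = re.search(r"<pc>(.*?)</pc>", tail).group(1)  # main page,column
    rev_pc = re.search(r'<pc type="rev">(.*?)</pc>', tail)
    # kind is "rev", "sup" or None; the old 300000+ number is discarded.
    kind, lnum = re.search(
        r'<L(?: (rev|sup)L="[^"]*")?>(.*?)</L>', tail
    ).groups()

    meta = f"<L>{lnum}<pc>{pc}<k1>{k1}<k2>{k2}<e>{e}"
    line = f"<s>{k2}</s> ¦ {body}"
    if rev_pc:
        line += f" <pc>{rev_pc.group(1)}</pc>"  # keep A/C page,column
    if kind:
        line += f' <info n="{kind}"/>'
    return "\n".join([meta, line, "<LEND>"])
```

Applied to the old aMSu record above, the first line returned is the meta-line <L>52<pc>1,2<k1>aMSu<k2>aMSu/<e>1A.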

Removal of old L-number (e.g. 300010) is non-material

Since the previous version (with the L-numbers 300000+) is no longer available, the retention of these numbers has no referential use. They will still be present in a (saved) version (mw1.xml).

gasyoun commented 6 years ago

retention of these numbers has no referential use

Agree. Mesmerizing work, Jim, as usual.

funderburkjim commented 6 years ago

MW tag changes, part 2

| TAG | change | description |
| --- | --- | --- |
| <eq/> | = | mwtags.html says: Represents '=' character which appears in definitions in the form 'word1 = word2'. This distinguishes such usages of '=' from others, such as in XML markup involving attributes. Current view: No need to retain this distinction. Just use '='. |
| <amp/> | & | Ampersand character. Will be converted to &amp; in mw.xml. |
| <etc/> | &c. | Revert to actual MW print. No need for special tag. |
| <etc1/> | &c. | Revert to actual MW print. No need for special tag. |
| <etcetc/> | &c. &c. | Revert to actual MW print. No need for special tag. |
| <auml/> | ä | Use unicode character rather than special tag. |
| <euml/> | ë | Use unicode character rather than special tag. |
| <uuml/> | ü | Use unicode character rather than special tag. |
| <ouml/> | ö | Use unicode character rather than special tag. |
| <ccom/> | removed | mwtags.html says: Represents a character in MONIER.ALL used to represent certain complex records that needed further attention. Currently, there are only about 100 instances; these elements are probably now vestigial and should be removed. |
| <sr/> | ° Unicode DEGREE SIGN | mwtags.html says: Represents, within Sanskrit text, the small superscript circle used within the dictionary to represent omitted characters. This is one of the numerous abbreviation techniques used in the dictionary. See Note. |
| <sr1/> | ° Unicode DEGREE SIGN | Same as <sr/>. Don't know why there were two symbols for the same thing. |
| <srs1/> | <srs/> | <srs1/> is duplicative of <srs/>. In text within an <s> tag, this tag follows a vowel which appears in the printed MW text with a circumflex. For instance, <s>rAje<srs/>ndra</s> means the 'e' is the result of sandhi from rAja+indra. I would rather use a single unicode character instead of the <srs/>, but am not sure of the ramifications. |
| <fcom/> | removed | An alternate for DEGREE SIGN, or ů (3 times in Lithuanian). |
| <ns>X</ns> | changed to <s>X</s> | Originally thought to be non-Sanskrit. |
| <shc/> | changed to <shortlong/> | mwtags.html says: represents, within Sanskrit text, the superscript symbol (line below semi-circle) used above a vowel to indicate that the vowel may be either in short or long form. NOTE 1: In IAST, this could be represented as vowel+macron+breve. NOTE 2: These 200+ could be a source of new alternate headwords. |
| <fs/> | / | Represents a fraction or alternates. Example: 1<fs/>2 changed to 1/2. |
| <see/> | See | Cross reference to another entry. Often (always?) capitalized. |
| <see type="nonhier"/> | See | Cross reference to another entry. 4470 instances. Significance of "nonhier" not clear; not used in display. Best to remove this meaningless distinction. |
| </ls> <ls> | </ls> ; <ls> | Restore punctuation between literary sources. The semicolon is not always right; by putting this estimate into mw.txt, it will be correctable. |
| <quote> | ‘ LEFT SINGLE QUOTATION MARK | ‘ character not used elsewhere; similar to printed text; markup unneeded. |
| </quote> | ’ RIGHT SINGLE QUOTATION MARK | ’ character not used elsewhere; similar to printed text; markup unneeded. |
| <usage>X</usage> | X | Experimental markup only used twice. See Note. |
| <idiom>X</idiom> | X | Experimental markup only used twice. |
| <sense>X</sense> | X | Experimental markup only used twice. |
| <ellipsis/> | EM DASH | Used only in the experimental markup. |
| -- | EM DASH | Replace 2 or more hyphens with an em-dash. Normally in the raw form of compound headwords. There is no consistent meaning among '--', '---', etc., so replace all by the em-dash, which is visually similar to the printed text. |
| <qv/> | <ab>q.v.</ab> | More conformant to text and consistent with current display. |
| <cf/> | <ab>cf.</ab> | More conformant to text and consistent with current display. |
| <pron>X</pron> | <info lexcat="pron:X"/> | Meta information. This was introduced as a way to identify different entries as the same pronoun, e.g. hw = 'ma' (hom 3) refers to pronoun X=asmad. X is the slp1 spelling. |
| <card>X</card> | <info lexcat="card:X"/> | Meta information for cardinal number words. |
| <loan/> | <info lexcat="loan"/> | Meta information for Sanskrit words which are proper names loaned from another language. Occurs currently only in headwords sArisTAKA, sAhebrAm, humAuM. |
| <msc/> | ;<div n="vp"/> | 'msc' = 'Malten semicolon'. mwtags.html says: "Represents ';' when deemed to have a 'sense-separator' function. Occurs almost exclusively within records for verbs...." I think now only within verbs. ';' followed by 'div' markup makes a similar indication for display programs. 'vp' == 'verb paragraph'. |
| <pc>X,Y</pc> | <pb n="X,Y"/> | Indicates a page break within an entry. Only 700+ marked. The change of format is better xml practice, as the 'X,Y' (Page,Col) is not part of the text. |
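
Most rows of this table are plain one-for-one substitutions, which can be sketched as a lookup table plus a few regular expressions. This is a minimal illustration, not the actual conversion code: the names SIMPLE and apply_simple_changes are hypothetical, only a subset of the rows is covered, and the <pc> to <pb> change is omitted because it depends on where the <pc> occurs.

```python
import re

# One-for-one replacements taken from the table (subset).
SIMPLE = {
    "<eq/>": "=",
    "<amp/>": "&",
    "<etc/>": "&c.",
    "<etc1/>": "&c.",
    "<etcetc/>": "&c. &c.",
    "<auml/>": "ä",
    "<euml/>": "ë",
    "<uuml/>": "ü",
    "<ouml/>": "ö",
    "<sr/>": "\u00b0",   # DEGREE SIGN
    "<sr1/>": "\u00b0",
    "<srs1/>": "<srs/>",
    "<fs/>": "/",
    "<quote>": "\u2018",   # LEFT SINGLE QUOTATION MARK
    "</quote>": "\u2019",  # RIGHT SINGLE QUOTATION MARK
    "<qv/>": "<ab>q.v.</ab>",
    "<cf/>": "<ab>cf.</ab>",
    "<loan/>": '<info lexcat="loan"/>',
    "<msc/>": ';<div n="vp"/>',
}


def apply_simple_changes(text: str) -> str:
    """Apply the literal substitutions, then the pattern-based ones."""
    for old, new in SIMPLE.items():
        text = text.replace(old, new)
    # Two or more hyphens become a single EM DASH (U+2014).
    text = re.sub(r"-{2,}", "\u2014", text)
    # <pron>X</pron> and <card>X</card> become meta information.
    text = re.sub(r"<(pron|card)>(.*?)</\1>", r'<info lexcat="\1:\2"/>', text)
    return text
```

For example, apply_simple_changes("1<fs/>2") gives "1/2", matching the <fs/> row.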

Note on ° Unicode DEGREE SIGN

This character is used in the revised mw.txt to indicate an incomplete spelling, somewhat like an abbreviation. In digitizations of some other dictionaries, the Unicode º MASCULINE ORDINAL INDICATOR character is used for the same purpose. We should be consistent across dictionaries in this detail. In some places, notably in French dictionaries, the º MASCULINE ORDINAL INDICATOR is appropriate; for instance 1º, etc. may appropriately use this M.O.I. character, as I understand it.
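
Since the two characters look nearly identical, a quick count by code point can show which one a given digitization uses. The following is only a sketch; the function name count_abbrev_marks is hypothetical and the sample string is made up for illustration.

```python
import unicodedata

DEGREE = "\u00b0"   # DEGREE SIGN
ORDINAL = "\u00ba"  # MASCULINE ORDINAL INDICATOR


def count_abbrev_marks(text: str) -> dict:
    """Count each of the two look-alike characters, keyed by Unicode name."""
    return {unicodedata.name(ch): text.count(ch) for ch in (DEGREE, ORDINAL)}


# Made-up sample: one degree sign, two ordinal indicators.
print(count_abbrev_marks("rAja\u00b0 1\u00ba 2\u00ba"))
```

Running such a count over each dictionary's text file would be one way to find where the usage is inconsistent.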

Note on <usage> example:

Here is an example:


<H1><h><hc3>200</hc3><key1>aDa</key1><hc1>1</hc1><key2>a/Da</key2></h><body>
<OR group="3951,aDa;3953.2,aDA"/> or <s>a/DA</s> <lex>ind.</lex> , <ab>Ved.</ab> (<eq/> 
<s>a/Ta</s> , used chiefly as an inceptive particle) , now , then , therefore , moreover , so much the 
more , and , partly. 

<usage><idiom><s>a/Da</s><ellipsis/><s>a/Da</s></idiom><sense> as well as , partly partly. 
</sense></usage> 

</body><tail><MW>002776</MW> <pc>19,3</pc> <L>3951</L></tail></H1>

funderburkjim commented 6 years ago

Coding of Greek text

The former coding used an index into a supplementary file as the way to actually get the underlying Greek unicode text for displays. This is changed so that the Greek text is part of the 'mw.txt' digitization.

The 3rd homonym of headword a illustrates this. Old:

<H1><h><hc3>000</hc3><key1>a</key1><hc1>1</hc1><key2>a</key2><hom>3</hom></h>
<body> ( before a vowel <s>an</s> , exc. <s>a-fRin</s>) , a prefix corresponding to 
<ab>Gk.</ab> <gk>1</gk> , <gk>2</gk> , 
<ab>Lat.</ab> <etym>in</etym> , <ab>Goth.</ab> and <ab>Germ.</ab> <etym>un</etym> , 
<ab>Eng.</ab> <etym>in</etym> or <etym>un</etym> , and having a negative or privative or 
contrary sense (<s>an-eka</s> not one ; <s>an-anta</s> endless ; <s>a-sat</s> not good ; <s>a-
paSyat</s> not seeing ) </body><tail><mul/> <pc>1,1</pc> <L>4</L></tail></H1>

New:

<L>4<pc>1,1<k1>a<k2>a<h>3<e>1
<s>a</s> <hom>3</hom> ¦  ( before a vowel <s>an</s> , exc. <s>a-fRin</s>) , 
a prefix corresponding to 
<ab>Gk.</ab> <lang n="greek">ἀ</lang> , <lang n="greek">ἀν</lang> , 
<ab>Lat.</ab> <etym>in</etym> , <ab>Goth.</ab> and <ab>Germ.</ab> <etym>un</etym> , 
<ab>Eng.</ab> <etym>in</etym> or <etym>un</etym> , and having a negative or privative or 
contrary sense (<s>an-eka</s> not one ; <s>an-anta</s> endless ; <s>a-sat</s> not good ; 
<s>a-paSyat</s> not seeing ) 
<LEND>
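
The change from indexed <gk> elements to inline Greek text can be sketched as a lookup-and-replace. This is an illustration under assumptions: the layout of the supplementary file is not described here, so it is modeled as a plain dict from index number to Greek string, and the function name inline_greek is hypothetical.

```python
import re


def inline_greek(text: str, greek_by_index: dict) -> str:
    """Replace each <gk>N</gk> with <lang n="greek">...</lang>.

    greek_by_index is an assumed stand-in for the supplementary file:
    a mapping from the integer index N to the Greek unicode string.
    """
    def repl(match):
        greek = greek_by_index[int(match.group(1))]
        return f'<lang n="greek">{greek}</lang>'

    return re.sub(r"<gk>(\d+)</gk>", repl, text)
```

With the mapping {1: "ἀ", 2: "ἀν"}, the fragment '<gk>1</gk> , <gk>2</gk>' becomes '<lang n="greek">ἀ</lang> , <lang n="greek">ἀν</lang>', as in the new record above.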

funderburkjim commented 6 years ago

Revision installed.

🍾

gasyoun commented 6 years ago

Hurray, long live the Jim.