Closed funderburkjim closed 5 years ago
The hardest nut. Most software based on the XML ignores the markup, so we are free to experiment.
As with MW72, my objective is to convert the current AS (letter-number) coding to Unicode characters, and furthermore to bring the coding of Sanskrit words into conformance with current IAST conventions. As the comments in #215 show, this is a complicated subject. Thus a separate issue #218 is devoted to discussion of this subject.
Part of the conversion goal is to make the markup of MW more like the markup used in other Cologne digitizations. Here is what has been done along those lines thus far. I have copied tag descriptions from the mwtags.html document mentioned above.
TAG | change | description |
---|---|---|
<hc1>X</hc1> | removed | This is a vestigial element carried over from MONIER.ALL. |
<hc3>X</hc3> | removed | a 3-digit code, assigned in MONIER.ALL as a classification of the form of records. These are not particularly reliable in the current version of the dictionary. |
<key1>X</key1> | removed | duplicate of meta-line k1 field. Will reappear in new mw.xml |
<key2>X</key2> | change to X ¦ | appears in meta-line k2 field, but without xml. Everything in the original key2 appears on the first line of entry preceding the ¦. k2 field will reappear in new mw.xml as <key2> field. |
<MW>X</MW> | removed | a sequencing number in MONIER.ALL; has been superseded by <L> entry for purposes of record identification |
<mat/> | removed | coding of a particular '@' character in MONIER.ALL; meaning unclear |
<mul/> | removed | coding of a particular '_' character in MONIER.ALL; meaning unclear |
<pc type="rev"> | The "rev" is duplicative. All such have an <L revL="X">Y</L> element indicating a revision from Appendix | |
<mscverb/> | removed | originally inserted to identify records which were verbs with multiple parts separated by the <msc/> element. |
<indoitalic/> | removed | No documentation found for this. 3809 instances, all in <tail> |
<pc>PageX.Y</pc> | change to <pc>X,Y</pc> | No reason for the 'Page' string |
<L revL="(.*?)"> | Change to <L><info n="rev"/> | No other dictionaries have attribute on <L> . See note below |
<L supL="(.*?)"> | Change to <L><info n="sup"/> | No other dictionaries have attribute on <L> . See note below |
<tail> | removed | In the mw.txt being constructed, there is no content left. It is elsewhere, e.g. meta-line. Will reappear in construction of new mw.xml |
<H1> ,etc | moved to <e>1 in meta-line | This important MW line information has been moved to the meta-line, and removed as a tag. See the example of aMSu in the next comment. |
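A few of the changes in this table are mechanical enough to sketch with regular expressions. The following is only an illustration, not the script actually used for the conversion: the function name `convert_structural_tags` is made up, and placing `<info n="rev"/>` immediately after `</L>` is an assumption (in the real mw.txt the L-number goes to the meta-line and the `<info>` element to the end of the entry body).

```python
import re

def convert_structural_tags(text: str) -> str:
    """Apply a handful of the tag changes from the table to old-style MW markup.

    Hypothetical helper for illustration only.
    """
    # <pc>Page1.2</pc> -> <pc>1,2</pc> (the 'Page' string carries no information)
    text = re.sub(r"<pc>Page(\d+)\.(\d+)</pc>", r"<pc>\1,\2</pc>", text)
    # <L revL="300020">52</L> -> <L>52</L><info n="rev"/> (likewise for supL);
    # the old 300000+ number is dropped.
    text = re.sub(r'<L (rev|sup)L="[^"]*">(.*?)</L>', r'<L>\2</L><info n="\1"/>', text)
    # Remove vestigial elements that carry content.
    text = re.sub(r"<(hc1|hc3|MW)>.*?</\1>", "", text)
    # Remove vestigial empty elements.
    text = re.sub(r"<(mat|mul|mscverb|indoitalic)/>", "", text)
    return text

old = '<hc3>100</hc3><pc>Page1.2</pc> <L revL="300020">52</L><mul/>'
print(convert_structural_tags(old))
```

The `key1`, `key2`, `tail` and `H1` rows are left out on purpose, since they involve moving data into the meta-line rather than a local substitution.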
<L revL> etc.
The ADDITIONS AND CORRECTIONS text appearing on pages 1308 ff. was originally coded separately. In this coding, these entries had separate record identifiers L in the range 300000+. At some point, it was decided to integrate these corrections into the body of the text. For example, the supplemental entry for aMSarUpiRI was inserted between aMSaBUta (L=27) and aMSavat (L=28), and given record number L=27.1.
While this work was being debugged, it was felt to be useful to retain the L-numbers of the previous version. The solution we came up with was to add an attribute to the new L-number. In the case of aMSarUpiRI the form was <L supL="300010">27.1</L>, since 300010 was the old L-number.
Similarly, for entries from the ADDITIONS AND CORRECTIONS deemed to be a revision of an existing entry (rather than an addition), the 'revL' attribute was introduced. For instance, under headword 'aMSu' on page 1, col 2, we see in the text
The correction section shows:
In such a case, the correction was folded into the previous record:
<H1A><h><hc3>100</hc3><key1>aMSu</key1><hc1>1</hc1><key2>aMSu/</key2></h>
<body> <lex type="inh">m.</lex> a ray , sunbeam </body><tail>
<pc>1,2</pc><pc type="rev">1308,1</pc> <L revL="300020">52</L></tail></H1A>
Here is how it is changed in the current conversion:
<L>52<pc>1,2<k1>aMSu<k2>aMSu/<e>1A
<s>aMSu/</s> ¦ <lex type="inh">m.</lex> a ray , sunbeam <pc>1308,1</pc> <info n="rev"/>
<LEND>
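For readers who want to work with the new format, here is a minimal sketch of reading the meta-line (the first line of an entry) into its fields. It assumes that field values never contain a `<` character, which holds for the examples in this thread but is not verified against the whole of mw.txt; the function name `parse_metaline` is invented for this illustration.

```python
import re

def parse_metaline(line: str) -> dict:
    """Split a meta-line such as '<L>52<pc>1,2<k1>aMSu<k2>aMSu/<e>1A' into fields.

    Each field is an opening tag followed by its value, which runs up to the
    next '<' or the end of the line (assumption: values contain no '<').
    """
    return {tag: value for tag, value in re.findall(r"<(\w+)>([^<]*)", line)}

fields = parse_metaline("<L>52<pc>1,2<k1>aMSu<k2>aMSu/<e>1A")
print(fields["L"], fields["pc"], fields["k1"], fields["k2"], fields["e"])
```

Here `fields["pc"]` gives the main page-column of the entry and `fields["e"]` the former `<H1A>` line type.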
Note that we've coded the fact that this entry was adjusted based on the A/C section by the <info n="rev"/> tag, and also have retained the page-column of the A/C section <pc>1308,1</pc>.
Also note that the main page-column of the entry is available in the meta-line as <pc>1,2.
Since the previous version (with the L-numbers 300000+) is no longer available, the retention of these numbers has no referential use. They will still be present in a (saved) version (mw1.xml).
> retention of these numbers has no referential use
Agree. Mesmerizing work, Jim, as usual.
TAG | change | description |
---|---|---|
<eq/> | = | mwtags.html says: Represents '=' character which appears in definitions in the form 'word1 = word2'. This distinguishes such usages of '=' from others, such as in XML markup involving attributes. Current view: No need to retain this distinction. Just use '=' |
<amp/> | & | Ampersand character. Will be converted to &amp; in mw.xml |
<etc/> | &c. | Revert to actual MW print. No need for special tag. |
<etc1/> | &c. | Revert to actual MW print. No need for special tag. |
<etcetc/> | &c. &c. | Revert to actual MW print. No need for special tag. |
<auml/> | ä | Use unicode character rather than special tag. |
<euml/> | ë | Use unicode character rather than special tag. |
<uuml/> | ü | Use unicode character rather than special tag. |
<ouml/> | ö | Use unicode character rather than special tag. |
<ccom/> | removed | mwtags.html says: Represents a character in MONIER.ALL used to represent certain complex records that needed further attention. Currently, there are only about 100 instances; these elements are probably now vestigial and should be removed. |
<sr/> | ° Unicode DEGREE SIGN | mwtags.html says: Represents, within Sanskrit text, the small superscript circle used within the dictionary to represent omitted characters. This is one of the numerous abbreviation techniques used in the dictionary. See Note |
<sr1/> | ° Unicode DEGREE SIGN | Same as <sr/> . Don't know why there were two symbols for the same thing. |
<srs1/> | <srs/> | <srs1/> is duplicative of <srs/> . In text within <s> tag, this tag follows a vowel which appears in the printed MW text with a circumflex; for instance <s>rAje<srs/>ndra</s> means the 'e' is the result of sandhi from rAja+indra. I would rather use a single unicode character instead of the <srs/> , but not sure of ramifications |
<fcom/> | removed | An alternate for DEGREE SIGN, or ů (3 times in Lithuanian) |
<ns>X</ns> | changed <s>X</s> | Originally thought to be non-sanskrit. |
<shc/> | changed <shortlong/> | mwtags.html says: represents, within Sanskrit text, the superscript symbol (line below semi-circle) used above a vowel to indicate that the vowel may be either in short or long form. NOTE 1: In IAST, this could be represented as vowel+macron+breve. NOTE 2: These 200+ could be source of new alternate headwords |
<fs/> | / | Represents a fraction or alternates. Example: 1<fs/>2 changed to 1/2 |
<see/> | See | Cross reference to another entry. Often (always?) capitalized. |
<see type="nonhier"/> | See | Cross reference to another entry. 4470 instances. Significance of "nonhier" not clear; not used in display. Best to remove this meaningless distinction. |
</ls> <ls> | </ls> ; <ls> | Restore punctuation between literary sources. The semicolon is not always right; by putting this estimate into mw.txt, it will be correctable. |
<quote> | ‘ LEFT SINGLE QUOTATION MARK | ‘ character not used elsewhere; similar to printed text; markup unneeded |
</quote> | ’ RIGHT SINGLE QUOTATION MARK | ’ character not used elsewhere; similar to printed text; markup unneeded |
<usage>X</usage> | X | experimental markup only used twice. See Note |
<idiom>X</idiom> | X | experimental markup only used twice. |
<sense>X</sense> | X | experimental markup only used twice. |
<ellipsis/> | — | EM DASH. Used only in the experimental markup |
-- | — | Replace 2 or more hyphens with em-dash. Normally in raw form of compound headwords. There is no consistent meaning among '--','---', etc. So replace all by the em-dash, which is visually similar to the printed text. |
<qv/> | <ab>q.v.</ab> | More conformant to text and consistent with current display |
<cf/> | <ab>cf.</ab> | More conformant to text and consistent with current display |
<pron>X</pron> | <info lexcat="pron:X"/> | Meta information. This was introduced as a way to identify different entries as the same pronoun, e.g. hw = 'ma' (hom 3) refers to pronoun X=asmad. X is slp1 spelling. |
<card>X</card> | <info lexcat="card:X"/> | Meta information for cardinal number words. |
<loan/> | <info lexcat="loan"/> | Meta information for Sanskrit words which are proper names loaned from another language. Occurs currently only in headwords sArisTAKA, sAhebrAm, humAuM |
<msc/> | ;<div n="vp"/> | 'msc' = 'Malten semicolon'. mwtags.html says: "Represents ';' when deemed to have a 'sense-separator' function. Occurs almost exclusively within records for verbs...." I think now only within verbs. ';' followed by 'div' markup makes a similar indication for display programs. 'vp' == 'verb paragraph'. |
<pc>X,Y</pc> | <pb n="X,Y"/> | indicates page break within entry. Only 700+ marked. Change of format is better xml practice, as the 'X,Y' (Page,Col) is not part of text |
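Most rows of this table are one-for-one substitutions, so they could be scripted roughly as follows. This is a sketch under assumptions rather than the conversion program itself: the mapping covers only the simple rows, the names `SIMPLE` and `apply_simple` are invented, and the hyphen rule is applied to the whole string without checking that it sits in a compound headword.

```python
import re

# Old tag -> replacement text, copied from the simple rows of the table.
SIMPLE = {
    "<eq/>": "=",
    "<amp/>": "&",
    "<etc/>": "&c.",
    "<etc1/>": "&c.",
    "<etcetc/>": "&c. &c.",
    "<auml/>": "\u00e4",   # a with diaeresis
    "<euml/>": "\u00eb",   # e with diaeresis
    "<uuml/>": "\u00fc",   # u with diaeresis
    "<ouml/>": "\u00f6",   # o with diaeresis
    "<sr/>": "\u00b0",     # DEGREE SIGN
    "<sr1/>": "\u00b0",    # DEGREE SIGN
    "<srs1/>": "<srs/>",
    "<fs/>": "/",
    "<quote>": "\u2018",   # LEFT SINGLE QUOTATION MARK
    "</quote>": "\u2019",  # RIGHT SINGLE QUOTATION MARK
    "<qv/>": "<ab>q.v.</ab>",
    "<cf/>": "<ab>cf.</ab>",
    "<ccom/>": "",         # removed
}

def apply_simple(text: str) -> str:
    """Apply the literal substitutions, then collapse runs of hyphens."""
    for old, new in SIMPLE.items():
        text = text.replace(old, new)
    # Two or more hyphens become a single EM DASH (U+2014).
    return re.sub(r"-{2,}", "\u2014", text)

print(apply_simple("1<fs/>2 <eq/> x <etc/> a--b <quote>q</quote>"))
```

Rows that rewrite element content (`<ns>`, `<pron>`, `<card>`, `<pc>` to `<pb>`) or depend on context (`</ls> <ls>`) would need their own patterns and are not covered here.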
The ° DEGREE SIGN character is used in the revised mw.txt to indicate an incomplete spelling, sort of like an abbreviation. In digitizations of some other dictionaries, the Unicode º MASCULINE ORDINAL INDICATOR character is used for this same purpose; we should be consistent across dictionaries in this detail. In some places, notably in French dictionaries, the º MASCULINE ORDINAL INDICATOR is appropriate; for instance 1º, etc. may appropriately use this M.O.I. character, as I understand it.
<usage> example:
Here is an example:
<H1><h><hc3>200</hc3><key1>aDa</key1><hc1>1</hc1><key2>a/Da</key2></h><body>
<OR group="3951,aDa;3953.2,aDA"/> or <s>a/DA</s> <lex>ind.</lex> , <ab>Ved.</ab> (<eq/>
<s>a/Ta</s> , used chiefly as an inceptive particle) , now , then , therefore , moreover , so much the
more , and , partly.
<usage><idiom><s>a/Da</s><ellipsis/><s>a/Da</s></idiom><sense> as well as , partly partly.
</sense></usage>
</body><tail><MW>002776</MW> <pc>19,3</pc> <L>3951</L></tail></H1>
The former coding used an index into a supplementary file to retrieve the underlying Greek Unicode text for displays. This has been changed so that the Greek text is part of the 'mw.txt' digitization.
The 3rd homonym of headword a illustrates this.
Old:
<H1><h><hc3>000</hc3><key1>a</key1><hc1>1</hc1><key2>a</key2><hom>3</hom></h>
<body> ( before a vowel <s>an</s> , exc. <s>a-fRin</s>) , a prefix corresponding to
<ab>Gk.</ab> <gk>1</gk> , <gk>2</gk> ,
<ab>Lat.</ab> <etym>in</etym> , <ab>Goth.</ab> and <ab>Germ.</ab> <etym>un</etym> ,
<ab>Eng.</ab> <etym>in</etym> or <etym>un</etym> , and having a negative or privative or
contrary sense (<s>an-eka</s> not one ; <s>an-anta</s> endless ; <s>a-sat</s> not good ; <s>a-
paSyat</s> not seeing ) </body><tail><mul/> <pc>1,1</pc> <L>4</L></tail></H1>
New:
<L>4<pc>1,1<k1>a<k2>a<h>3<e>1
<s>a</s> <hom>3</hom> ¦ ( before a vowel <s>an</s> , exc. <s>a-fRin</s>) ,
a prefix corresponding to
<ab>Gk.</ab> <lang n="greek">ἀ</lang> , <lang n="greek">ἀν</lang> ,
<ab>Lat.</ab> <etym>in</etym> , <ab>Goth.</ab> and <ab>Germ.</ab> <etym>un</etym> ,
<ab>Eng.</ab> <etym>in</etym> or <etym>un</etym> , and having a negative or privative or
contrary sense (<s>an-eka</s> not one ; <s>an-anta</s> endless ; <s>a-sat</s> not good ;
<s>a-paSyat</s> not seeing )
<LEND>
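The Old/New pair above amounts to replacing each `<gk>N</gk>` index with the Greek text itself. A minimal sketch of that step follows; the layout of the supplementary file is not described in this thread, so the lookup table `GREEK`, keyed by (L-number, index), and the function name `inline_greek` are purely hypothetical.

```python
import re

# Hypothetical stand-in for the old supplementary file:
# (L-number, index within the entry) -> Greek text.
GREEK = {
    ("4", "1"): "\u1f00",        # alpha with psili
    ("4", "2"): "\u1f00\u03bd",  # alpha with psili + nu
}

def inline_greek(lnum: str, body: str) -> str:
    """Replace <gk>N</gk> placeholders with <lang n="greek">...</lang>."""
    def repl(match: re.Match) -> str:
        return '<lang n="greek">%s</lang>' % GREEK[(lnum, match.group(1))]
    return re.sub(r"<gk>(\d+)</gk>", repl, body)

print(inline_greek("4", "<ab>Gk.</ab> <gk>1</gk> , <gk>2</gk> ,"))
```

A missing key raises `KeyError`, which is probably what one wants in a one-off conversion: an unmatched index is reported instead of being silently dropped.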
🍾
Hurray, long live the Jim.
This issue is devoted to comments regarding the meta-line/IAST conversion of the Cologne digitization of the Monier-Williams Sanskrit-English Dictionary, 1899. This conversion will present some unique challenges.
We'll have to carefully consider the pluses/minuses associated with breaking
The current best documentation of the XML tags is mwtags.html.