sanskrit-lexicon / mw-dev

Development version of MW dictionary, to collaborate with Andhrabharati

mw_ab observations, 1 #12

Open funderburkjim opened 1 year ago

funderburkjim commented 1 year ago

These comments made on mw_ab as of commit 52c972817f25b591784c66b81e428b606336bcd1

orig

all 4 mw files have the same number of lines.
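A line-count check like the one stated above can be scripted. This is a minimal sketch; the function names and the file names in the usage comment are placeholders, not the actual names of the four files.

```python
def line_counts(paths):
    """Map each path to its number of lines (files assumed to be UTF-8)."""
    counts = {}
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            counts[str(path)] = sum(1 for _ in handle)
    return counts


def same_line_count(paths):
    """True when every file in paths has the same number of lines."""
    return len(set(line_counts(paths).values())) <= 1


# Hypothetical usage; substitute the real file names:
# print(line_counts(["mw_0.txt", "mw_AB.txt"]))
```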

features of mw_AB.txt

'; delete' lines

Marks lines to be removed in mw_AB. These '; delete' lines are placeholders that facilitate comparison. The lines so marked include:

metalines ending in <e>#X and accompanying <LEND>

Note: Found 4 similar lines not marked as '; delete':

4 matches for "<e>[0-9][A-E]" in buffer: mw_AB.txt
  34904:<L>9430.1<pc>1314,1<k1>aparagodāni<k2>apara—godāni<e>3C=
 122269:<L>34449.1<pc>1322,1<k1>upacaryā<k2>upa-caryā<e>2B
 129904:<L>36686.1<pc>1322,2<k1>upāhita<k2>upāhita<e>2B
 137824:<L>39016.1<pc>1323,1<k1>ṛṣabhā<k2>ṛṣabhā<e>2B
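The search above can be reproduced outside the editor. The sketch below assumes that a marked line carries the literal text '; delete' somewhere on the line; how the marker is actually attached in mw_AB.txt is not stated here, so that test is an assumption, and the function name is illustrative.

```python
import re

# Same pattern as the buffer search: <e>, one digit, one letter A-E.
LETTERED_E = re.compile(r"<e>[0-9][A-E]")


def unmarked_lettered_metalines(lines):
    """Return (line_number, text) for lines that match the pattern
    but do not contain the '; delete' marker (assumed to be literal)."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if LETTERED_E.search(line) and "; delete" not in line:
            hits.append((number, line.rstrip("\n")))
    return hits
```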

tab character before broken bar

All broken bar characters are preceded by a 'tab' character, with one exception:

1 match for "[^ ]¦" in buffer: mw_AB.txt
 257848:<s>chinnaka—tara</s> <lex>mfn.</lex> (<ab>compar.</ab>), ¦ <ls>Pāṇ. v, 3, 72</ls>, <ab>Vārtt.</ab> 5.
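A small check for this convention, as a sketch: it reports every line where a broken bar (U+00A6) is not immediately preceded by a tab, including a bar at the very start of a line. The function name is illustrative.

```python
import re

BROKEN_BAR = "\u00a6"
# A broken bar at line start, or preceded by any character other than a tab.
NO_TAB_BEFORE_BAR = re.compile(r"(?:^|[^\t])" + BROKEN_BAR)


def bars_without_tab(lines):
    """Return (line_number, text) for lines that break the tab-before-bar rule."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if NO_TAB_BEFORE_BAR.search(line)
    ]
```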

<srs/> markup change

Left-Pointing Angle Bracket (U+2329) and Right-Pointing Angle Bracket (U+232A) enclose the vowel (or accented vowel). Example:

<s>prā<srs/>s</s> -> <s>pr〈ā〉s</s>
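The change in the example can be sketched as a regex rewrite. This assumes that `<srs/>` always directly follows the vowel it applies to, and that an accented vowel is one base letter plus optional combining marks; neither assumption has been checked against the whole file.

```python
import re
import unicodedata

LEFT, RIGHT = "\u2329", "\u232a"  # the two angle brackets named above

# One letter, any combining marks on it, then the old <srs/> marker.
VOWEL_SRS = re.compile(r"([^\W\d_][\u0300-\u036f]*)<srs/>")


def convert_srs(text):
    """Replace 'V<srs/>' with the vowel V enclosed in angle brackets."""
    def enclose(match):
        # Normalize only the vowel, never the brackets (see note below).
        vowel = unicodedata.normalize("NFC", match.group(1))
        return LEFT + vowel + RIGHT
    return VOWEL_SRS.sub(enclose, text)
```

One caution with these two code points: Unicode normalization (NFC as well as NFD) maps U+2329 and U+232A to U+3008 and U+3009, so normalizing the whole file after the conversion would silently change the brackets.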

homonym markup

⒈ U+2488 Digit One Full Stop replaces <hom>1.</hom>
⒉ U+2489 Digit Two Full Stop replaces <hom>2.</hom>
⒊ U+248A Digit Three Full Stop replaces <hom>3.</hom>
⒋ U+248B Digit Four Full Stop
⒌ U+248C Digit Five Full Stop
⒍ U+248D Digit Six Full Stop
⒎ U+248E Digit Seven Full Stop
⒏ U+248F Digit Eight Full Stop
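The mapping is regular: homonym number N corresponds to code point U+2488 + (N - 1). A minimal sketch of the replacement, covering only the numbers 1 to 8 listed here (letter homonyms such as `<hom>a</hom>` are left untouched; the function name is illustrative):

```python
import re

HOM_NUMBER = re.compile(r"<hom>([1-8])\.</hom>")


def hom_to_digit_full_stop(text):
    """Replace <hom>N.</hom> (N = 1..8) with the 'Digit N Full Stop' character."""
    return HOM_NUMBER.sub(
        lambda m: chr(0x2488 + int(m.group(1)) - 1),
        text,
    )
```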

removal of <h>a markup in metaline

Note: This may have impact on a list display feature. Example:

OLD
<L>213.2<pc>1308,1<k1>akalmāṣa<k2>á-kalmāṣa<h>a<e>1
<s>á-kalmāṣa</s> <hom>a</hom> ¦ <lex>mf(<s>ī</s>)n.</lex> spotless, <ls>ŚBr.</ls>

NEW
<L>213.2<pc>1308,1<k1>akalmāṣa<k2>á-kalmāṣa<e>1
<s>á-kalmāṣa</s>, <lex>mf(<s>{ī}</s>)n.</lex>  ¦ spotless, <ls>ŚBr.</ls> //add(1308,1)
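For comparison work it can help to apply the same metaline change to the old file. The sketch below removes an `<h>` field only when its value is a single lowercase letter, and leaves any numeric `<h>` values alone; whether that matches every case in mw_AB.txt is an assumption.

```python
import re

# <h> followed by one lowercase letter, ending at the next field or end of line.
H_LETTER = re.compile(r"<h>[a-z](?=<|$)")


def drop_h_letter(metaline):
    """Remove a letter-valued <h> field from a metaline."""
    return H_LETTER.sub("", metaline)


old = "<L>213.2<pc>1308,1<k1>akalmāṣa<k2>á-kalmāṣa<h>a<e>1"
print(drop_h_letter(old))
```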

Curly bracket in Sanskrit lex ?

Example

<s>dākṣa</s> ¦ <lex>mf(<s>ī</s>)n.</lex>
<s>dākṣa</s>, <lex>mf(<s>{ī}</s>)n.</lex>

//add and //rev

833 matches in 832 lines for "//" in buffer: mw_AB.txt

Purpose of this markup? Examples:

    ¦ a ray, sunbeam; //rev(1308,1)
    ¦ night, <ab>ib.</ab> //add(1308,1)
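Whatever the purpose turns out to be, the two examples have a regular shape that is easy to extract. The sketch below assumes the annotation sits at the end of the line and that the two numbers are page and column, as in the `<pc>` field; both points are guesses from the examples, and the names are illustrative.

```python
import re
from typing import NamedTuple, Optional


class PrintNote(NamedTuple):
    kind: str     # "add" or "rev"
    page: int     # assumed: page number, as in <pc>
    column: int   # assumed: column number, as in <pc>


NOTE = re.compile(r"//(add|rev)\((\d+),(\d+)\)\s*$")


def parse_note(line: str) -> Optional[PrintNote]:
    """Return the trailing //add(...) or //rev(...) annotation, if any."""
    match = NOTE.search(line)
    if match is None:
        return None
    return PrintNote(match.group(1), int(match.group(2)), int(match.group(3)))
```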

cl. markup

The verb classes markup changed. Example:

OLD:
<s>aṃś</s> ¦ <ab>cl.</ab> 10. <ab>P.</ab> <s>aṃśayati</s>, to divide  .....
NEW:
<s>aṃś</s>, ¦ <cl>cl. ⓾</cl> <ab>P.</ab> <s>aṃśayati</s>, to divide

Special Unicode characters:

⓵ U+2460 Circled Digit One ... and others in the Unicode block “Enclosed Alphanumerics”

⓾ U+24FE Double Circled Number 10

Why doubled? Suggest using U+2469 Circled Number 10.
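The single-circled numbers 1 to 10 are contiguous (U+2460 to U+2469), so the class number can be computed rather than looked up. A minimal sketch of that computation and of the suggested replacement for the double-circled ten (function names are illustrative):

```python
def circled_number(n):
    """Return the single-circled character for a verb class 1..10."""
    if not 1 <= n <= 10:
        raise ValueError("verb class must be between 1 and 10")
    return chr(0x2460 + n - 1)


def fix_double_circled_ten(text):
    """Replace U+24FE (double circled ten) with U+2469 (circled ten)."""
    return text.replace("\u24fe", "\u2469")
```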

Note: the <cl> tag is new to one.dtd (mw.dtd).

Andhrabharati commented 1 year ago

metalines ending in <e>#X and accompanying <LEND>
Note: Found 4 similar lines not marked as '; delete'

This is corrected in the current update, to be pushed yet.

tab character before broken bar
One exception

This is corrected in the current update, to be pushed yet.

removal of <h>a markup in metaline
Note: This may have impact on a list display feature.

Not just the 'a'; I have removed all the letters ('a' to 'r') in such places!! I will give my reasons for this soon (in brief, this needs some 'major revision').

Curly bracket in Sanskrit lex ?

These are the word endings, with or without accent marks, that decide the lexical info of the HW. And these might better be 'shown' in a different style than the HW style, so as to differentiate them appropriately.

Some examples are--

image image

Here the accent mark is to be removed in the HW!

image image

⓾ U+24FE Double Circled Number 10
Why doubled? Suggest use U+2469 Circled Number 10

This is missed earlier and corrected in the current update, to be pushed yet.

Note <cl> tag is new to one.dtd (mw.dtd)

There are some more tags introduced in my working--

<ar> </ar>
<cl> </cl>
<col> </col>
<div n="cf"/>
<div n="cl"/>
<div n="p"/>
<ety> </ety>
<gk> </gk>
<lang> </lang>
<ln> </ln>
<nt> </nt>
<pcol> </pcol>
<pe> </pe>
<pg> </pg>
<sch> </sch>

Some of these are revisions/extensions of the present tags, and some are newly introduced.
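A quick way to audit which element names actually occur in the file, and which of them fall outside a known set, is sketched below. The `known` set is supplied by the caller; the one in the usage comment is a made-up placeholder, not the content of one.dtd. Metaline fields such as `<L>` and `<pc>` will be counted too, since they look like opening tags.

```python
import re
from collections import Counter

# Opening or self-closing tags only; closing tags like </s> are skipped.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)(?:\s[^<>]*)?/?>")


def element_counts(text):
    """Count opening (or self-closing) tags by element name."""
    return Counter(match.group(1) for match in OPEN_TAG.finditer(text))


def unknown_elements(text, known):
    """Return the sorted element names in text that are not in known."""
    return sorted(set(element_counts(text)) - set(known))


# Hypothetical usage:
# unknown_elements(open("mw_AB.txt", encoding="utf-8").read(), {"s", "lex", "ab", "ls"})
```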

Andhrabharati commented 1 year ago

//add and //rev

833 matches in 832 lines for "//" in buffer: mw_AB.txt
Purpose of this markup? Examples:

  ¦ a ray, sunbeam; //rev(1308,1)
  ¦ night, <ab>ib.</ab> //add(1308,1)

image image

Here a revision has been made with respect to the print text, so the user should somehow be informed of this change.

image

Just a comparison with the current CDSL display, which is not as 'appealing' as the above style.

------------------------------------

image image

Here an addition has been made with respect to the print text, so the user should somehow be informed of this change.

Andhrabharati commented 1 year ago

Pushed the latest update of mw_AB.txt

funderburkjim commented 1 year ago

An additional xml element noticed: cse.

Request a brief usage description for the new xml elements. This description will help in modifying the displays.

funderburkjim commented 1 year ago

Another dash character

A new dash character was noticed after commit 371f2da1.

693 matches in 367 lines for "‒" in buffer: mw_AB.txt

This is Figure Dash (General Punctuation) U+2012

This was noticed as an unexpected character within <s>xxx</s>.

81 matches in 49 lines for "<s>[^<]*‒" in buffer: mw_AB.txt
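Counts in the "N matches in M lines" form can be reproduced with a short script. This is a sketch with an illustrative function name; it counts non-overlapping regex matches per line, which is assumed to agree with how the editor search counts them.

```python
import re


def count_matches(pattern, lines):
    """Return (total_matches, lines_with_a_match) for a regex over lines."""
    regex = re.compile(pattern)
    total = 0
    matching_lines = 0
    for line in lines:
        found = sum(1 for _ in regex.finditer(line))
        if found:
            total += found
            matching_lines += 1
    return total, matching_lines


# Hypothetical usage:
# with open("mw_AB.txt", encoding="utf-8") as f:
#     print(count_matches("\u2012", f))                # all figure dashes
# with open("mw_AB.txt", encoding="utf-8") as f:
#     print(count_matches("<s>[^<]*\u2012", f))        # figure dash inside <s>
```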

Request @Andhrabharati describe the purpose of using this particular character. How does intended use compare with intended use of — Em Dash U+2014 ?

Andhrabharati commented 1 year ago

I will describe the different tags and characters I have opted to use, during today's session.

For now, just note that I changed <ety> ...</ety> to <cog> ... </cog>

Andhrabharati commented 1 year ago

A new dash character noticed after commit 371f2da. 693 matches in 367 lines for "‒" in buffer: mw_AB.txt

This is Figure Dash (General Punctuation) U+2012

This was noticed as an unexpected character within <s>xxx</s> .

In fact, the double ‒‒ is the Em Dash employed in MW print, as a 'correlative' element. The Em Dash used in mw.txt is more like an En Dash in the MW print. I will prepare the issue on this today.

A single ‒ is more like an En Dash in MW print (to denote a range); I opted to use the Figure Dash in its place (to distinguish the two characters), as I am thinking of replacing the Em Dashes (in the file) with En Dashes once for all, after describing the various discrepancies in an issue.

And of course, in the end, this Em Dash > En Dash is to be changed once more, to a hyphen!!

Andhrabharati commented 1 year ago

Changed in my current revision (in addition to a great many other changes)--

"‒‒" (figure-dash twice, U+2012) to Em Dash [between the correlative terms etc.];

"‒" (figure-dash, U+2012) to En Dash [denoting a range of numbers etc.] and

"—" (Em Dash) to a Box Drawing Heavy Horizontal (U+2501) [so that the 'sense' is still retained (JIC it is needed at some time), but it is shorter and less distracting than the much wider Em Dash that is used at present]

Also introduced a minus sign ("−", U+2212) at line 775433 [‘20 − 7 <ab>i. e.</ab> 13’]
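If the three dash changes listed above were scripted, the order of the steps would matter, because the Em Dash is both a source and a target. The sketch below is one possible ordering, not necessarily how the revision was actually made: existing Em Dashes are moved out of the way first, then doubled figure dashes become Em Dashes, then the remaining single figure dashes become En Dashes.

```python
FIGURE_DASH = "\u2012"
EN_DASH = "\u2013"
EM_DASH = "\u2014"
HEAVY_HORIZONTAL = "\u2501"  # Box Drawings Heavy Horizontal


def revise_dashes(text):
    """Apply the three dash replacements in a collision-free order."""
    # 1. Old Em Dashes first, so they are not confused with the new ones.
    text = text.replace(EM_DASH, HEAVY_HORIZONTAL)
    # 2. Doubled figure dash becomes the (new) Em Dash.
    text = text.replace(FIGURE_DASH * 2, EM_DASH)
    # 3. Any remaining single figure dash becomes an En Dash.
    text = text.replace(FIGURE_DASH, EN_DASH)
    return text
```

Running step 2 before step 1 would turn the newly created Em Dashes into U+2501 as well, which is why the existing Em Dashes are handled first.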