sanskrit-lexicon / MWS

Monier Monier-Williams, Sir; A Sanskrit-English dictionary. Oxford, 1899

Consistency in <hom> tagging #131

Closed Andhrabharati closed 1 year ago

Andhrabharati commented 2 years ago
  1. In general, a <hom> number (when not combined with a letter) is followed by a dot, and a <hom> letter has no dot. But the odd ones out are:

<hom>1.</hom> 4706    <hom>1</hom> 70
<hom>2.</hom> 4362    <hom>2</hom> 59
<hom>3.</hom>  882    <hom>3</hom> 16
<hom>4.</hom>  226    <hom>4</hom>  3
<hom>5.</hom>  126    <hom>5</hom> 12
<hom>6.</hom>   16    <hom>6</hom>  1

<hom>a</hom> 5179    <hom>a.</hom> 62
<hom>b</hom> 4963    <hom>b.</hom> 55
<hom>c</hom>  144    <hom>c.</hom>  6

  2. And the <hom> tag placements with respect to the Sanskrit word are:

"<hom>[0-9](.*?)</hom> <s>" 9579
"<hom>[0-9](.*?)</hom><s>" 1
"</s> <hom>[0-9]" 227

"</s> <hom>[a-z]" 10317
"<hom>[a-z](.*?)> <s>" 1788
"<hom>[a-z](.*?)><s>" 4

Should these not be made consistent throughout? [The <hom> number to precede the <s> word (with a space) & the <hom> letter to follow the <s> word (with a space).]

---------------

[Side-note: There are 11 "<s>" and 1 "</s>" occurrences.]
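Counts like the ones above can be reproduced with a few regular expressions. The following is only a minimal sketch: the function names and the sample string are hypothetical, and the placement patterns are simplified versions of the ones quoted in this comment (they use `[^<]*` in place of `(.*?)` so that a match cannot run across tags).

```python
import re
from collections import Counter

# Matches the value between <hom> and </hom>, e.g. "1.", "2", "a", "b."
HOM_RE = re.compile(r"<hom>(.*?)</hom>")

# Simplified placement patterns (an assumption, not the exact searches used above).
PLACEMENT_PATTERNS = {
    "number before <s>, with space": r"<hom>[0-9][^<]*</hom> <s>",
    "number before <s>, no space": r"<hom>[0-9][^<]*</hom><s>",
    "number after </s>": r"</s> <hom>[0-9]",
    "letter after </s>": r"</s> <hom>[a-z]",
}

def count_hom_values(text):
    """Count how often each distinct <hom> value occurs in the text."""
    return Counter(HOM_RE.findall(text))

def count_placements(text):
    """Count how often each placement pattern occurs in the text."""
    return {name: len(re.findall(pattern, text))
            for name, pattern in PLACEMENT_PATTERNS.items()}

# Hypothetical sample text, not taken from mw.txt.
sample = ("<hom>1.</hom> <s>aMSa</s> foo <s>aMSa</s> <hom>a</hom> bar "
          "<hom>2</hom><s>kara</s>")

if __name__ == "__main__":
    print(count_hom_values(sample))
    print(count_placements(sample))
```

Running the same two functions over the whole of mw.txt (read into one string) would give a table comparable to the one above, within the limits of the simplified patterns.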

Andhrabharati commented 2 years ago

One interesting observation:

The <hom>letter tags 'a', 'b' etc. are introduced additionally in the text, which are NOT in the print, for some reason.

As a minimum, the 'a' and 'b' counts should be equal, i.e. every 'a' should have associated 'b' somewhere in the text; otherwise having an isolated 'a' has no meaning. (Having lesser numbers for the next letters is understandable.)

But the counts are different: image

Resolving this so that the counts no longer differ would be a good exercise indeed.

--------------------------

And the same condition (equal numbers) applies to <hom>1 (total counts: and <hom>2 as well, the counts being as below:

image

funderburkjim commented 2 years ago

some progress

Some of the inconsistencies pointed out above have been addressed.

(158) <hom>n</hom> -> <hom>n.</hom>  (n a digit, 1 to 6)
(123) <hom>n\.</hom> -> <hom>n</hom>  (n a lower-case letter, [a-c])
(   1) </hom><s> -> </hom> <s>
( 31) in the 'headline' (line after metaline), numeric hom precedes headword
   <s>X</s> <hom>n.</hom> ¦   -> <hom>n.</hom> <s>X</s>  ¦ 
( 15) numeric hom to precede Sanskrit after = 
   = <s>X</s> <hom>n.</hom>   -> = <hom>n.</hom> <s>X</s>
NOTE:  There remain 47 cases of form '<s>X</s> <hom>n.</hom>', but these
   require no changes since the form is always
   '<s>X</s> <hom>n.</hom> <s>Y</s>'
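The substitutions listed above could be expressed as a sequence of regular-expression rewrites. This is only a sketch under stated assumptions, not the script that was actually run: `normalize_hom` is a hypothetical name, the rules are applied line by line without checking that a line really is a headline, and the negative lookahead that skips the '<s>X</s> <hom>n.</hom> <s>Y</s>' form is my own guess at how to leave those 47 cases alone.

```python
import re

def normalize_hom(line):
    """Apply the hom normalizations described above to one line of text."""
    # <hom>n</hom> -> <hom>n.</hom> (n a digit)
    line = re.sub(r"<hom>([0-9])</hom>", r"<hom>\1.</hom>", line)
    # <hom>n.</hom> -> <hom>n</hom> (n a lower-case letter)
    line = re.sub(r"<hom>([a-z])\.</hom>", r"<hom>\1</hom>", line)
    # </hom><s> -> </hom> <s> (restricted here to numeric homs)
    line = re.sub(r"(<hom>[0-9]\.</hom>)<s>", r"\1 <s>", line)
    # Headline: <s>X</s> <hom>n.</hom> ¦ -> <hom>n.</hom> <s>X</s> ¦
    line = re.sub(r"(<s>[^<]*</s>) (<hom>[0-9]\.</hom>)(\s*¦)", r"\2 \1\3", line)
    # After '=': = <s>X</s> <hom>n.</hom> -> = <hom>n.</hom> <s>X</s>,
    # unless the hom is followed by another <s>Y</s> (it then belongs to Y).
    line = re.sub(r"= (<s>[^<]*</s>) (<hom>[0-9]\.</hom>)(?! <s>)", r"= \2 \1", line)
    return line

if __name__ == "__main__":
    # Hypothetical sample lines, not taken from mw.txt.
    for sample in ["<s>kara</s> <hom>2</hom> ¦ doing",
                   "<s>aMSa</s> <hom>a.</hom> ¦ a share",
                   "= <s>kf</s> <hom>1.</hom>, to do"]:
        print(normalize_hom(sample))
```

The order of the rules matters: the dot is normalized first so that the later patterns only need to match the '<hom>n.</hom>' form.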

number-letter homs

There were several (about 140) cases of either <h>[0-9][a-z] or <hom>[0-9][a-z]. After the changes, 66 of each remain.

funderburkjim commented 2 years ago

homonyms in the list display

The dictionary identifies homonyms for about 5700 entries (according to the mw.txt digitization). Sometimes the homonym variants appear 'next to' or 'near' each other in the dictionary ordering. But often they appear some distance apart, and it is sometimes useful to navigate among the homonyms in the list display to see the dictionary context of the variants. The 'arrows' in the list display provide this functionality.

image

For instance, clicking on the 'yellow' arrow will change the left-hand list pane to be centered at that second homonym.

image

funderburkjim commented 2 years ago

why 'letter' homonyms ?

In MW, a given headword can appear at different dictionary locations, yet these different locations are not identified by the author as homonym variants.

Letter homonyms were 'invented' by Peter Scharf to permit the list display navigation to these otherwise unmarked entries. Since MW's printed homonym codes are numeric, there is no confusion between MW's printed homonyms and the synthetic letter homonyms.

The list display homonym navigation feature uses only the metaline <h>X values. The <hom>X</hom> code is unused by the navigation, but is used to style the homonym value X (red color).

Here is an example: image

funderburkjim commented 2 years ago

letter homonym in metaline

The <h>[a-z] codes were originally assigned by a Java program developed by Pawan Goyal, working with Peter. And I have adapted this to the current format of mw.txt.

This display feature makes use of the <h> element in the metaline of entries. 10645 matches for "<h>[a-z]" in buffer: temp_mw_1.txt.
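Since the navigation depends only on the metaline <h> values, the earlier question about an isolated 'a' with no matching 'b' can be checked per headword from the metalines alone. The sketch below assumes a metaline of the form `<L>...<pc>...<k1>X<k2>Y<h>Z<e>...`; that layout, the function names, and the sample lines are assumptions for illustration and should be checked against the actual mw.txt.

```python
import re
from collections import defaultdict

# Assumed metaline layout: <L>id<pc>page<k1>key1<k2>key2<h>hom<e>code
META_RE = re.compile(r"<k1>([^<]*)<k2>[^<]*<h>([^<]*)")

def letter_homs_by_headword(lines):
    """Map each headword (k1) to the set of single-letter <h> values it carries."""
    groups = defaultdict(set)
    for line in lines:
        if not line.startswith("<L>"):
            continue  # not a metaline
        match = META_RE.search(line)
        if match and re.fullmatch(r"[a-z]", match.group(2)):
            groups[match.group(1)].add(match.group(2))
    return groups

def isolated_headwords(groups):
    """Headwords with only one letter homonym, i.e. nothing to navigate to."""
    return sorted(k1 for k1, homs in groups.items() if len(homs) == 1)

# Hypothetical metalines, not taken from mw.txt.
sample_lines = [
    "<L>1<pc>1,1<k1>aMSa<k2>aMSa<h>a<e>1",
    "<L>2<pc>1,2<k1>aMSa<k2>aMSa<h>b<e>2",
    "<L>3<pc>2,1<k1>kara<k2>kara<h>a<e>1",
    "<L>4<pc>2,1<k1>kf<k2>kf<h>1<e>1",
]

if __name__ == "__main__":
    print(isolated_headwords(letter_homs_by_headword(sample_lines)))
```

Grouping by headword is stricter than comparing the global 'a' and 'b' totals, because equal totals could still hide an unpaired 'a' on one headword and an unpaired 'b' on another.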

I don't recall whether Pawan's original work introduced the few (100+) 'letter-number' h-values, or whether that was done by me at some time.

letter homonym in entry body

As I recall, it was my choice to put the <hom>X</hom> (X a letter) into entry text. And I must have for some reason chosen to put the letter homs AFTER the headword, such as <s>X</s><hom>Y</hom> ¦

My current view is that I should have

I have not made the changes just suggested in the 'I should have...' items.

Request comment by others as to whether this should be done.

Even if the suggested change is made, there are still cases where it is difficult to know how to add markup. For example, in 'or' groups, where the second word has a homonym designation.

funderburkjim commented 2 years ago

I am sure there is further work to be done regarding homonyms in mw, but will put this aside for now as the situation is still unclear to me in several aspects.

gasyoun commented 2 years ago

By contrast, the empty tag is interpreted as being purely markup.

Makes sense.

Andhrabharati commented 1 year ago

I have "handled" the matter in my current full review, and thus this issue can be closed now.