sanskrit-lexicon / mw-dev

Development version of MW dictionary, to collaborate with Andhrabharati

MW full-review-002: Issues wrt vowel-sandhi marker (<srs/>) #3

Closed Andhrabharati closed 1 year ago

Andhrabharati commented 1 year ago

Obs.1-a There are 141 instances of <srs/> </s>, which could be changed to <srs/></s>, and then

Obs.1-b There are 62 instances of <srs/> <s>, which could be changed to <srs/><s>, and then

Obs.1-c There are 65 instances of </s> <srs/>, which could be changed to </s><srs/>, and then

Obs.1-d There are 62 instances of </s><srs/><s>, which could be changed to <srs/>.
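
The four substitutions in Obs.1-a to Obs.1-d could be sketched as sequential string replacements, applied in the order listed. This is only a minimal sketch, not the script actually used for this issue; the function name `normalize_srs` and the sample line are hypothetical.

```python
def normalize_srs(line: str) -> str:
    """Apply the Obs.1-a .. Obs.1-d substitutions in order (hypothetical helper)."""
    # Obs.1-a: drop the space between <srs/> and a closing </s>
    line = line.replace("<srs/> </s>", "<srs/></s>")
    # Obs.1-b: drop the space between <srs/> and an opening <s>
    line = line.replace("<srs/> <s>", "<srs/><s>")
    # Obs.1-c: drop the space between a closing </s> and <srs/>
    line = line.replace("</s> <srs/>", "</s><srs/>")
    # Obs.1-d: merge two adjacent <s> spans joined only by <srs/>
    line = line.replace("</s><srs/><s>", "<srs/>")
    return line


# Made-up sample line, not taken from mw.txt
sample = "<s>AAA</s> <srs/> <s>BBB</s>"
result = normalize_srs(sample)
```

The order matters: the first three steps remove stray spaces so that the fourth pattern can match.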

Now we reach the inconsistencies:

case-A: Although the vast majority (~38K) of vowel-sandhi cases are marked in mw.txt, I observed that a few thousand more are not marked with <srs/>.

case-B: If the aim is to stay close to the printed text, all 4 types of vowel-sandhi should be marked; having just one type results in ambiguity in some places.

So, option-1: Do we remove the <srs/> markers altogether (the easiest solution)? [In <ls> strings, this marker (as a circumflex) has already been removed!] [Now, we have tools and means to identify constituent words in a composite word, so having the vowel-sandhi marker(s) is not so essential.]

option-2: Or do we make them unambiguous by rendering them in 4 types, and also fill in the 'missed' places (which needs some effort)? [This brings the text close to print, if some means of displaying these markers is devised in future; as of now, the single <srs/> marker is invisible to the end-user anyway!]

funderburkjim commented 1 year ago

I can't find any instances of Obs.1-a, Obs.1-b, or Obs.1-c. These patterns are not found in the current mw.txt (csl-orig/v02/mw/mw.txt).

I do find the 62 cases of Obs.1-d, and agree with the suggested change.

funderburkjim commented 1 year ago

> we have tools and means to identify constituent words in a composite word

Where are these tools?

The <srs/> markup was used by me in https://github.com/funderburkjim/MWderivations in relation to parsing compounds.

As you notice, <srs/> markup is invisible in all displays.
If <srs/> is dropped, that will be an information loss. But perhaps, somewhat like the dropping of <info lex=".."/>, this information loss is not deemed important in your work.

My opinion -- just drop <srs/> in your work (easiest solution).

funderburkjim commented 1 year ago

Obs.1-d change made in cdsl mw. See commit link above.

Andhrabharati commented 1 year ago

@funderburkjim I have decided to keep the marking, but in another form(!!); this could be displayed in a different color to indicate the vowel-sandhi character(s), with or without accents.

This is the summary:

No accent:

ā<srs/>     〈ā〉     27893
ī<srs/>     〈ī〉     1018
ū<srs/>     〈ū〉     438
e<srs/>     〈e〉     3121
ai<srs/>    〈ai〉    372
o<srs/>     〈o〉     4584
au<srs/>    〈au〉    319

Acute accent:

ā́<srs/>     〈ā́〉     293
ī́<srs/>     〈ī́〉     26
ū́<srs/>     〈ū́〉     8
é<srs/>     〈é〉     60
aí<srs/>    〈aí〉    9
ó<srs/>     〈ó〉     55
aú<srs/>    〈aú〉    9

Grave accent:

ī̀<srs/>     〈ī̀〉     4

Also, I will mark the leftovers (the missed ones) once I start proofreading the full text.
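
A tally like the summary above could be produced roughly as follows. This is a hedged sketch, not the method actually used: it assumes the text is in the IAST-style transliteration shown in the table (a file in a different transliteration would need different character classes), and the names `SANDHI_VOWEL` and `count_sandhi_vowels` and the sample string are hypothetical.

```python
import re
import unicodedata
from collections import Counter

# Work on NFD text so accents are separate combining marks:
# U+0304 = macron, U+0301 = acute, U+0300 = grave.
# Diphthongs (ai, au) are tried first, then single vowels.
SANDHI_VOWEL = re.compile(
    r"(a[iu][\u0300\u0301]?|[aeiou]\u0304?[\u0300\u0301]?)<srs/>"
)


def count_sandhi_vowels(text: str) -> Counter:
    """Count the vowel (with any accent) immediately preceding each <srs/>."""
    nfd = unicodedata.normalize("NFD", text)
    return Counter(
        unicodedata.normalize("NFC", m.group(1))
        for m in SANDHI_VOWEL.finditer(nfd)
    )


# Made-up sample, not taken from mw.txt
sample = "mah\u0101<srs/>x r\u0101je<srs/>y vai<srs/>z mah\u0101\u0301<srs/>w"
counts = count_sandhi_vowels(sample)
```

As written, the pattern also accepts short vowels before `<srs/>`, so any such hits would show up as extra keys worth inspecting.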

Andhrabharati commented 1 year ago

I have also noticed 3 <srs/> instances outside the <s>...</s> string (lines 37165, 142165 and 146704), and one instance of repeating <srs/><srs/> (line 724014).
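
Both kinds of anomaly could be located with a short scan along these lines. It is a minimal sketch under stated assumptions: `<s>` tags carry no attributes, are not nested, and open and close on the same line. The function name `find_srs_anomalies` and the sample lines are hypothetical.

```python
import re

# Non-greedy match of one <s>...</s> span on a single line
S_SPAN = re.compile(r"<s>.*?</s>")


def find_srs_anomalies(lines):
    """Return (outside, doubled): 1-based line numbers of each anomaly."""
    outside, doubled = [], []
    for lineno, line in enumerate(lines, start=1):
        # Remove every <s>...</s> span; any <srs/> left over lies outside
        if "<srs/>" in S_SPAN.sub("", line):
            outside.append(lineno)
        # Two markers back to back
        if "<srs/><srs/>" in line:
            doubled.append(lineno)
    return outside, doubled


# Made-up sample lines, not taken from mw.txt
sample_lines = [
    "<s>a<srs/>b</s> ok",
    "x<srs/> <s>c</s>",
    "<s>d<srs/><srs/>e</s>",
]
outside, doubled = find_srs_anomalies(sample_lines)
```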

funderburkjim commented 1 year ago

What is the 'other form'?

Note: in mwderivations, I found it convenient to replace <srs/> by a single character not appearing elsewhere in mw.txt; I chose the '@' character.
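
The single-character substitution might look like the sketch below. It is not the code from MWderivations; the function names `encode_srs` and `decode_srs` are hypothetical. The check guards the stated precondition that the placeholder appears nowhere else in the text.

```python
def encode_srs(text: str, placeholder: str = "@") -> str:
    """Replace each <srs/> with a single placeholder character."""
    if placeholder in text:
        # The substitution is only reversible if the placeholder is unused
        raise ValueError(f"placeholder {placeholder!r} already occurs in text")
    return text.replace("<srs/>", placeholder)


def decode_srs(text: str, placeholder: str = "@") -> str:
    """Restore <srs/> from the placeholder character."""
    return text.replace(placeholder, "<srs/>")


# Made-up sample, not taken from mw.txt
sample = "<s>abc<srs/>def</s>"
encoded = encode_srs(sample)
```

A single character keeps string offsets simple when parsing compounds, since the marker then occupies exactly one position.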

Andhrabharati commented 1 year ago

> What is the 'other form'?

〈...〉 [if the left character is removed and the right character replaced with <srs/>, we get the cdsl form]

This makes for unobtrusive reading for me.
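
The conversion between the two forms could be sketched as below. This is an illustrative sketch only: it assumes the brackets are U+3008 and U+3009, that the text uses the IAST-style vowels from the summary table, and that Unicode-normalizing the text is acceptable. The names `bracket_to_cdsl` and `cdsl_to_bracket` and the sample string are hypothetical.

```python
import re
import unicodedata

LEFT, RIGHT = "\u3008", "\u3009"  # assumed code points of the angle brackets

# On NFD text: diphthong ai/au or a single vowel, with optional macron
# (U+0304) and optional acute/grave accent (U+0301 / U+0300).
VOWEL_SRS = re.compile(
    r"(a[iu][\u0300\u0301]?|[aeiou]\u0304?[\u0300\u0301]?)<srs/>"
)


def bracket_to_cdsl(text: str) -> str:
    """Remove the left bracket and replace the right bracket with <srs/>."""
    return text.replace(LEFT, "").replace(RIGHT, "<srs/>")


def cdsl_to_bracket(text: str) -> str:
    """Wrap the vowel preceding each <srs/> in brackets and drop the marker."""
    nfd = unicodedata.normalize("NFD", text)
    out = VOWEL_SRS.sub(lambda m: LEFT + m.group(1) + RIGHT, nfd)
    # Note: this re-normalizes the whole text to NFC, not just the vowels
    return unicodedata.normalize("NFC", out)


# Made-up sample, not taken from mw.txt
cdsl = "mah\u0101<srs/>tman"
bracketed = cdsl_to_bracket(cdsl)
```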

funderburkjim commented 1 year ago

Don't quite understand the other form. Example of OLD/NEW line?

Andhrabharati commented 1 year ago

[image attachment]

funderburkjim commented 1 year ago

Thanks -- now I get it.

Andhrabharati commented 1 year ago

This issue can now be closed.

funderburkjim commented 1 year ago

Changes at lines 37165, etc. made in cdsl mw.txt.

Accent changes at these two: 142165, 146704.