Closed Andhrabharati closed 1 year ago
I can't find any instances of Obs.1-a, Obs.1-b, Obs.1-c. These patterns are not found in current mw.txt (csl-orig/v02/mw/mw.txt).
I do find the 62 cases of Obs.1-d, and agree with the suggested change.
we have tools and means to identify constituent words in a composite word
Where are these tools?
The <srs/> markup was used by me in https://github.com/funderburkjim/MWderivations in relation to parsing compounds.
As you notice, <srs/> markup is invisible in all displays.
If <srs/> is dropped, that will be an information loss. But perhaps, somewhat similar to the dropping of <info lex=".."/>, this information loss is not deemed important in your work.
My opinion -- just drop <srs/> in your work (easiest solution).
Obs.1-d change made in cdsl mw. See commit link above.
@funderburkjim I had decided to keep the marking, but in another form(!!); this could be displayed in a different color to indicate the vowel-sandhi character(s), with or without accents.
This is the summary:
No accent:
ā<srs/> 〈ā〉 27893
ī<srs/> 〈ī〉 1018
ū<srs/> 〈ū〉 438
e<srs/> 〈e〉 3121
ai<srs/> 〈ai〉 372
o<srs/> 〈o〉 4584
au<srs/> 〈au〉 319
Acute accent:
ā́<srs/> 〈ā́〉 293
ī́<srs/> 〈ī́〉 26
ū́<srs/> 〈ū́〉 8
é<srs/> 〈é〉 60
aí<srs/> 〈aí〉 9
ó<srs/> 〈ó〉 55
aú<srs/> 〈aú〉 9
Grave accent:
ī̀<srs/> 〈ī̀〉 4
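A tally like the one above could be reproduced along these lines. This is only a sketch: the function name `count_srs_vowels` is hypothetical, and it assumes the text is in IAST with the macron stored before any accent mark, which may not match the actual working file.

```python
import re
import unicodedata
from collections import Counter

# Assumption: IAST vowels; after NFD decomposition the macron (U+0304)
# precedes the acute (U+0301) or grave (U+0300) accent.
SRS_VOWEL = re.compile(
    "(a[iu][\u0300\u0301]?"          # diphthongs ai / au, optionally accented
    "|[aiu]\u0304[\u0300\u0301]?"    # long vowels with macron, optionally accented
    "|[eo][\u0300\u0301]?)"          # e / o, optionally accented
    "<srs/>"
)

def count_srs_vowels(text):
    """Count each vowel (with its accent) that immediately precedes <srs/>."""
    decomposed = unicodedata.normalize("NFD", text)
    counts = Counter()
    for match in SRS_VOWEL.finditer(decomposed):
        # Recompose so that visually identical vowels share one key.
        counts[unicodedata.normalize("NFC", match.group(1))] += 1
    return counts
```

Calling `count_srs_vowels(open("mw.txt", encoding="utf-8").read())` would return a `Counter` keyed by vowel; the path is a placeholder.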
Also, I would be marking the left-overs (or missed ones) once I start proofing the full text.
I have also noticed 3 <srs/> instances outside the <s>...</s> string (lines 37165, 142165 and 146704), and one instance of a repeated <srs/><srs/> (line 724014).
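A check for these two kinds of stray marker might look like the sketch below. The function name `srs_problems` is hypothetical, and it assumes that <s>...</s> strings do not nest and do not span lines.

```python
import re

# Assumption: <s>...</s> spans never nest and never cross a line break.
S_SPAN = re.compile(r"<s>.*?</s>")

def srs_problems(line):
    """Return a list of problem labels for the <srs/> markers in one line."""
    problems = []
    if "<srs/><srs/>" in line:
        problems.append("repeated")
    # Remove every <s>...</s> span; any <srs/> left over sits outside one.
    outside = S_SPAN.sub("", line)
    if "<srs/>" in outside:
        problems.append("outside")
    return problems
```

Running this over each line of the file and printing the line numbers with a non-empty result would list the candidates for manual review.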
What is the 'other form'?
Note: in mwderivations, I found it convenient to replace <srs/> by a single character not appearing elsewhere in mw.txt; I chose the '@' character.
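The placeholder substitution described here can be sketched as follows. The helper names are hypothetical and this is not the code used in MWderivations; the explicit check that the placeholder is unused is an added safeguard.

```python
def srs_to_placeholder(text, placeholder="@"):
    """Replace every <srs/> with a single placeholder character."""
    # The substitution is only reversible if the placeholder is unused.
    if placeholder in text:
        raise ValueError(f"placeholder {placeholder!r} already occurs in text")
    return text.replace("<srs/>", placeholder)

def placeholder_to_srs(text, placeholder="@"):
    """Restore <srs/> from the placeholder character."""
    return text.replace(placeholder, "<srs/>")
```

A single-character stand-in keeps string offsets and regular expressions simple while compounds are being parsed, and the round trip restores the original markup.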
What is the 'other form'?
〈...〉
[if the left character is removed and the right character replaced with <srs/>, we get the cdsl form]
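The conversion from the bracketed form back to the cdsl form, as just described, can be sketched like this. The function name `bracket_to_cdsl` is hypothetical, and it assumes the two bracket characters are used for nothing else in the text.

```python
LEFT = "〈"   # opening bracket of the alternative marking
RIGHT = "〉"  # closing bracket of the alternative marking

def bracket_to_cdsl(text):
    """Drop the left bracket and turn the right bracket into <srs/>."""
    # Assumption: LEFT and RIGHT occur only around sandhi vowels.
    return text.replace(LEFT, "").replace(RIGHT, "<srs/>")
```

For instance, a string containing 〈ā〉 becomes the same string with ā<srs/> in its place.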
And this facilitates non-obtrusive reading for me.
Don't quite understand the other form. Example of OLD/NEW line?
Thanks -- now I get it.
Now, this issue is closable.
Lines 37165, etc. modified in cdsl mw.txt.
Accent changes at these two lines: 142165, 146704.
Obs.1-a There are 141 instances of <srs/> </s>, which could be changed to <srs/></s>, and then
Obs.1-b There are 62 instances of <srs/> <s>, which could be changed to <srs/><s>, and then
Obs.1-c There are 65 instances of </s> <srs/>, which could be changed to </s><srs/>, and then
Obs.1-d There are 62 instances of </s><srs/><s>, which could be changed to <srs/>.
Now we reach the inconsistencies:
case-A: Though a vast majority (~38K) of vowel-sandhi cases are marked in mw.txt, I observed that a few thousand more are not marked with <srs/>.
case-B: If the aim is to maintain similarity with the printed matter, all 4 types of vowel-sandhi should be marked; having just one type results in ambiguity at some places.
So,
option-1: do we remove the <srs/> markers altogether (easiest solution)? [In <ls> strings, this (as a circumflex) is already removed!!] [Now, we have tools and means to identify constituent words in a composite word, and having the vowel-sandhi marker(s) is not so essential.]
option-2: or, make them unambiguous by rendering in 4 types, and also fill the 'missed' places (which needs some effort)? [We get the text close to print, if some means of displaying these markers is thought of (in future); the single <srs/> marker is also invisible to the end-user as of now!]
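The four changes proposed in Obs.1-a to Obs.1-d amount to a fixed sequence of string replacements. A minimal sketch, with the hypothetical function name `normalize_srs` and the replacements applied in the order listed:

```python
# Each pair is (pattern, replacement), in the order of Obs.1-a to Obs.1-d.
SRS_FIXES = [
    ("<srs/> </s>", "<srs/></s>"),   # Obs.1-a
    ("<srs/> <s>", "<srs/><s>"),     # Obs.1-b
    ("</s> <srs/>", "</s><srs/>"),   # Obs.1-c
    ("</s><srs/><s>", "<srs/>"),     # Obs.1-d
]

def normalize_srs(line):
    """Apply the four whitespace and markup fixes to one line, in order."""
    for old, new in SRS_FIXES:
        line = line.replace(old, new)
    return line
```

The order matters: the first three steps remove the stray spaces so that the last step can merge the two adjacent <s> strings around the marker.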