Open funderburkjim opened 2 months ago
Just in case you are convinced to make new metalines for the 500+ cases with [ABCE]
e-tags, I'd like to remind you of another related point, as mentioned here, namely:
The body of many and/or groups was constructed by merging formerly separate entries. Some headwords were 'lost' -- e.g. in the following, the 'karvarI' is now not a (searchable) headword.
This is a flaw in a sense, and we have to bring back all such 'lost' headwords by adopting some 'logic'. In fact, we should extend the "group structure" to all these siblings also, i.e. to "inherit" from the primary (initial) group entry.
i.e., at some of these lexical-change entries, the grouping structure also has to be provided.
Work is here.
My local name for your file is temp_rev1_ab_orig.txt. For conversion to cdsl form, several changes in the orig file were required. This is the revised AB file: temp_rev1_ab_iast.zip
change_notes_orig_iast.txt contains my notes on these changes. File diff_orig_iast.txt is a plain diff of these changes.
The formal changes are mostly to enforce two internal consistencies:
1. same number of k2 alternates as {{L}} in LEND line
2. no duplicate L
@Andhrabharati I suggest your further changes should be based on this temp_rev1_ab_iast.txt file.
I have not yet thought about the 'body' changes (such as '[]') and removal of the <div n="P"/>.
Will consider those next.
Note that the [NEW] lines are ignored in the conversion. Similarly ignored are any lines between the <LEND> line at the end of one entry and the <L> line of the next entry.
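A minimal sketch of a reader that follows the two rules just stated. It is not the actual cdsl conversion program; the function name `iter_entries` is hypothetical, and it assumes that a [NEW] line is one that starts with the literal text `[NEW]`.

```python
import re

# A metaline starts with <L>, then the L-number, then <pc>.
META_RE = re.compile(r"^<L>(?P<L>[^<]+)<pc>")


def iter_entries(lines):
    """Yield (L, body_lines, lend_line) for each <L> ... <LEND> block.

    Assumption: lines starting with "[NEW]" are skipped everywhere, and any
    line between a <LEND> line and the next <L> metaline is skipped.
    """
    current_l, body = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("[NEW]"):
            continue
        if current_l is None:
            match = META_RE.match(line)
            if match:
                current_l, body = match.group("L"), []
            # Anything else found between entries is ignored.
            continue
        if line.startswith("<LEND>"):
            yield current_l, body, line
            current_l = None
        else:
            body.append(line)
```

Calling `list(iter_entries(open(path, encoding="utf-8")))` on a file in this form would give one tuple per entry, with stray inter-entry lines dropped.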
OK so far?
Work is in rev1a directory. Revised AB file: temp_rev1a_ab_iast.zip
About 25 changes, mostly L-ordering. L-nums must be not only unique, but also ordered sequentially. See change_notes_rev1a.txt for the rev1-rev1a changes.
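The uniqueness and ordering requirement can be checked mechanically. The sketch below is only an illustration: `check_l_numbers` is a hypothetical helper, and it assumes that L-numbers such as 95074.05 are compared as decimal numbers.

```python
from decimal import Decimal


def check_l_numbers(l_values):
    """Return (index, L, problem) tuples for duplicate or out-of-order L-numbers.

    Assumption: L-numbers are compared numerically as decimals,
    so 95074.05 sorts before 95074.1.
    """
    problems = []
    seen = set()
    highest = None
    for index, text in enumerate(l_values):
        value = Decimal(text)
        if value in seen:
            problems.append((index, text, "duplicate"))
        elif highest is not None and value < highest:
            problems.append((index, text, "out of order"))
        seen.add(value)
        highest = value if highest is None else max(highest, value)
    return problems
```

An empty result means every L-number is unique and the sequence never decreases.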
From AB's comment here:
The majority of differences (over 500) are due to splitting the text at a semicolon (whether at a lexical change or a change of meaning sense), which has been the practice in MW. In the recent corrections, Jim had adopted a different
form, which I thought should be changed to match the rest. I have marked these splits with '[]', leading to a change in the number of lines. So this should be the first point of adjustment, after which the 'real' changes could be identified in my alignment process.
524 matches for "<div n="P"/>" in buffer: temp_mw_17_ab1.txt
1127 matches for "\[\]" in buffer: temp_rev1a_cdsl.txt
Sample at first instance:
the <div n="P"/> form
<L>450<pc>3,1<k1>akzAgrakIla<k2>akzAgra-kIla<e>3
<s>akzA<srs/>gra-kIla</s> or <s>akzA<srs/>gra-kIlaka</s>, ¦ <lex>m.</lex> a linch-pin
<div n="P"/> the pin fastening the yoke to the pole.<info lex="m"/>
<LEND>
the [] form
<L>450<pc>3,1<k1>akzAgrakIla<k2>akzAgra-kIla<e>3
<s>akzA<srs/>gra-kIla</s> or <s>akzA<srs/>gra-kIlaka</s>, ¦ <lex>m.</lex> a linch-pin
[]
[]
¦ the pin fastening the yoke to the pole.<info lex="m"/>
<LEND>
Display using <div n="P"/>:
Display using []:
Clearly, the current [] display form is undesirable.
@Andhrabharati The div markup seems preferable, and is similar in function to other <div n="X"/> markup in mw (and other dictionaries). So why not use the div markup here? Somehow displays need to know to introduce a line break (or, as in this case, a paragraph break). I suppose the program which changes mw.txt to mw.xml could convert each [] to <br/> and the html would then look the same. But this seems inelegant, and semantically deficient, since this use of div essentially identifies a preceding ';' as a significant break.
I hope you will reconsider this point.
Are these special unicode characters a temporary experiment? If not, special-purpose characters like these need an explanation.
11 matches in 3 lines for "[⓶✿🠚✦]" in buffer: temp_rev1a_ab_iast.txt
8001:<s>aṭavika</s>, better <s>āṭavika</s>, ¦ <s>as</s>, <lex>m.</lex> 🠚⧫¦ a woodman, forester.
542451:⓶✿<s>mīḍhu</s>, ✿<s>mīḻhú</s>, 🠚✦ <lex>m.</lex> 🠚⧫¦ = <s>dhana</s>, <ls>Naigh. ii, 10</ls>.
552364:⓶✿<s>meṭhi</s>, 🠚✦ ⇨⧫¦ See <s>methi</s>.
Yes, they serve a special purpose in my revision version; it was my mistake to copy those lines from my file, instead of taking them from the earlier cdsl file.
For now, these are to be taken as
> 8001:<s>aṭavika</s>, better <s>āṭavika</s>, ¦ <s>as</s>, <lex>m.</lex> a woodman, forester.
> 542451:<s>mīḍhu</s>, <s>mīḻhú</s>, ¦ <lex>m.</lex> = <s>dhana</s>, <ls>Naigh. ii, 10</ls>.
> 552364:<s>meṭhi</s>, ¦ See <s>methi</s>.
The formal changes are mostly to enforce two internal consistencies:
1. same number of k2 alternates as {{L}} in LEND line
2. no duplicate L
I agree for these, and it was my error having violated these "rules" in my posted file.
[Rather, I was taking that Jim would "handle" these as appropriate, based on his post, namely
If you move such a grouped entry, then ideally you would change not only the L of the grouped entry, but also the L1,L2. This is because I prefer to use fixed L rather than dynamic L.
However, it is fine with me if you ignore this adjustment. Since L1 should always be same as L, I can detect that {{L1,L2}} needs revision and do the necessary when I convert ab2 form back to cdsl form.
As such, I did not pay much attention to these.]
These are definitely to be changed/corrected in my file.
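The rule quoted above, that L1 should always be the same as L, lends itself to an automatic check. The following is a sketch under stated assumptions: the helper names are hypothetical, and the group is assumed to appear as `{{L1, L2, ...}}` on the `<LEND>` line, as in the samples in this thread.

```python
import re

# The group list on a <LEND> line, e.g. <LEND>{{5875, 5875.1}}<info .../>
GROUP_RE = re.compile(r"\{\{([^}]*)\}\}")


def group_numbers(lend_line):
    """Return the L-numbers listed inside {{...}} on a <LEND> line (may be empty)."""
    match = GROUP_RE.search(lend_line)
    if not match:
        return []
    return [part.strip() for part in match.group(1).split(",")]


def group_needs_revision(entry_l, lend_line):
    """True when a group is present and its first number differs from the entry's L."""
    numbers = group_numbers(lend_line)
    return bool(numbers) and numbers[0] != entry_l
```

With the entry at L=5875, `group_needs_revision("5875", '<LEND>{{5875, 5875.1}}<info lex="m:f#A:n"/>')` is false; if the entry were moved to another L without updating the group, the check would flag it.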
Looked at the differences between the files "temp_rev1_ab_orig.txt" (my file, as renamed by Jim) and "temp_rev1_ab_iast.txt" (Jim's revised file), and except at two lines I have corrected all other lines mentioned in Jim's revision.
And these two are given below, with my comments--
(20409)
AB: <LEND>{{5875, 5875.1}}<info lex="m:f#A:n"/>
Jim: <LEND>{{5875, 5875.1, 5875.2, 5875.3}}<info lex="m:f#A:n"/>
AB Remark: This involves a lex-change entry (mfn. to n.), and as such should be split into two entries, one with {{5875, 5875.1}}<info lex="m:f#A:n"/> and the other with {{5875.2, 5875.3}}<info lex="n"/>
(22691)
AB: ¦ the nasal mark <s> ँ</s>
Jim: ¦ the nasal mark ँ
AB remark: I prefer having this tagged as a Sanskrit mark, and not as a plain symbol; there is yet another one identified in my further work at L-16252, namely <s>ᳲ</s>.
About 25 changes, mostly L-ordering. L-nums must be not only unique, but also ordered sequentially.
Changed all as suggested, but I have an issue at one entry, as given below--
<LEND>{{95074, 95074.05}}<info lex="m:f:n"/><info hui="a"/>
AB Remark: I have changed the numbering here to suit the first meaning alone, as suggested; but this grouping should be extended till L-95075, encompassing all the lex-change (mfn. -> m. -> f. -> n. -> ind.) and meaning sense-change entries (to be with <info lex="inh"/>) in-between. But how to mark this whole set of entries thus (as a group)?
Now, coming to the two other posts 1 and 2 by Jim, I can only say that he has grossly mistaken/misunderstood/misinterpreted my point and considered the [] as a new line-break, which is not at all what I meant.
What I was saying in my earlier posts at the other issues, and in the above one in this issue, is that these [][] lines are to be made into an entry terminator <LEND><info xxx> followed by a new entry metaline <L>...<e>[ABCE], as they have a lexical change or meaning change (or a cf. string) in between. Such changes were marked in this manner throughout the file, except at these 500+ places, which have the <div n="P"/> marking.
[I was just looking/craving for consistency of process/style within the file data.]
I think, I need not elaborate this point further.
I hope you will reconsider this point.
It is for you, dear Jim, to re-look at my earlier (and the above) posts, and make the mw data in "uniform form".
Please refer:
I think @Andhrabharati would agree with me that either display is a useful representation of the scan. The only minor flaw I see is that there is a missing semicolon.
When comparing the two displays, the only difference is that the pre-div form has separate IDs (L-numbers) for the two senses. I view this difference as immaterial.
I view the cdsl dictionaries as a kind of search engine. As I understand search engines (such as Elasticsearch, based on the Java Lucene project), there are two components to a search engine.
Why are there 'ABCE' in the <e> field of metaline? For example: <L>453<pc>3,1<k1>akzAgrakIlaka<k2>akzAgra-kIlaka<e>3A
MW has the '3' -- that refers to his 4-lines. Conceptually it could be part of the document.
But the 'A' is purely an artifice introduced by me at some early stage of the development. 'A' means that the headword and lexical category is same as for the first headword for the document. In other words, the 'document' was split up into 'sub-documents'.
In retrospect, I think this was a wrong choice.
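To make the distinction between the '3' and the 'A' concrete, here is a small sketch that splits a metaline into its fields and separates the numeric part of the `<e>` field from the letter suffix. The function name and the returned keys `e_number` and `e_suffix` are hypothetical, and the pattern assumes the field order `<L><pc><k1><k2><e>` seen in the samples in this thread.

```python
import re

# Field order as in the sample metalines: <L>, <pc>, <k1>, <k2>, <e>.
METALINE_RE = re.compile(
    r"^<L>(?P<L>[^<]*)<pc>(?P<pc>[^<]*)<k1>(?P<k1>[^<]*)"
    r"<k2>(?P<k2>[^<]*)<e>(?P<e>[^<]*)$"
)
# Assumption: the <e> value is digits followed by at most one capital letter.
E_RE = re.compile(r"^(?P<number>\d+)(?P<suffix>[A-Z]?)$")


def parse_metaline(line):
    """Return the metaline fields as a dict, plus e_number and e_suffix."""
    match = METALINE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a metaline: %r" % line)
    fields = match.groupdict()
    e_match = E_RE.match(fields["e"])
    fields["e_number"] = e_match.group("number") if e_match else None
    fields["e_suffix"] = e_match.group("suffix") if e_match else None
    return fields
```

For the metaline at L=453 this gives `e_number` of `"3"` and `e_suffix` of `"A"`; for a plain `<e>3` the suffix is the empty string.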
Why did I 'merge' the bodies of L=450 and L=451 of the 'and/or' form? I did this so that there would be ONLY ONE DOCUMENT for the alternate headword to refer to.
Why did I put the <div n="P"/> markup before the second sense? Simply to recognize that this was a second sense, as indicated by the semicolon in the scan. I similarly merged 'B'-sub-entries and others that were encountered among and/or group work.
The work done in issue 175 was focused only on the and-or groups. There are of course many A,B, sub-entries that have not been div-merged. e.g.
<L>441<pc>3,1<k1>akzapIqa<k2>akza—pIqa<e>3
<s>akza—pIqa</s> ¦ <lex>m.</lex> <bot>Chrysopogon Acicularis</bot>, <ls>Suśr.</ls><info lex="m"/>
<LEND>
<L>442<pc>3,1<k1>akzapIqA<k2>akza—pIqA<e>3B
<s>akza—pIqA</s> ¦ <lex>f.</lex> <ab>N.</ab> of a plant.<info lex="f"/>
<LEND>
These two could be merged as
<L>441<pc>3,1<k1>akzapIqa<k2>akza—pIqa<e>3
<s>akza—pIqa</s> ¦ <lex>m.</lex> <bot>Chrysopogon Acicularis</bot>, <ls>Suśr.</ls>;
<s>akza—pIqA</s> ¦ <lex>f.</lex> <ab>N.</ab> of a plant.<info lex="m"/><info lex="f"/>
<LEND>
<L>442<pc>3,1<k1>akzapIqA<k2>akza—pIqA<e>3
{{Lbody=441}}
<LEND>
So the same formalism can readily accommodate feminine forms. The 'phw' instances could also be handled by this same formalism.
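A sketch of how a display or search layer might follow the `{{Lbody=441}}` pointer in the merged example. This is illustrative only: `resolve_body` is a hypothetical helper, and it assumes entries are held in a dict mapping each L-number to its list of body lines, with a pointer entry consisting of exactly one `{{Lbody=...}}` line.

```python
import re

# A pointer body is a single line of the form {{Lbody=441}}.
LBODY_RE = re.compile(r"^\{\{Lbody=([^}]+)\}\}$")


def resolve_body(entries, l_number):
    """Return the body lines to show for l_number, following one Lbody pointer.

    entries maps an L-number (str) to its list of body lines.
    """
    body = entries[l_number]
    if len(body) == 1:
        match = LBODY_RE.match(body[0].strip())
        if match:
            return entries[match.group(1)]
    return body
```

With this arrangement both headwords stay searchable through their own metalines, while the text lives in one place only.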
I present the above comments as further 'defense' of the <div n="P"/> markup introduced in issue 175. I hope it furthers the dialogue with AB.
AB: <LEND>{{5875, 5875.1}}<info lex="m:f#A:n"/>
Jim: <LEND>{{5875, 5875.1, 5875.2, 5875.3}}<info lex="m:f#A:n"/>
AB Remark: This involves a lex-change entry (mfn. to n.), as such should be split as two entries, one with {{5875, 5875.1}}<info lex="m:f#A:n"/> and the other with {{5875.2, 5875.3}}<info lex="n"/>
(I think you meant lex="ind").
The 'document' encompasses both the mfn forms and the ind. forms.
We happen to have 4 alternate headwords that lead to this document.
Hence, <LEND>{{5875, 5875.1, 5875.2, 5875.3}}.
AB: ¦ the nasal mark <s> ँ</s>
Jim: ¦ the nasal mark ँ
AB remark: I prefer having this tagged as a sanskrit mark, and not as a plain symbol; there is yet another one identified in my further work at L-16252, namely <s>ᳲ</s>.
Refer Peter Scharf's website: https://sanskritlibrary.org/transcodeText.html
Re L=6522,6523 ँ anunAsika U+0901 Devanagari Sign Candrabindu
slp1 is ~, Roman (iast) is ~.
So in your iast file you could write <s>~</s>
Re L=16252 ᳲ arDavisarga U+1CF2 Vedic Sign Ardhavisarga
slp1 is Z and Roman (iast) is ẖ (Unicode U+0068 U+0331, i.e. h followed by a combining macron below).
So in your iast file you could write <s>ẖ</s>.
I can adjust cdsl transcoding files accordingly.
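The code points named above can be confirmed with Python's standard `unicodedata` module. The sketch below is not the cdsl transcoding code: the `SIGNS` table simply restates the slp1 and iast values given in this comment, and `same_text` is a hypothetical helper. It compares strings after NFD normalization because 'h' + U+0331 also exists as a single precomposed character (U+1E96), so an editor may silently store either form.

```python
import unicodedata

# Values restated from the comment above; not taken from the cdsl transcoder files.
SIGNS = {
    "\u0901": {"slp1": "~", "iast": "~"},        # anunasika (candrabindu)
    "\u1cf2": {"slp1": "Z", "iast": "h\u0331"},  # ardhavisarga
}


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


def same_text(a, b):
    """Compare two strings after NFD normalization, so that the precomposed
    form (U+1E96) and the sequence h + U+0331 count as equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

For instance, `describe("\u0901")` gives the official character name, and `same_text("\u1e96", "h\u0331")` is true even though the two strings differ byte for byte.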
<LEND>{{95074, 95074.05}}<info lex="m:f:n"/><info hui="a"/>
AB Remark: I have changed the numbering here to suit the first meaning alone, as suggested; but this grouping should be extended till L-95075, encompassing all the lex-change (mfn. -> m. -> f. -> n. -> ind.) and meaning sense-change entries (to be with <info lex="inh"/>) in-between. But how to mark this whole set of entries thus (as a group)?
For Jim's answer to 'how to mark?', compare the comment 'further defense of div'.
I think @Andhrabharati would agree with me that either display is a useful representation of the scan. The only minor flaw I see is that there is a missing semicolon.
When comparing the two displays, the only difference is that the pre-div form has separate IDs (L-numbers) for the two senses. I view this difference as immaterial.
I have no issues with this!
Why are there 'ABCE' in the <e> field of metaline? ... But the 'A' is purely an artifice introduced by me at some early stage of the development. ... In retrospect, I think this was a wrong choice.
With exactly the same view, I had completely got rid of these A-form metalines (that have <info lex="inh"/>) in my revision file, and marked them with a <div/> splitting at meaning sense-changes; in contrast, Jim has removed these only in those grouped entries that are marked so far (I have more than 4000 groups that are yet to be marked thus in the CDSL file!).
I similarly merged 'B'-sub-entries and others that were encountered among and/or group work.
I have chosen to make new entries for all such (by appropriately padding the terminations given in braces in print), thus getting more HWs that could be 'directly' searched for; in the method adopted by Jim, those would not be 'searchable'!
I have also considered grouping-inheritance (as appropriate; not all those lexical-siblings are with group-inheritance!) to these lexical-siblings (as I coined this term, and marked them with Ⓛ); these grouped-siblings are also 'out-of-searchability' in Jim's version.
I present the above comments as further 'defense' of the <div n="P"/> markup introduced in issue 175. I hope it furthers the dialogue with AB.
My only point above was to make the full file uniform and consistent in style; but Jim has opted to limit the process just to these grouped entries, which are a minuscule portion of the whole. So, instead of asking him to continue the same across the whole rest of the text, I thought it would be more convenient and easier for him to revert those 500+ cases to metaline form.
Anyways, I have no interest debating further on this; but I would continue with my marking (which I think is in a better form).
Jim can simply delete those [] lines and look at the differences in the rest of my file data and take action in the cdsl file (in a manner that he feels appropriate).
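The suggestion to delete the [] lines and then look at the remaining differences could be carried out roughly as follows. This is a sketch using Python's standard `difflib`; the function names are hypothetical, and both files are assumed to be already read into lists of lines without trailing newlines.

```python
import difflib


def drop_bracket_lines(lines):
    """Remove lines that consist only of the '[]' split marker."""
    return [line for line in lines if line.strip() != "[]"]


def diff_without_brackets(ab_lines, cdsl_lines):
    """Unified diff of the cdsl lines against the AB lines with '[]' lines removed."""
    cleaned = drop_bracket_lines(ab_lines)
    return list(
        difflib.unified_diff(cdsl_lines, cleaned, "cdsl", "ab", lineterm="")
    )
```

An empty list means the two files agree once the markers are gone. At the 500+ places discussed here a difference would still remain, since the AB line begins with `¦` where the cdsl line begins with `<div n="P"/>`.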
And this issue can be closed now, as no probable discussion/action is envisaged further.
I would continue with my marking (which I think is in a better form).
Sounds like a good idea. When you conclude your marking of the annexure placements, do you plan to upload it?
YES!
Need your stand on this, Jim!
Would you like to go with the marking of rev-entries in the GRA style (with original and revised strings together side-by-side), or in the simpler manner that you used for some of the MW rev-entries [just marking them as <info n="rev"/>, not even having the (p,c) reference as in the case of the sup-entries]?
Need your stand
example_rev_sup.txt has some proposed coding. ('current' refers to temp_rev1a_ab_iast.txt).
in the following, the 'karvarI' is now not a (searchable) headword.
karvarI has been added as an alternate headword; check the MW display on the Cologne server.
Love you both, @Andhrabharati and @funderburkjim
This issue continues the discussion begun in #176.
More specifically, the first objective is to examine @Andhrabharati 's version at this comment.