AB3 alternate form for mws

This issue continues the discussion begun in #176.

More specifically, the first objective is to examine @Andhrabharati's version at this comment.

@funderburkjim

Just in case you are convinced to make new metalines for the 500+ cases with [ABCE] e-tags, I'd like to remind you of another related point, as mentioned here, namely:

The body of many and/or groups was constructed by merging formerly separate entries.
Some headwords were 'lost' -- e.g. in the following, the 'karvarI' is now not a (searchable) headword.
This is a flaw in a sense, and we have to bring back all such 'lost' headwords by adopting some 'logic'. In fact, we should extend the "group structure" to all these siblings also, i.e. to "inherit" from the primary (initial) group entry.

i.e., at some of these lexical-change entries, the grouping structure also has to be provided.

formal review

Work is here.

My local name for your file is temp_rev1_ab_orig.txt. For conversion to cdsl form, several changes in the orig file were required. This is the revised AB file: temp_rev1_ab_iast.zip

change_notes_orig_iast.txt contains my notes on these changes. diff_orig_iast.txt is a plain diff of the same changes.

The formal changes are mostly to enforce two internal consistencies:

1. same number of k2 alternates as {{L}} in LEND line
2. no duplicate L
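
The two checks above could be scripted roughly as follows. This is only a sketch: it assumes the group L-numbers sit in {{...}} on the <LEND> line (as in the examples in this thread), and it ASSUMES that k2 alternates are comma-separated inside the <k2> field, which this thread does not confirm. All function names are hypothetical.

```python
import re
from collections import Counter

GROUP_RE = re.compile(r"^<LEND>\{\{([^}]*)\}\}")
K2_RE = re.compile(r"<k2>(.*?)<e>")


def lend_group(lend_line):
    """Return the L-numbers listed in {{...}} on a <LEND> line ([] if none)."""
    m = GROUP_RE.match(lend_line)
    if not m:
        return []
    return [x.strip() for x in m.group(1).split(",") if x.strip()]


def k2_alternates(metaline):
    """ASSUMPTION: alternates are comma-separated inside the <k2> field."""
    m = K2_RE.search(metaline)
    return [x.strip() for x in m.group(1).split(",")] if m else []


def counts_agree(metaline, lend_line):
    """Check 1: as many k2 alternates as L-numbers in the LEND group."""
    return len(k2_alternates(metaline)) == len(lend_group(lend_line))


def duplicate_ls(l_numbers):
    """Check 2: return every L-number that occurs more than once."""
    return sorted(l for l, n in Counter(l_numbers).items() if n > 1)
```

Keeping the L-numbers as strings avoids losing a trailing zero such as in 95074.10, should one ever occur.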

@Andhrabharati I suggest your further changes should be based on this temp_rev1_ab_iast.txt file.

I have not yet thought about the 'body' changes (such as '[]') and removal of the <div n="P"/>. Will consider those next.

Note that the [NEW] lines are ignored in the conversion. Similarly ignored are any lines between <LEND> line at end of one entry and <L> line of next entry.
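
As a rough illustration of that rule, a reader for the file could look like the sketch below. It assumes that [NEW] always marks a whole line starting with that tag, and that every entry is an <L> metaline followed by body lines and a <LEND> line; the function name is hypothetical and this is not the actual conversion program.

```python
def iter_entries(lines):
    """Yield (metaline, body_lines, lend_line) for each entry.

    Lines starting with [NEW], and any lines between a <LEND> line and
    the next <L> line, are skipped.
    """
    meta, body = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("[NEW]"):
            continue  # ASSUMPTION: [NEW] tags a whole line
        if meta is None:
            if line.startswith("<L>"):
                meta, body = line, []
            # anything else found between entries is ignored
        elif line.startswith("<LEND>"):
            yield meta, body, line
            meta = None
        else:
            body.append(line)
```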

OK so far?

rev1a

Work is in rev1a directory. Revised AB file: temp_rev1a_ab_iast.zip

About 25 changes, mostly L-ordering. L-nums must be not only unique, but also ordered sequentially. See change_notes_rev1a.txt for the rev1-rev1a changes.
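
A minimal sketch of the ordering check, assuming L-numbers compare as decimal numbers (so 95074.05 sorts before 95074.1); the function name is hypothetical.

```python
from decimal import Decimal


def ordering_problems(l_numbers):
    """Return (previous, current) pairs where current is not strictly greater.

    l_numbers is the list of L-number strings in file order. A repeated
    L-number is reported too, since the sequence must be strictly increasing.
    """
    problems = []
    prev_text = None
    for text in l_numbers:
        if prev_text is not None and Decimal(text) <= Decimal(prev_text):
            problems.append((prev_text, text))
        prev_text = text
    return problems
```

Decimal is used instead of float so that values like 95074.05 are compared exactly as written.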

comment on [][] vs. <div n="P"/>

From AB's comment here:

Refer The majority of differences (over 500) are due to splitting the text at semi-colon (whether as a lexical-change, or a meaning sense change), which has been the practice in the MW. In the recent corrections, Jim had adopted a different form, which I thought should be changed to match with the rest. I have marked these splits with '[]', leading to change in no. of lines. So this should be the first point of adjustment; after which the 'real' changes could be identified in my alignment process.

524 matches for "<div n="P"/>" in buffer: temp_mw_17_ab1.txt
1127 matches for "\[\]" in buffer: temp_rev1a_cdsl.txt

Sample at first instance:

the <div n="P"/> form
<L>450<pc>3,1<k1>akzAgrakIla<k2>akzAgra-kIla<e>3
<s>akzA<srs/>gra-kIla</s> or <s>akzA<srs/>gra-kIlaka</s>, ¦ <lex>m.</lex> a linch-pin
<div n="P"/> the pin fastening the yoke to the pole.<info lex="m"/>
<LEND>

the [] form
<L>450<pc>3,1<k1>akzAgrakIla<k2>akzAgra-kIla<e>3
<s>akzA<srs/>gra-kIla</s> or <s>akzA<srs/>gra-kIlaka</s>, ¦ <lex>m.</lex> a linch-pin
[]
[]
¦ the pin fastening the yoke to the pole.<info lex="m"/>
<LEND>
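
Going only by the single sample pair above, the [] form can be mapped back to the div form mechanically. The sketch below assumes that a run of [] lines always precedes a continuation line starting with ¦, which may not hold everywhere in the file; the function name is hypothetical.

```python
def brackets_to_div(body_lines):
    """Collapse runs of '[]' lines into a <div n="P"/> prefix on the next line."""
    out = []
    pending = False
    for line in body_lines:
        if line.strip() == "[]":
            pending = True
            continue
        if pending:
            # drop the leading broken bar of the continuation line, if present
            text = line[1:].lstrip() if line.startswith("¦") else line
            line = '<div n="P"/> ' + text
            pending = False
        out.append(line)
    return out
```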

Display using <div n="P"/>:

Display using []:

Clearly, the current [] display form is undesirable.

@Andhrabharati The div markup seems preferable. And is similar in function to other <div n="X"/> markup in mw (and other dictionaries). So why not use the div markup here? Somehow displays need to know to introduce a line break (or as in this case a paragraph break). I suppose the program which changes mw.txt to mw.xml could convert each [] to <br/> and the html would then look the same. But this seems inelegant, and semantically deficient, since this use of div essentially is identifying a preceding ';' as a significant break.

I hope you will reconsider this point.

random question

Are these special Unicode characters a temporary experiment? If not, special-purpose characters like these need an explanation.

11 matches in 3 lines for "[⓶✿🠚✦]" in buffer: temp_rev1a_ab_iast.txt
   8001:<s>aṭavika</s>, better <s>āṭavika</s>, ¦ <s>as</s>, <lex>m.</lex> 🠚⧫¦ a woodman, forester.
 542451:⓶✿<s>mīḍhu</s>, ✿<s>mīḻhú</s>, 🠚✦ <lex>m.</lex> 🠚⧫¦ = <s>dhana</s>, <ls>Naigh. ii, 10</ls>.
 552364:⓶✿<s>meṭhi</s>, 🠚✦ ⇨⧫¦ See <s>methi</s>.

Yes, they serve a special purpose in my revision version; it was my mistake to copy those lines from my file, instead of taking them from the earlier cdsl file.

For now, these are to be taken as

>    8001:<s>aṭavika</s>, better <s>āṭavika</s>, ¦ <s>as</s>, <lex>m.</lex> a woodman, forester.
>  542451:<s>mīḍhu</s>, <s>mīḻhú</s>, ¦ <lex>m.</lex> = <s>dhana</s>, <ls>Naigh. ii, 10</ls>.
>  552364:<s>meṭhi</s>, ¦ See <s>methi</s>.
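
The substitutions can be read off from these three before/after pairs. The sketch below is inferred from these three lines only and is not AB's own definition of the markers: it assumes 🠚✦ stands in for the ¦ separator, that 🠚⧫¦ and ⇨⧫¦ (with the following space) are simply dropped, and that ⓶ and ✿ are dropped as well.

```python
# Order matters: the three-character sequences are handled before single marks.
REPLACEMENTS = [
    ("🠚⧫¦ ", ""),   # inferred: dropped together with the following space
    ("⇨⧫¦ ", ""),   # inferred: dropped together with the following space
    ("🠚✦", "¦"),    # inferred: stands in for the broken-bar separator
    ("⓶", ""),
    ("✿", ""),
]


def strip_markers(line):
    """Return the line with AB's special-purpose markers removed."""
    for old, new in REPLACEMENTS:
        line = line.replace(old, new)
    return line
```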

The formal changes are mostly to enforce two internal consistencies:
1. same number of k2 alternates as {{L}} in LEND line

2. no duplicate L

I agree for these, and it was my error having violated these "rules" in my posted file.

[Rather, I was taking that Jim would "handle" these as appropriate, based on his post, namely

If you move such a grouped entry, then ideally you would change not only the L of the grouped entry, but also the L1,L2. This is because I prefer to use fixed L rather than dynamic L.

However, it is fine with me if you ignore this adjustment. Since L1 should always be same as L, I can detect that {{L1,L2}} needs revision and do the necessary when I convert ab2 form back to cdsl form.

As such, I did not pay much attention to these.]
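
The detection mentioned in that quote (L1 should always equal L) could be sketched as below, assuming the group list appears as {{L1, L2, ...}} on the <LEND> line as in the examples in this thread. The function name is hypothetical and this is not Jim's conversion code.

```python
import re

META_L = re.compile(r"^<L>([0-9.]+)<pc>")
GROUP = re.compile(r"\{\{([^}]*)\}\}")


def group_needs_revision(metaline, lend_line):
    """True if the first L-number of the group differs from the entry's own L."""
    m = META_L.match(metaline)
    g = GROUP.search(lend_line)
    if not m or not g:
        return False  # no group on this entry, nothing to compare
    first = g.group(1).split(",")[0].strip()
    return first != m.group(1)
```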

These are definitely to be changed/corrected in my file.

Looked at the differences between the files "temp_rev1_ab_orig.txt" (my file, as renamed by Jim) and "temp_rev1_ab_iast.txt" (Jim's revised file), and except at two lines I have corrected all other lines mentioned in Jim's revision.

And these two are given below, with my comments:

(20409)
AB: <LEND>{{5875, 5875.1}}<info lex="m:f#A:n"/>
Jim: <LEND>{{5875, 5875.1, 5875.2, 5875.3}}<info lex="m:f#A:n"/>
AB remark: This involves a lex-change entry (mfn. to n.), and as such should be split into two entries, one with {{5875, 5875.1}}<info lex="m:f#A:n"/> and the other with {{5875.2, 5875.3}}<info lex="n"/>.

(22691)
AB: ¦ the nasal mark <s> ँ</s>
Jim: ¦ the nasal mark ँ
AB remark: I prefer having this tagged as a Sanskrit mark, and not as a plain symbol; there is yet another one identified in my further work at L-16252, namely <s>ᳲ</s>.

About 25 changes, mostly L-ordering. L-nums must be not only unique, but also ordered sequentially.

Changed all as suggested, but I have an issue at one entry, as given below--

<LEND>{{95074, 95074.05}}<info lex="m:f:n"/><info hui="a"/>

AB Remark: I have changed the numbering here to suit the first meaning alone, as suggested; but this grouping should be extended till L-95075, encompassing all the lex-change (mfn. -> m. -> f. -> n. -> ind.) and meaning sense-change entries (to be with <info lex ="inh"/>) in-between. But how to mark this whole set of entries thus (as a group)?

Now, coming to the two other posts 1 and 2 by Jim, I can only say that he has grossly mistaken/misunderstood/misinterpreted my point and considered the [] as a new line-break, which is not at all what I meant.

What I was saying in my earlier posts at the other issues, and in the one above in this issue, is that these [][] lines are to be turned into an entry terminator <LEND><info xxx> followed by a new entry metaline <L>...<e>[ABCE], since there is a lexical change or meaning change (or a cf. string) in between. Such breaks are marked in this manner throughout the file, except at these 500+ places, which have the <div n="P"/> marking. [I was just looking/craving for consistency of process/style within the file data.]

I think, I need not elaborate this point further.

I hope you will reconsider this point.

It is for you, dear Jim, to re-look at my earlier (and the above) posts, and make the mw data in "uniform form".

further defense of div

Please refer:

example_akzAgrakIla_group.txt shows
- scanned image of the entry
- the cdsl display 'pre-div'
- the cdsl display using the div markup
akshagrakila.png
- the markup of this group prior to the introduction of div (with 'and/or')
- the markup after introduction of div (but before Lbody consolidation)
- the current cdsl markup, with div and Lbody consolidation

I think @Andhrabharati would agree with me that either display is a useful representation of the scan. The only minor flaw I see is that there is a missing semicolon.

When comparing the two displays, the only difference is that the pre-div form has separate IDs (L-numbers) for the two senses. I view this difference as immaterial.

I view the cdsl dictionaries as a kind of search engine. As I understand search engines (such as Elasticsearch, based on the Java Lucene project), there are two components to a search engine.

- the document. In our mw.txt context, a document is the material between the metaline and the LEND marker. The L-number is a 'document id'.
- the index, which is a separate data structure used to select a document (or documents) based on a search query. In our mw.txt context, the index is constructed from the k1 field of the metaline.
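
To make the document/index picture concrete, here is a toy index built from the k1 field of metalines. It is only an illustration of the idea, not how the cdsl displays are actually implemented, and the function name is hypothetical.

```python
import re
from collections import defaultdict

META = re.compile(r"^<L>([^<]+)<pc>[^<]*<k1>([^<]+)<k2>")


def build_index(metalines):
    """Map each k1 headword to the list of L-numbers (document ids) it selects."""
    index = defaultdict(list)
    for line in metalines:
        m = META.match(line)
        if m:
            l_num, k1 = m.groups()
            index[k1].append(l_num)
    return dict(index)
```

A lookup is then just index.get(headword, []), and each returned L-number identifies one document.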

Why are there 'ABCE' in the <e> field of metaline? e.g. 3A in
<L>453<pc>3,1<k1>akzAgrakIlaka<k2>akzAgra-kIlaka<e>3A

MW has the '3' -- that refers to his 4-lines. Conceptually it could be part of the document.

But the 'A' is purely an artifice introduced by me at some early stage of the development. 'A' means that the headword and lexical category is same as for the first headword for the document. In other words, the 'document' was split up into 'sub-documents'.
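
For readers unfamiliar with the metaline, a small parser that separates the numeric part of the <e> field from the added letter might look like this. The field order is taken from the examples in this thread; the key names "number" and "letter" and the function name are hypothetical.

```python
import re

META = re.compile(
    r"^<L>(?P<L>[^<]+)<pc>(?P<pc>[^<]+)<k1>(?P<k1>[^<]+)"
    r"<k2>(?P<k2>[^<]+)<e>(?P<e>.+)$"
)
E_RE = re.compile(r"^(\d+)([A-Z]?)$")


def parse_metaline(line):
    """Split a metaline into its fields, plus the parts of the <e> field."""
    m = META.match(line)
    if not m:
        raise ValueError("not a metaline: " + line)
    fields = m.groupdict()
    e = E_RE.match(fields["e"])
    # "number" is MW's own code (e.g. 3); "letter" is the added suffix (e.g. A)
    fields["number"] = e.group(1) if e else None
    fields["letter"] = e.group(2) if e else None
    return fields
```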

In retrospect, I think this was a wrong choice.

Why did I 'merge' the bodies of L=450 and L=451 of the 'and/or' form? I did this so that there would be ONLY ONE DOCUMENT for the alternate headword to refer to.

Why did I put the <div n="P"/> markup before the second sense? Simply to recognize that this was a second sense, as indicated by the semicolon in the scan.
In this example, the semicolon (at the end of line 1) is missing -- this is a correctable omission.

I similarly merged 'B'-sub-entries and others that were encountered among and/or group work.

The work done in issue 175 was focused only on the and-or groups. There are of course many A, B sub-entries that have not been div-merged. e.g.

<L>441<pc>3,1<k1>akzapIqa<k2>akza—pIqa<e>3
<s>akza—pIqa</s> ¦ <lex>m.</lex> <bot>Chrysopogon Acicularis</bot>, <ls>Suśr.</ls><info lex="m"/>
<LEND>
<L>442<pc>3,1<k1>akzapIqA<k2>akza—pIqA<e>3B
<s>akza—pIqA</s> ¦ <lex>f.</lex> <ab>N.</ab> of a plant.<info lex="f"/>
<LEND>

These two could be merged as

<L>441<pc>3,1<k1>akzapIqa<k2>akza—pIqa<e>3
<s>akza—pIqa</s> ¦ <lex>m.</lex> <bot>Chrysopogon Acicularis</bot>, <ls>Suśr.</ls>;
<s>akza—pIqA</s> ¦ <lex>f.</lex> <ab>N.</ab> of a plant.<info lex="m"/><info lex="f"/>
<LEND>
<L>442<pc>3,1<k1>akzapIqA<k2>akza—pIqA<e>3
{{Lbody=441}}
<LEND>

So the same formalism can readily accommodate feminine forms. The 'phw' instances could also be handled by this same formalism.
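
How a display might follow the {{Lbody=441}} pointer can be sketched as below. It assumes that a pointing entry has exactly one body line of the form {{Lbody=N}}, as in the example above; the function name and the dict layout are hypothetical.

```python
import re

LBODY = re.compile(r"^\{\{Lbody=([0-9.]+)\}\}$")


def resolve_body(l_num, bodies):
    """Return the body lines for l_num, following a {{Lbody=N}} pointer if present.

    bodies maps an L-number string to the list of body lines of that entry.
    """
    body = bodies[l_num]
    if len(body) == 1:
        m = LBODY.match(body[0].strip())
        if m:
            return bodies[m.group(1)]
    return body
```

With this, both 441 and 442 lead to the same single document, which is the point of the merge.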

I present the above comments as further 'defense' of the <div n="P"/> markup introduced in issue175. I hope it furthers the dialogue with AB.

funderburkjim commented 2 months ago

AB: <LEND>{{5875, 5875.1}}<info lex="m:f#A:n"/>
Jim: <LEND>{{5875, 5875.1, 5875.2, 5875.3}}<info lex="m:f#A:n"/>
AB Remark: This involves a lex-change entry (mfn. to n.), as such should be split as two entries, one with {{5875, 5875.1}}<info lex="m:f#A:n"/> and the other with {{5875.2, 5875.3}}<info lex="n"/>

(I think you meant lex="ind").

The 'document' encompasses both the mfn forms and the ind. forms. We happen to have 4 alternate headwords that lead to this document. Hence, <LEND>{{5875, 5875.1, 5875.2, 5875.3}}.

funderburkjim commented 2 months ago

AB: ¦ the nasal mark <s> ँ</s>
Jim: ¦ the nasal mark ँ
AB remark: I prefer having this tagged as a sanskrit mark, and not as a plain symbol; there is yet another one identified in my further work at L-16252, namely <s>ᳲ</s>.

Refer Peter Scharf's website: https://sanskritlibrary.org/transcodeText.html

Re L=6522,6523: ँ (anunAsika), U+0901 Devanagari Sign Candrabindu. slp1 is ~, Roman (iast) is ~.
So in your iast file you could write <s>~</s>

Re L=16252: ᳲ (arDavisarga), U+1CF2 Vedic Sign Ardhavisarga. slp1 is Z and Roman (iast) is ẖ (Unicode U+0068 U+0331, i.e. h with combining macron below). So in your iast file you could write <s>ẖ</s>.

I can adjust cdsl transcoding files accordingly.
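
The code points quoted above can be double-checked with Python's standard unicodedata module. The small table below only restates the slp1 and iast values given in this comment; it is not the actual cdsl transcoding file, and the names SIGNS and describe are hypothetical.

```python
import unicodedata

# Values as stated in the comment above; iast for ardhavisarga is h + U+0331.
SIGNS = {
    "\u0901": {"slp1": "~", "iast": "~"},
    "\u1CF2": {"slp1": "Z", "iast": "h\u0331"},
}


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


if __name__ == "__main__":
    for ch, forms in SIGNS.items():
        print(describe(ch), "slp1=" + forms["slp1"], "iast=" + forms["iast"])
```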

funderburkjim commented 2 months ago

<LEND>{{95074, 95074.05}}<info lex="m:f:n"/><info hui="a"/>
AB Remark: I have changed the numbering here to suit the first meaning alone, as suggested; but this grouping should be extended till L-95075, encompassing all the lex-change (mfn. -> m. -> f. -> n. -> ind.) and meaning sense-change entries (to be with <info lex ="inh"/>) in-between. But how to mark this whole set of entries thus (as a group)?

For Jim's answer to 'how to mark?', compare

example_drdha_1.txt has the current markup in temp_rev1a_ab_iast.txt.
example_drdha_2.txt Jim's rewrite
drdha.png for the corresponding scan

Andhrabharati commented 2 months ago

further defense of div

I think @Andhrabharati would agree with me that either display is a useful representation of the scan. The only minor flaw I see is that there is a missing semicolon.

When comparing the two displays, the only difference is that the pre-div form has separate IDs (L-numbers) for the two senses. I view this difference as immaterial.

I have no issues with this!

Why are there 'ABCE' in the <e> field of metaline? ... ... But the 'A' is purely an artifice introduced by me at some early stage of the development. ... ... In retrospect, I think this was a wrong choice.

With exactly the same view, I had completely got rid of these A-form metalines (that have <info lex n="inh"/>) in my revision file, and marked them with a <div/> splitting at meaning sense-changes; in contrast, Jim has removed these in just those grouped entries that are marked so far (I have more than 4000 groups that are yet to be marked thus in the CDSL file!).

I similarly merged 'B'-sub-entries and others that were encountered among and/or group work.

I have chosen to make new entries for all such (by appropriately padding the terminations given in braces in print), thus getting more HWs that could be 'directly' searched for; in the method adopted by Jim, those would not be 'searchable'!

I have also considered grouping-inheritance (as appropriate; not all those lexical-siblings are with group-inheritance!) to these lexical-siblings (as I coined this term, and marked them with Ⓛ); these grouped-siblings are also 'out-of-searchability' in Jim's version.

I present the above comments as further 'defense' of the <div n="P"/> markup introduced in issue175. I hope it furthers the dialogue with AB.

My only point above was to make the full file uniform and consistent in style; but Jim has opted to limit the process to just these grouped entries, which are a minuscule portion of the whole. So, instead of asking him to extend the same to the whole rest of the text, I thought it would be more convenient and easier for him to revert those 500+ cases to metaline form.

Anyways, I have no interest debating further on this; but I would continue with my marking (which I think is in a better form).

Jim can simply delete those [] lines and look at the differences in the rest of my file data and take action in the cdsl file (in a manner that he feels appropriate).

Andhrabharati commented 2 months ago

And this issue can be closed now, as no probable discussion/action is envisaged further.

funderburkjim commented 2 months ago

I would continue with my marking (which I think is in a better form).

Sounds like a good idea. When you conclude your marking of the annexure placements, do you plan to upload it ?

Andhrabharati commented 2 months ago

YES!

Andhrabharati commented 2 months ago

Need your stand on this, Jim!

Would you like to go with the marking of rev-entries in the GRA style (with original and revised strings together side-by-side), or in the simpler manner that you used for some of the MW rev-entries [just marking as <info n="rev"/>, not even having the (p,c) reference as in the case of the sup-entries]?

funderburkjim commented 2 months ago

Need your stand

example_rev_sup.txt has some proposed coding. ('current' refers to temp_rev1a_ab_iast.txt).

- revs should have old and new text and the rev pc
- sups with brand-new headwords can be left unchanged
- sups for 'main' headwords (additional senses) need new 'merging', with additional markup for the pc of the sup.
- supplement 'deletes' -- I don't know how to find these -- I suspect they could be handled with the markup of the examples.

funderburkjim commented 2 months ago

in the following, the 'karvarI' is now not a (searchable) headword.

karvarI has been added as an alternate headword, - check MW display on cologne server

gasyoun commented 2 months ago

Love you both, @Andhrabharati and @funderburkjim
