Hierarchy placement of supplement entries

funderburkjim commented 3 months ago

6395 matches for "n="sup"" in buffer: mw.txt
248 matches for "n="rev"" in buffer: mw.txt
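
For anyone wanting to reproduce these counts outside an editor, here is a minimal sketch. It counts matching lines, so it could differ slightly from an editor's match count if one line carried the same attribute twice. The helper name `count_info_tags` is only illustrative, not part of any existing program.

```python
import re
from collections import Counter

# Matches the n="sup" / n="rev" attribute quoted in the counts above.
TAG_RE = re.compile(r'n="(sup|rev)"')

def count_info_tags(lines):
    """Count lines that contain n="sup" or n="rev" (at most once per line per tag)."""
    counts = Counter()
    for line in lines:
        for tag in set(TAG_RE.findall(line)):
            counts[tag] += 1
    return counts
```

Assuming mw.txt is a UTF-8 file in the working directory, `count_info_tags(open("mw.txt", encoding="utf-8"))` should give the two totals.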

This issue is started since the referring issue is closed, and this idea deserves attention some time.

@Andhrabharati It seems conceivable that some rule could be developed to identify sup/rev entries that are properly placed.

Andhrabharati commented 3 months ago

Yes, @funderburkjim; we sure can find some "shortcuts" to identify the errors in hierarchy placement of supplement entries.

But I was thinking of doing a much more thorough job, looking at all the suppl. matter (in print) once, as I had identified a few cases of missing entries [during my revision process], apart from (a) placement errors and (b) wrong taggings ("sup" instead of "rev", missing the intent of the matter lying in the text).

Probably this could be done in a month or two, if taken up.

Andhrabharati commented 3 months ago

BTW, I am glad that you have now opened up two issues, referring to my other two points posted at csl-orig 1637.

Isn't the 3rd point that I had mentioned 'appealing enough' to you?

Probably we can request @maltenth as well, for his opinion on this.

funderburkjim commented 3 months ago

I presume the 3rd point is

"prepare" another annexure to MW, mainly bringing-in the "missed" entries from PWG & pwk. [Reason: We are now identifying MW (and his team) having missed some entries at times.]

I'm unsure what this might entail. Why don't you open an mws issue and flesh out this idea with some examples to illustrate

a missed entry from pwk, and a similar missed entry from pwg
- What is the 'definition' of a missed entry?
- How many missed entries are there?
How to 'bring in' to mw.txt a missed entry

funderburkjim commented 2 months ago

@Andhrabharati comment in #173 can also be considered in this issue:

BTW, I have seen that quite a few "sup" entries are actually supposed to be "rev" entries. Also some entries are altogether "left" unmarked with either tag.

I am unsure whether this part of AB's comment has any relevance to MW, or is aimed only at Grassmann.

Yes, showing Ⓢ (for suppl. addition) and Ⓡ (for suppl. revision) [and Ⓓ (for suppl. deletion) or Ⓔ (for suppl. erase)] entries is very appealing. I would also suggest showing the corresp. Ⓡ/Ⓓ/Ⓔ entry along with the "modified" main entry as done in case of GRA; this would clearly show the exact correction "string" in the text.

Andhrabharati commented 2 months ago

@Andhrabharati comment in #173 can also be considered in this issue:

BTW, I have seen that quite a few "sup" entries are actually supposed to be "rev" entries. Also some entries are altogether "left" unmarked with either tag.

I am unsure whether this part of AB's comment has any relevance to MW, or is aimed only at Grassmann.

Yes, showing Ⓢ (for suppl. addition) and Ⓡ (for suppl. revision) [and Ⓓ (for suppl. deletion) or Ⓔ (for suppl. erase)] entries is very appealing. I would also suggest showing the corresp. Ⓡ/Ⓓ/Ⓔ entry along with the "modified" main entry as done in case of GRA; this would clearly show the exact correction "string" in the text.

My posts (as referred above) were very much about MW suppl. data; GRA has been used only as a ref. as to how the tags were employed in the integration process.

For now, I would just like to note here that there are 18 erase entries in the MW suppl., which call for:

erasing the accent or homonym which could be treated (and indeed were marked "rev") as revisions;
erasing some text portion and adding another text portion, which is the way revisions are to be done; and
complete discarding of the meaning portions [4 such entries are present in the whole suppl. pages, that were rendered as individual entries in the earlier data, and got deleted altogether when AB-Jim (as a team) took up the suppl. integration work in 2021.]

I traced the earlier (2021) work a few days back and noted that it was halted before completion, for a rather 'stray' reason: Jim had asked AB to stop, as he wanted to do some other [unspecified] work on MW data, and AB was waiting for Jim's signal to resume. It turned out that Jim had forgotten the point, and AB did not post any reminder either (but moved on to various other repositories of CDSL).

funderburkjim commented 2 months ago

L_change_01

This is my first contribution to the hierarchy placement focus of this issue. My supporting work is here.

By some means (see readme of 171 for details), I developed 149 such repositioning cases, and then modified mw.txt accordingly. Three files:

L_change_01.txt contains the mapping from OLD L (entry id) to NEW L. 149 cases identified.
L_context_01_old.txt shows context of the OLD L in the old mw
L_context_01_new.txt shows context of the NEW L in the new mw.

The 'context' is essentially a synopsis of the list-pane in the list display. By examining the old context, one can see why the repositioning was needed. The new context can help evaluate the repositioning.

My opinion is that repositioning is quite tricky, due to such things as

mw sorting oddities (esp. with respect to anusvAra)
alternate-headword placement.
some duplication of 'sup' with 'main' entries.
H-level (H1/2, H3, H4)

Note that only L's with a 'sup' are repositioned. We take as fixed the L's for body of mw.

The context program (L_context.py) is quite general, as is the reordering program (L_order.py). A good way to communicate additional reordering changes is via files like L_change_01.txt (list of old new L).
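
To make the exchange format concrete, here is a minimal sketch of how an old/new L list could be applied. It is not L_order.py itself. It assumes one whitespace-separated `OLD NEW` pair per line and that L values sort as decimal numbers; the function names are illustrative only.

```python
from decimal import Decimal

def parse_changes(text):
    """Read 'OLD NEW' pairs, one per line (assumed layout); blank lines are skipped."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        old, new = parts[:2]
        mapping[old] = new
    return mapping

def reorder(entries, mapping):
    """entries: list of (L, k1) pairs. Renumber via mapping, then sort by numeric L."""
    renumbered = [(mapping.get(L, L), k1) for L, k1 in entries]
    # Decimal keeps values such as 666.1 exact and orders them between 666 and 667.
    return sorted(renumbered, key=lambda e: Decimal(e[0]))
```

Only the L's named in the change file move; every other entry keeps its L, which matches the rule that the body L's are taken as fixed.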

I suggest that AB (and Scott if he's still interested) develop further L_change.txt files. One systematic way might be to work through the supplement a page (or column) at a time.

I think that this repositioning work can and should be completed before undertaking the additional changes that AB mentions (such as the totally deleted records!)

Andhrabharati commented 2 months ago

Firstly, I am glad that Jim started with the oddities in MW entry ordering; I have identified a couple of other points as well.

And I wanted to do a thorough 'integration' job starting 'ab initio', and have made the MW suppl. file afresh & noticed that the earlier CDSL files [ADD2b.xml, ADD3.xml & 6602-entries-from-supplements-MW.txt files etc.] have 'missed' some good info leading to addl. entries.

Then, I had framed a plan in mind, but haven't started the work yet.

funderburkjim commented 2 months ago

Material from server

legacy/readme.org Notes by Jim from 2012 regarding the integration of mw supplement into body of mw.
- not as useful as one might hope.
ADD3.xml the supplement records of that initial integration work. 7145 such records. According to the readme.org, this is what the 2012 integration started with.
- this list of entries corresponds to the printed supplement.

gasyoun commented 2 months ago

It's great to see @Andhrabharati working so closely with @funderburkjim; good that this need is again getting the attention it deserves.

funderburkjim commented 2 months ago

Phase 2 : legacy1 comparison preparation

The objective here is to prepare a 'diff' between the 'rev/sup' entries of mw.txt and the add3 legacy digitization of the supplement.

Steps are described in issue171/readme, starting at End of phase 1 work on issue171. The programs and data files are in the legacy1 sub-directory.

add3a.txt a conversion from the xml form of legacy/ADD3.xml to the 'metaline' form used in mw.txt
add3b.txt corrects many k1 spelling errors of add3a. There are also a few related changes to the previous version of mw.txt
- add3b_changes.txt documents the changes.
add3c.txt additional changes to the legacy version for purpose of comparison with rev/sup entries in mw.
- add3c_changes.txt documents the changes

changes_mw_02.txt documents the changes made to previous version of mw

At this point, add3c.txt and (local) temp_mw_02.txt are comparable:

There are 6642 entries:
- in add3c comparisons, we exclude (in <ab>comp.</ab>) entries
  - but these excluded entries ARE retained in add3c.
- temp_mw_02 also has 6642 entries (<info n="rev" or <info n="sup").

It is now possible to do a meaningful 'diff' between the 6642 revsup entries of temp_mw_02.txt and add3c.txt. see match3_3.txt.

Each line has 2 parts separated by '::'.

first part is for temp_mw_02.txt and shows
- k1 from metaline
- L from metaline
- 'e' from metaline (for example, a value of '3' is shown as 'H3' in displays)
- S (for <info n="sup") or R (for <info n="rev")
- 'pc' from metaline
The second part has similar info from add3c.txt.
- but there is no 'S/R' field, as this distinction is not present in add3c.
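
As a rough illustration of where these fields come from, here is a sketch that pulls them out of one entry. It assumes an entry is a metaline followed by its body and `<info .../>` lines; this is not the code in match3.py, and the names are illustrative.

```python
import re

# Metaline layout as seen in mw.txt: <L>..<pc>..<k1>..<k2>..<e>..
METALINE_RE = re.compile(
    r"<L>(?P<L>[^<]*)<pc>(?P<pc>[^<]*)<k1>(?P<k1>[^<]*)<k2>(?P<k2>[^<]*)<e>(?P<e>[^<]*)"
)
INFO_RE = re.compile(r'<info n="(sup|rev)"')

def summarize(entry_lines):
    """Return (k1, L, e, flag, pc); flag is 'S', 'R', or '' if neither tag is present."""
    m = METALINE_RE.match(entry_lines[0])
    if m is None:
        raise ValueError("first line is not a metaline")
    flag = ""
    for line in entry_lines[1:]:
        hit = INFO_RE.search(line)
        if hit:
            flag = "S" if hit.group(1) == "sup" else "R"
            break
    return (m["k1"], m["L"], m["e"], flag, m["pc"])
```

For add3c.txt the same function would simply return an empty flag, since the sup/rev distinction is absent there.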

845 lines have 'xxx' for first mw part. These were selected by the 'diff' process (using python difflib.Differ() - see match3.py for details). Similarly 845 lines have 'yyy' for second add3c part.

Note that (7487 - 845) = 6642 (the number of rev/sup entries).
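
A simplified sketch of this kind of alignment, for anyone who wants to experiment: it diffs two lists of keys with `difflib.Differ` and writes 'xxx' or 'yyy' where one side has no counterpart. It is not match3.py; comparing on a single key string per entry is an assumption made here to keep the example short.

```python
from difflib import Differ

def align(mw_keys, add_keys):
    """Return 'mw :: add3c' lines; 'xxx' marks a missing mw part, 'yyy' a missing add3c part."""
    rows = []
    for line in Differ().compare(mw_keys, add_keys):
        tag, key = line[:2], line[2:]
        if tag == "  ":      # present in both sequences at this point
            rows.append(f"{key} :: {key}")
        elif tag == "- ":    # only in the mw sequence here
            rows.append(f"{key} :: yyy")
        elif tag == "+ ":    # only in the add3c sequence here
            rows.append(f"xxx :: {key}")
        # "? " lines are intraline hints from Differ and are skipped
    return rows
```

Every mw key appears exactly once in a first part, so (total lines) minus (lines with 'xxx') equals the number of mw entries, which is the arithmetic noted above.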

funderburkjim commented 2 months ago

next ?

This match3_3 was what I was aiming for as a way to identify sequence errors in the mw rev/sup entries. The obvious next activity seems to be to examine each of those 845 mw xxx (and corresponding add3c yyy) with an eye to moving the mw where needed. For two examples:

akavara -- Current mw placement is 'alphabetical' between akavaca and akavi. Should it be back after akabbara, since 'akabara, akabbara, akavara' are alternate spellings? I'm not sure one way or the other.
akiMcid -- The current mw placement is among the k's
- kiMcitkara, akiMcitkara, akiMcitkara, **akiMcid**, kiMcitpare .
- Here it seems clear that akiMcid in mw should be moved among the a's.
- akAsAra, **akiMcid**, akiYcana, ...

gasyoun commented 2 months ago

akavara -- Current mw placement is 'alphabetical' between akavaca and akavi. Should it be back after akabbara, since 'akabara, akabbara, akavara' are alternate spellings? I'm not sure one way or the other.

@drdhaval2785 @martingluckman @Andhrabharati what is your stand?

Andhrabharati commented 2 months ago

I would suggest keeping the "grouped" entries' elements together close-by, preferably differing in L-numbering only in the last decimal places, if separated (as in MW at present); but ultimately wish they should be "kept" as a single entry as done in GRA.

And it is not out of place to mention that there are more such entries (few scores) from the main pages as well [in addition to some (again few scores) in the annexure pages], that got separated far-off, taking the alphabetical order as a criterion.

MW has employed a spl. ordering in the whole text, which is not alphabetical throughout.

gasyoun commented 2 months ago

not alphabetical throughout

Indeed, so we should not stick to it that strictly?

Andhrabharati commented 2 months ago

not alphabetical throughout

Indeed, so we should not stick to it that strictly?

It is more appropriate to "understand" his process and then "follow" the same.

funderburkjim commented 2 months ago

phase 3 work

I had envisioned that the print ordering of supplement entries could be the same as the 'body' ordering. However, this seems not possible. I did find a small number (~50) of 'obvious' repositionings. For instance, most of the 'eka' compounds of the supplement required repositioning. There still remain nearly 800 ordering differences (consult 'xxx' and 'yyy' in match3_3b.txt).

The first yyy instance, akzaprapAtana at L = 666.1, seems correctly positioned in the body (albeit for a quirky reason).

Probably some of these 800 need repositioning, but I have developed no decision process thus far to distinguish between those ordering differences (between body and supplement) which

can be resolved by repositioning
require no repositioning (like akzaprapAtana).

Thus I'm calling this the end of my work in this issue.

I want to mention a couple of oddities regarding the supplement, and then close this issue.

funderburkjim commented 2 months ago

superfluous supplement entries

The poster-child here is headword akzoDuka

MW body, page 4

MW supplement page 1308

THESE ARE THE SAME. Why is there this supplement? How many more are like this? For these, should we remove the duplicate supplement (a print change) or keep it so future users will ponder?

funderburkjim commented 2 months ago

supplement text (\X\)

There are about 100 supplement entries (add3d.txt) whose text is only a parenthetical literary source; an example is seen in akzetrajYa above.

The main body (p. 4 of print):

What is the interpretation of this?

The current cdsl markup treats this as a revision.

<L>710<pc>4,1<k1>akzetrajYa<k2>a-kzetra—jYa<e>3
<s>a-kzetra—jYa</s> 
[<ls>ŚBr.</ls>; <ls>Pāṇ. vii, 3, 30</ls>] or    <<< NOTE ŚBr. placement
<s>a/-kzetra—vid</s> 
[<s>a/kz°</s>, <ls>RV. v, 40, 5</ls> & <ls n="RV.">x, 32, 7</ls>], 
¦ not finding out the way. 
<info or="710,akzetrajYa;711,akzetravid"/>
<info n="rev" pc="1308,3"/>
<info lex="m:f:n"/>

What is the rationale for this placement of ŚBr.? Note also that the accent of the supplement (a/-kzetra—jYa) is NOT present in the cdsl markup -- another kind of 'error'.

Are all of these 'naked-ls' supplement entries treated in a systematic way? Are they all marked as <info n="rev">?

funderburkjim commented 2 months ago

I forgot to mention that I examined the <OR/> items in add3d, and did repositioning to have a 'group' represented in markup by sequential entries (i.e. no intervening entries).

I think the same can be done systematically in all the 6000+ <info or="..."/> grouped entries in mw.txt; will open another issue for this.
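
A rough sketch of the kind of check involved, under the assumption that a group is listed as `L,k1` pairs separated by ';' inside `<info or="..."/>` (as in the akzetrajYa entry above). The function names are illustrative, not from an existing program.

```python
import re

OR_RE = re.compile(r'<info or="([^"]*)"')

def parse_or_group(info_line):
    """Return the L values listed in an <info or="L,k1;L,k1;..."/> line, else []."""
    m = OR_RE.search(info_line)
    if m is None:
        return []
    return [item.split(",", 1)[0] for item in m.group(1).split(";") if item]

def is_sequential(group_Ls, ordered_Ls):
    """True if the group's members sit at consecutive positions in ordered_Ls.

    Raises ValueError if a member of the group is absent from ordered_Ls.
    """
    positions = sorted(ordered_Ls.index(L) for L in group_Ls)
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))
```

Groups for which `is_sequential` is false would be the candidates for repositioning.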

funderburkjim commented 2 months ago

Obviously there are still open questions about the supplement. Let them be addressed in other issues, as interest arises.

Closing this issue now.

Andhrabharati commented 2 months ago

I think the same can be done systematically in all the 6000+ <info or="..."/> grouped entries in mw.txt; will open another issue for this.

@funderburkjim

are you going to be on this piece of work next, or taking some break (so that I might take-up what I had mentioned above)?

funderburkjim commented 2 months ago

@Andhrabharati I was thinking of starting this piece of work now. When that is done, I can return to AE (or other non-mw work) while you undertake the thorough job of looking at all the suppl. matter in mw. Does this informal 'schedule' work for you?

Andhrabharati commented 2 months ago

PERFECT! (I was just about to remind you of AE!!)
