Alternate headwords for pw

funderburkjim commented 9 months ago

We tackle the task of generating alternate headwords for the pw dictionary.

Preliminary outline of the approach:

Filter entries based on the first line of data (the line after the metaline)
Parse the implied headwords (based on the broken bar in that first line of entry)
- recognize also hom and (?) roots
Use this parse to construct k2 of the metaline base. When there is more than one headword, this will result in a comma-separated list in k2
construct parallel list of k1 from the list of k2.
construct pw_hwextra.txt (for csl-orig) from the k1-k2 list
- this will generate essentially duplicate entries in pw.xml for the extra 'alternate' headwords.
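
A minimal sketch of the k2-to-k1 step in the outline, assuming k1 is simply k2 with the grammarian flag `*` and the SLP1 accent characters (`/`, `\`, `^`) removed, as metalines such as `<k1>jvalana<k2>jva/lana` suggest. The function name is hypothetical and homonym handling is left out.

```python
import re

# Assumption: accents are written with / \ ^ and the grammarian flag with *;
# both are dropped when forming k1.
_K1_STRIP = re.compile(r"[*/\\^]")

def k1_list_from_k2(k2: str) -> list[str]:
    """Split a comma-separated k2 field and derive one k1 per item."""
    items = [item.strip() for item in k2.split(",") if item.strip()]
    return [_K1_STRIP.sub("", item) for item in items]

print(k1_list_from_k2("jva/lana"))      # ['jvalana']
print(k1_list_from_k2("*AmravAwaka"))   # ['AmravAwaka']
```

Each k1 in the resulting list would then pair with its k2 item to form one line of the k1-k2 list.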

Note: no attempt to generate alternate headwords from upasargas of verb entries.

funderburkjim commented 9 months ago

markup observations

<h> appears in metalines in pw-main, but not in pw-vn.
- the present work will remove that <h>N and embed it into k2 (cf. GRA markup)
* - 29339 matches for "[*].*¦" . Boehtlingk.
- A word, a meaning, a construction or a gender that has so far only been listed by grammarians or lexicographers has been designated with *. ref:
√ indicates a root . Andhrabharati addition. 3103 matches for "√{#.*¦"
!√ indicates a 'denominative' root. Andhrabharati addition. 1294 matches for "!√{#.*¦"

√ and ! do not appear in k2 of metaline.
* appears in k2 of metaline: 29069 matches for "<k2>[*]"

Andhrabharati commented 9 months ago

@funderburkjim

It appears that you are analyzing the pw file data, for marking the prospective alt. HWs yourself.

I had mentioned earlier (in #104) that the header portion may be looked at to get these words; but there appear to be more entries that contain alt. HWs (which I had missed before in my posted file that formed the base for the cdsl version) after the broken bar.

Here are a few (43 no.s) such L-entries on a quick searching--

16678, 22390, 23281, 26293, 28410, 30597, 34090, 34556, 36315, 37930, 39355, 39831, 39852, 43702, 43761, 44078, 44931, 46971, 55339, 55378, 56092, 59172, 61689, 61801, 74511, 77163, 80864, 87626, 88417, 89051, 94286, 97985, 100189, 102651, 103657, 110117, 110192, 112382, 113429, 115433, 125488, 130735, 133969

There could be more like these, and I hope you will try to get those as well.

--------------------------------------------

PS. You had tried to get to the <7k figure that I mentioned at #104, and suggested discarding the hom-tag entries in the process; but there are a little over 200 entries having a hom-tag that contain alt. HWs. [My earlier count was 6986.]

Andhrabharati commented 9 months ago

Once you finish your exercise, I might be able to compare your file with my version (not posted so far) for any missings/changes.

funderburkjim commented 9 months ago

Yes, I am working on those alt-headwords now. Will share my work with you for comparison, perhaps later this week.

Andhrabharati commented 9 months ago

@funderburkjim

Now, I've looked for the 'missed' alt. HW entries in the erstwhile pwkvn file, and found 15 such.

<L>200221<pc>1-284-a<k1>ajagAva<k2>ajagAva
<L>203121<pc>2-299-c<k1>KaqIna<k2>KaqIna
<L>203284<pc>3-247-a<k1>aganDi<k2>aganDi
<L>203898<pc>3-253-c<k1>arTasaMhata<k2>arTasaMhata
<L>204099<pc>3-256-a<k1>AmravAwaka<k2>*AmravAwaka
<L>204416<pc>3-259-b<k1>kalvowaka<k2>kalvowaka
<L>204577<pc>3-261-a<k1>gArDapfzwa<k2>gArDapfzwa
<L>205639<pc>4-299-a<k1>cUlha<k2>cUlha
<L>206832<pc>5-249-c<k1>ikawI<k2>*ikawI
<L>207055<pc>5-252-a<k1>kikviwa<k2>kikviwa
<L>207313<pc>5-255-a<k1>jvalana<k2>jva/lana
<L>209005<pc>6-302-b<k1>devaniSrayaRI<k2>devaniSrayaRI
<L>209263<pc>6-305-b<k1>mAtaNgavedi<k2>*mAtaNgavedi
<L>213195<pc>7-312-a<k1>avimanas<k2>avimanas
<L>213697<pc>7-315-c<k1>asevana<k2>asevana

These entries have the alt. HWs after the broken bar, as are the pwk (main) entries listed above.

funderburkjim commented 9 months ago

systematic additional headwords

My approach to determine NEW secondary headwords for an entry is based on analysis of the 'broken-bar' line of the entry.

158370 is the current count for pw.txt entries.
- grep -E '^<L>' ../temp_pw_2.txt | wc -l
8444 entries are previously identified as multiple headwords
- grep -E '{#.*?#}.*{#.*?#}.*¦' ../temp_pw_2.txt | wc -l
3103 entries are identified as roots
- grep -E '√.*¦' ../temp_pw_2.txt | wc -l

Excluding these two groups, there are approx. 31054 entries which might have multiple headwords.
- grep -E '^[^{√]*{#[^#]*#}[^{]*¦.*{#' ../temp_pw_2.txt | wc -l
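
The three grep patterns can be combined into one classifier. This is only a sketch of the grouping described here, written in Python; the function name and labels are hypothetical, and the order of the tests (multiple, then root, then candidate) is an assumption about how "excluding these two groups" was applied.

```python
import re

MULTI = re.compile(r"\{#.*?#\}.*\{#.*?#\}.*¦")            # two {#..#} before ¦
ROOT = re.compile(r"√.*¦")                                  # root marker before ¦
CANDIDATE = re.compile(r"^[^{√]*\{#[^#]*#\}[^{]*¦.*\{#")   # one head, more {#..#} after ¦

def classify(line: str) -> str:
    """Label a broken-bar line with the first group whose pattern it matches."""
    if MULTI.search(line):
        return "multiple"
    if ROOT.search(line):
        return "root"
    if CANDIDATE.search(line):
        return "candidate"
    return "other"

print(classify("{#akulI#}¦ <ab>v. l.</ab> für {#aNkulI#}."))  # candidate
```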

But there are many patterns in these 31054 entries which disqualify the entry from implying extra headwords. For example, in {#akulI#}¦ <ab>v. l.</ab> für {#aNkulI#}., {#aNkulI#} is not an extra headword. The pattern is <ab>v. l.</ab> für {# RESTRICTED to cases with just one {#X#} after ¦
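
One way to express this particular exclusion, as a sketch only: the function name is hypothetical, and "just one {#X#} after ¦" is read as exactly one `{#...#}` span in the text following the broken bar.

```python
import re

SLP_SPAN = re.compile(r"\{#[^#]*#\}")
VL_FUER = re.compile(r"<ab>v\. l\.</ab> für \{#")

def is_vl_only(line: str) -> bool:
    """True if the only {#X#} after ¦ is introduced by 'v. l. für'."""
    _head, bar, body = line.partition("¦")
    if not bar:
        return False  # not a broken-bar line
    return len(SLP_SPAN.findall(body)) == 1 and VL_FUER.search(body) is not None

print(is_vl_only("{#akulI#}¦ <ab>v. l.</ab> für {#aNkulI#}."))  # True
```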

After excluding this and many other patterns, there remain 10788 candidates. Certainly some of these should also be excluded; but at this point, the exclusion-by-pattern approach has become unproductive.

Thus, the approach changes course to apply patterns to INCLUDE subsets of the candidates. For such an included entry, there need to be changes to:

the broken bar line ( ¦ moves after last sub-headword phrase {#X#} )
the <k2> field of metaline.

The file temp_change_3_01.txt shows the changes made for the pattern: {#°tA#} or {#°tva#} (is the only {#X#} after ¦ ). There are 915 cases here. Later pw.txt will be changed by applying the changes in this file, and similar changes for other patterns.
Eventually, this approach should get most of the implied extra headwords.

@Andhrabharati Before proceeding much further, I wanted your take on this approach.

Andhrabharati commented 9 months ago

I would suggest 'restricting' this phase to mark and bring-out the 'primary HWs' (as I termed them) [single or 'grouped' entries], that occur at the beginning of the entry in the printed lexicon.

We definitely need to bring-out other 'inner' HWs also, that occur in multiple ways, but this could be done in another/next phase - (a) in-line HWs [that are within braces in the running text] (b) implied HWs [mostly with a preceding "also" etc.] (c) indicative HWs [like the -tA & -tva varieties, that mostly do not have any 'objective' body, but just a mention of the word] (d) variant form HWs [with a preceding "written", "v. l.", "w. r." etc.]

And then, we should look for the composite/compound words that occur inside the body portion of the above entries, and suitably bring them out.

BTW, I see 158375 <L> entries, not 158370 as mentioned above, in the combined pw.txt.

Andhrabharati commented 9 months ago

My personal opinion is that we should mark these 'secondary' HWs with <div n="x" > style [as done in GRA], "x" covering various forms that we come across in the particular lexicon.

And then, list those various groups within the 'main' entry somewhere (like the separate althw file seen in some cdsl works), to come under the "search" criterion.

This approach would retain the digital text in a form closer to the printed work. [BTW, this is the approach that I took in revising the MW-dev data; my ultimate goal being to bring all the cdsl works in the similar format, making it a 'theme' all across the works.]

funderburkjim commented 9 months ago

BTW, I see 158375 <L> entries, not 158370

I merged 5 entries:

parvan L=96646 merge into 69945 pora
pravAla L=73144 merge into  73143, pravAqa
AzwakIya L=16300 merge into 16299 Azwaka
Dru L= 55950 merge into Dru 55949
peSI L=69764 merge into peSI 69763

funderburkjim commented 9 months ago

additional denominative roots

475 matches for "^!√{#[^#]*y#}¦, {#°yat[ie]#}"  already marked
102 matches for "^{#[^#]*y#}¦, {#°yat[ie]#}"  add !√ markup
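
A sketch of how the second pattern could drive the markup change, assuming the marker is simply prefixed to the line. The function name and the sample line are hypothetical; lines that begin with a hom tag or `*` are not covered by this pattern.

```python
import re

# Same pattern as the "102 matches" search, with the match captured for reuse.
UNMARKED_DENOM = re.compile(r"^(\{#[^#]*y#\}¦, \{#°yat[ie]#\})")

def mark_denominative(line: str) -> str:
    """Prefix !√ to an unmarked denominative-root line; leave others as-is."""
    return UNMARKED_DENOM.sub(r"!√\1", line, count=1)

sample = "{#tilakay#}¦, {#°yati#} ..."   # hypothetical line body
print(mark_denominative(sample))
```

Lines that already start with `!√` do not match the anchored pattern, so applying the function twice has no further effect.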

Andhrabharati commented 9 months ago

I merged 5 entries:

Looked for other entries with similar pattern ¦\n<LEND, and found that 5 entries (44904, 49112, 53788, 58874 and 113078) have missed the body portion.
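
A minimal sketch of that search, assuming each entry is a metaline starting with `<L>`, followed by body lines and a closing `<LEND>` line. The function name and the sample entries are hypothetical.

```python
import re

# Metaline, then a single line that ends at the broken bar, then <LEND>.
EMPTY_BODY = re.compile(r"^<L>([^<]+)<pc>[^\n]*\n[^\n]*¦\n<LEND>", re.MULTILINE)

def entries_missing_body(text: str) -> list[str]:
    """Return the L-numbers of entries whose text stops at the broken bar."""
    return [m.group(1) for m in EMPTY_BODY.finditer(text)]

sample = (
    "<L>1<pc>1-001-a<k1>a<k2>a\n{#a#}¦ body text\n<LEND>\n\n"
    "<L>2<pc>1-001-a<k1>b<k2>b\n{#b#}¦\n<LEND>\n"
)
print(entries_missing_body(sample))  # ['2']
```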

Andhrabharati commented 9 months ago

102 matches for "^{#[^#]*y#}¦, {#°yat[ie]#}" add !√ markup

All these occur in the erstwhile pwkvn portion; I seem to have skipped marking them. Probably there could be some places missing the √ mark as well in this portion.

BTW, found one entry L-45920, which has √ mark, but should be with the !√ mark; and it has a typo tilakaya for tilakay

funderburkjim commented 9 months ago

other entries with similar pattern ¦\n<LEND

Those five had missing text in cdsl pw. I have added the text. temp_missing_body_AB.txt

Note: corrected the tilakaya entry.

funderburkjim commented 9 months ago

Probably there could be some places missing the √ mark

I generated a list of possible missing √ mark entries by search for headword pattern consonant-root-consonant

regex without hom (python syntax)
r'^[*]?{#[kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|][aAiIuUfFxXeEoOMH][kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|]#}¦'

regex with hom:  
r'^<hom>[^<]*</hom> [*]?{#[kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|][aAiIuUfFxXeEoOMH][kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|]#}¦'

513 found.
Manually examined all.

358 need √ mark (251 of these are in the pwkvn entries)
155 don't need √ mark.

details: temp_possible_roots_edit.txt
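
For reference, the two regexes can be folded into one by making the hom prefix optional. This is a sketch, not the script used for the count above; the names are hypothetical and the sample lines are invented.

```python
import re

CONS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|"
VOWEL = "aAiIuUfFxXeEoOMH"

# consonant-vowel-consonant headword, optional <hom> tag and optional *
CVC_HEAD = re.compile(
    r"^(?:<hom>[^<]*</hom> )?[*]?\{#[" + CONS + "][" + VOWEL + "][" + CONS + r"]#\}¦"
)

def possible_missing_root(line: str) -> bool:
    """True for an unmarked line whose headword has the shape C-V-C."""
    return CVC_HEAD.match(line) is not None

print(possible_missing_root("{#pac#}¦ ..."))   # True
print(possible_missing_root("√{#pac#}¦ ..."))  # False: already marked
```

Such a test only proposes candidates; as noted, each hit still needs manual examination.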

@Andhrabharati Do you agree that √ mark should be added (before ¦) for these 358?

Andhrabharati commented 9 months ago

Yes pl., somehow I had skipped these markings!!

Andhrabharati commented 9 months ago

Now, I've looked for the 'missed' alt. HW entries in the erstwhile pwkvn file, and found 15 such.

Found another 8 entries in pwkvn portion, that come under this-- L-200048, L-200334, L-201819, L-206328, L-208193, L-209653, L-214672, L-221290

Andhrabharati commented 9 months ago

@Andhrabharati Before proceeding much further, I wanted your take on this approach.

Just curious to know your conclusion on how to proceed further on the task, @funderburkjim !!

funderburkjim commented 9 months ago

interim progress

I've marked the additional 358 roots.

'primary HWs' (as I termed them) [single or 'grouped' entries], that occur at the beginning of the entry in the printed lexicon

'close' althws

I like this idea and am proceeding to see if I find any more in addition to those you have mentioned in above comments.

Andhrabharati commented 9 months ago

Glad to hear this, @funderburkjim !

Working with a 'common' thinking/process definitely makes the collaborative effort easier, making the comparison (between the two works) quicker and more fruitful.

Andhrabharati commented 9 months ago

I have many more entries that come under the alt.HW type & the 'root' type now.

funderburkjim commented 9 months ago

altheadwords

temp_change_1a_2_althws.txt

This file contains changes for alternate headwords from 2 sources:

66 that you have listed above
26 that I have added

@Andhrabharati Please review and provide corrections as needed.

Note: I have used the GRA model for deriving extra entries for PW (pw_hwextra.txt) from <k2>. Currently: 2516 extra headwords from 1589 <k2>s.

many more entries that come under the alt.HW type & the 'root' type now.

If you provide these, I'll make changes for them.

Andhrabharati commented 9 months ago

@funderburkjim

My file now contains 713 (main) and 682 (vn) lines differing with the CDSL (combined) file, ignoring the meta-lines (as I did not populate the k2-field yet).

; 05: 93 entries - alternate headwords

If you post your full file (containing all the changes in your 5 steps), I can do a diff with my file and list out the differing lines.

funderburkjim commented 9 months ago

temp_pw_2.zip my current version

temp_change_pw_0_2.txt Changes from the current csl-orig pw.txt. All changes thus far were done while keeping the number of lines the same. This file shows the line-by-line diff.
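
Because the line count is unchanged, a diff of this kind reduces to comparing the two files position by position. A minimal sketch, with a hypothetical function name and no claim about the format of the actual change file:

```python
def changed_lines(old_lines: list[str], new_lines: list[str]) -> list[tuple[int, str, str]]:
    """Return (line number, old text, new text) for every line that differs."""
    if len(old_lines) != len(new_lines):
        raise ValueError("line counts differ; a line-by-line diff does not apply")
    return [
        (number, old, new)
        for number, (old, new) in enumerate(zip(old_lines, new_lines), start=1)
        if old != new
    ]

print(changed_lines(["a", "b", "c"], ["a", "B", "c"]))  # [(2, 'b', 'B')]
```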

Further details in the pwkissues/issue106 folder.

Andhrabharati commented 9 months ago

Thank you @funderburkjim for the files.

Seen that just over 900 lines (616: main and 315: vn) are differing between our files.

Will go through them tomorrow and after necessary corrections (if any) in my file, shall post the differing lines for your perusal and further action.

Andhrabharati commented 9 months ago

Here are the files that I had made--

separated the VN data from the combined CDSL file: pwkvn_2 (CDSL).txt
deleted the metaline to ease comparing with my file: temp_pwkvn_2 (CDSL).txt

And the corresponding file from my side: temp_pwkvn_2 (AB).txt [Pl. note that my file does not contain the trailing <info(.*)> tag.]
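
The two preparation steps (dropping metalines and ignoring the trailing info tag) could be sketched as below. The function name is hypothetical, and the shape of the info tag in the sample is an assumption.

```python
import re

# One or more <info ...> tags at the very end of a line.
TRAILING_INFO = re.compile(r"(?:\s*<info[^>]*>)+\s*$")

def comparable_lines(lines: list[str]) -> list[str]:
    """Drop metalines and strip trailing <info ...> tags before comparing."""
    result = []
    for line in lines:
        if line.startswith("<L>"):
            continue  # metaline: not compared
        result.append(TRAILING_INFO.sub("", line))
    return result

sample = ["<L>1<pc>1-001-a<k1>a<k2>a", '{#a#}¦ text <info n="x"/>', "<LEND>"]
print(comparable_lines(sample))  # ['{#a#}¦ text', '<LEND>']
```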

After "incorporating" necessary corrections in my file, there are 450 differing lines in the VN portion with the CDSL file

some of these belong to the header portion that need to be carried into the metaline
some are just the relocation of the broken bar, not affecting the metaline, and
some are within the body portion, not affecting the metaline

Hope @funderburkjim wouldn't be having much issues in using my file. [I can generate (and post) the diff. file, in case he feels any difficulty with the above AB file.]

Andhrabharati commented 9 months ago

Now, coming to the pw main data, here are 206 header portions with dhAtu (√) markup-- dhAtu header lines.txt

Hope, this is convenient enough to be "used" by Jim.

-----------------------------

There still remain about 390 diff. lines, out of which 34 lines contain the _ (underscore) character.

Though most of those could be removed as done by Jim (for slp1 has no scope for confusion of vowel-hiatus [but I wonder if these would all "pass" the round-robin test of conversion to another script like Devanagari or IAST and back!!]), I feel some of them need to be retained as they denote a 'space' character within the Devanagari string.

Andhrabharati commented 9 months ago

Another 30 lines have the <ls n="Chr."> markup by Jim, which do not point to Boehtlingk's Chrestomathie at all; I had marked them with the 〔...〕markup, so that they would be easily traceable for properly tagging to their resp. works.

BTW, there are quite a few such places in the pwkvn portion as well, which I had already posted above (with the same markup).

Andhrabharati commented 9 months ago

Note: I have used the GRA model for deriving extra entries for PW (pw_hwextra.txt) from <k2>. Currently: 2516 extra headwords from 1589 <k2>s.

@funderburkjim

Would you mind explaining about this 1589 number? I see a huge number of entries that count to nearly 5-6 times of this!

funderburkjim commented 9 months ago

temp_pwkvn_2.AB.txt

temp_change_2_3_04.txt In these (11) instances, AB removes the √ markup. However, Jim thinks these ARE roots (cf. the German translation), hence is retaining the √ markup. @Andhrabharati Agree?

The remaining 439 (450 - 11) AB changes are agreed by Jim. Details in change_2_3.txt

They are of 3 types:

01: 149 add !√ for denominative roots
02: 243 add √ for other roots
03: 47 change ¦ placement. 40 of these also require changes to metaline.
- L=207313 ? Diff in accent only. This impacts interpretation by 'althws' program in csl-orig
- L=208905 ? two identical {#X#} before ¦. Similarly impacts althws interpretation

The 205 in dhatu.header.lines remain to be processed by Jim.

30 lines have the <ls n="Chr."> markup by Jim

@Andhrabharati Please provide details along with sample of your markup so cdsl can remove inappropriate markup.

Andhrabharati commented 9 months ago

L=207313 ? Diff in accent only. This impacts interpretation by 'althws' program in csl-orig

There are more such occurrences throughout the CDSL, incl. the pwk.

L=208905 ? two identical {#X#} before ¦. Similarly impacts althws interpretation

The 2nd word has a typo, that I had missed before!

However, pwkvn-7 & SCH have the entries correctly rendered.

---------------------------------

Pl. have a look at the change_2_3 (AB).txt having 'updated' the corrections in Jim's file.

Andhrabharati commented 9 months ago

30 lines have the <ls n="Chr."> markup by Jim

@Andhrabharati Please provide details along with sample of your markup so cdsl can remove inappropriate markup.

There are 60 places in pwkvn portion (and 16 places in the CDSL pwk itself) that have the 〔...〕, which are to be associated with their proper sources.

And, there are another 27 (vn) + 59 (main) cases of <ls>[0-9]+ that need to be handled similarly.

-----------------------------

Here is how I had worked out the Chr. taggings (at the very initial days)--

pwk cites Boehtlingk's 2nd ed. of the Chrestomathie (1877) [which runs to 329 pp. (main text) and has no more than 34 lines on any page], unless Benfey's ed. is specifically mentioned.

Chr.">([0-9]+),[0-9]{4} : 12 cases
Chr.">([0-9]+),[0-9]{3} : 57 cases
and then, some cases where the page no. is more than 329 or line no. is 34 or more were manually verified and corrected where needed.
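
The page and line limits can be turned into a simple filter. This sketch follows the rule as stated (page above 329, or line number 34 or more); the function name is hypothetical, and only the first page,line pair inside each tag is examined.

```python
import re

CHR_CITE = re.compile(r'<ls n="Chr\.">(\d+),(\d+)')
MAX_PAGE = 329    # pages in the main text of the 2nd ed.
LINE_LIMIT = 34   # line numbers of 34 or more are flagged for checking

def suspicious_chr(text: str) -> list[tuple[int, int]]:
    """Return (page, line) pairs that fall outside the expected ranges."""
    flagged = []
    for match in CHR_CITE.finditer(text):
        page, line = int(match.group(1)), int(match.group(2))
        if page > MAX_PAGE or line >= LINE_LIMIT:
            flagged.append((page, line))
    return flagged

print(suspicious_chr('<ls n="Chr.">1,10</ls> <ls n="Chr.">330,5</ls>'))  # [(330, 5)]
```

Flagged citations are only candidates for manual verification, as in the procedure described.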

Finally, here are the 30 remaining instances that I had untagged as non-Chr. citations-- non-Chr citations.txt

Andhrabharati commented 9 months ago

Here are the changed lines-- non-Chr. citation lines.txt

BTW, it is noted that ~1000 Chr. instances having multiple citations together are still not expanded as individual (separate) citations. For example, the entry aMhas has <ls n="Chr.">1,10. 6,18</ls> that does not lead to the 6,18 link.
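
One possible shape for such an expansion, as a sketch only: the output form (one `<ls>` element per page,line pair, joined by a space) is an assumption, the function name is hypothetical, and tags whose content is not a plain ". "-separated run of page,line pairs are left untouched.

```python
import re

CHR_TAG = re.compile(r'<ls n="Chr\.">([^<]*)</ls>')
PAIR = re.compile(r"\d+,\d+")
PAIR_RUN = re.compile(r"\d+,\d+(?:\. \d+,\d+)+\.?")   # e.g. "1,10. 6,18"

def expand_chr(text: str) -> str:
    """Split a multi-citation Chr. tag into one tag per page,line pair."""
    def split_tag(match: re.Match) -> str:
        content = match.group(1)
        if not PAIR_RUN.fullmatch(content):
            return match.group(0)  # single or irregular citation: keep as-is
        return " ".join(f'<ls n="Chr.">{pair}</ls>' for pair in PAIR.findall(content))
    return CHR_TAG.sub(split_tag, text)

print(expand_chr('<ls n="Chr.">1,10. 6,18</ls>'))
```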

Probably this may be taken up sometime sooner, before it skips the mind.

Andhrabharati commented 9 months ago

While still at the B.'s Chrestomathie, seen that the entries aMhas and aparvan have 1,10 as a Chr. citation; but the aparvan citation should point to RĀJAN. (previous ls-work) and not to Chr.

Probably there would be more such instances that need untagging <ls n="Chr.">

Andhrabharati commented 9 months ago

temp_change_2_3_04.txt In these (11) instances, AB removes the √ markup. However, Jim thinks these ARE roots (cf. the German translation), hence is retaining the √ markup. @Andhrabharati Agree?

Here is my response to each of the 11 entries-- temp_change_2_3_04 (AB).txt

As such, I stand by my earlier markup in these entries.

Andhrabharati commented 9 months ago

Just by accident, seen few (<20) instances of 'ks' where 'kz' should've been there within the slp1 strings {#...#}. There might be such cases in other CDSL works as well, that need to be identified and corrected.
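
A quick way to list such places, sketched in Python with a hypothetical function name and an invented sample line. It only flags the strings; whether each one should become 'kz' or is some other typo still has to be decided by hand.

```python
import re

SLP_SPAN = re.compile(r"\{#([^#]*)#\}")

def slp_with_ks(line: str) -> list[str]:
    """Return the {#...#} strings on a line that contain 'ks'."""
    return [text for text in SLP_SPAN.findall(line) if "ks" in text]

print(slp_with_ks("{#aksa#}¦ und {#akza#}"))  # ['aksa']
```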

Here is the final diff. file (353 differences), ignoring the 34 '_' instances (which need to be attended to depending on Jim's response) [and 30 non-Chr. tags and 206 dhAtu tags which are posted above]-- diff_pw.txt

Same conditions as at my earlier post apply to this.

The 205 in dhatu.header.lines remain to be processed by Jim.

@funderburkjim

After you process these and other corrections as listed in the above posts (probably leaving the Chr. expansions), request you to post the full file(s) again [so that I can redo the comparison with my file(s) and go to the next step].

Andhrabharati commented 9 months ago

Just by accident, seen few (<20) instances of 'ks' where 'kz' should've been there within the slp1 strings {#...#}. There might be such cases in other CDSL works as well, that need to be identified and corrected.

Thought of looking at MW for such instances and found 15 such!! [While some of these are to be changed to 'kz', others are with typo errors.] mw- bad slp words having 'ks'.txt

@drdhaval2785 Pl. take a note of this and do the needful, as I do not want Jim to divert from the pwk work for now.

drdhaval2785 commented 9 months ago

Corrected in csl-orig repository for MW as per above comment and above commit.

Andhrabharati commented 9 months ago

(as I did not populate the k2-field yet).

After a long gap, I have got into some mood to take up big works; the first task I did is to populate the k2-field with comma-separated lists (as applicable).

Now my file has 1567 entries in pwk-vn portion (with 2440 extra words) and 7288 entries in pwk-main portion (with 9782 extra words).

funderburkjim commented 9 months ago

That's just what I am working on today! Not sure how to proceed.

I could send my file WITHOUT resolution of the k2 to broken bar
OR I could continue an independent derivation of k2 from broken bar, and then send that file

What do you suggest?

Andhrabharati commented 9 months ago

Option 2.

gasyoun commented 8 months ago

@funderburkjim how deep are you?

Andhrabharati commented 8 months ago

I did the said work in two days; so estimate Jim to take anywhere between 7-10 days (if he is on this work alone), @gasyoun !!

funderburkjim commented 8 months ago

temp_pw_4.zip has my latest version.

change files

relative to my previous post:

change_2_3.txt (sections 04-09)
change_3_4.txt The k2 markup
- does not deal with the [H] type multi-headwords
- Several spelling changes noted in the bb-line.

I think these changes take into account the various AB mentions, with exception of the ~1000 Chr. instances having multiple citations together are still not expanded as individual (separate) citations.

count comparison:

pw body part:   AB has 250+ more than Jim
  7004 entries with k2 multiple headwords  (AB 7288)
  9536 total k2 headwords  (AB 9782)
pw vn part  (AB and Jim about the same)
  1569 entries with k2 multiple headwords  (AB 1567)
  2447 total k2 headwords  (AB 2440)
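
Counts of this kind can be reproduced from the metalines alone. A minimal sketch, assuming k2 is the last field of the metaline and that "total k2 headwords" means the items beyond the first in each comma-separated k2; the function name is hypothetical.

```python
import re

K2_FIELD = re.compile(r"<k2>([^<]*)")

def k2_counts(metalines: list[str]) -> tuple[int, int]:
    """Return (entries with a multi-item k2, extra headwords in those entries)."""
    multi_entries = 0
    extra_headwords = 0
    for metaline in metalines:
        found = K2_FIELD.search(metaline)
        if not found:
            continue
        items = [item for item in found.group(1).split(",") if item.strip()]
        if len(items) > 1:
            multi_entries += 1
            extra_headwords += len(items) - 1   # headwords beyond the first
    return multi_entries, extra_headwords

print(k2_counts(["<L>1<pc>1-001-a<k1>a<k2>a, b, c", "<L>2<pc>1-001-a<k1>d<k2>d"]))  # (1, 2)
```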

Will be interested to see source of the diffs, esp. in pw body part.

Andhrabharati commented 8 months ago

I did another round of revision in past two days and the current statistics are--

pwk-vn portion

1562 entries in (with 2436 extra words)
349 √ items and 176 !√ items

pwk-main portion

7297 entries in (with 9719 extra words)
2244 √ items and 978 !√ items

Shall look at the differences now with Jim's latest version and post the details.

Andhrabharati commented 8 months ago

Starting with the smaller portions--

pwkvn_4 differences-1 (metalines and header lines).txt

Here is the supporting document for the L-205237

Andhrabharati commented 8 months ago

It may be noted that I had introduced a new kind of markup of the filled-up portions at the entries.

Also I had separated the portions where an upasarga is clubbed together with a derived word (of the prefixed dhAtu), and also the multiple entries together were separated as individual entries within {#...#}.

And then, I had marked the prefix portions under the dhAtu entries with a <div n="p"> tag, and pushed them to new lines (such 'new' lines count to 433); this is done in the pwkvn portion to bring it into the same style as in the pwk main portion.

Many miscellaneous changes (that seemed appropriate) were also done on-the-fly.

Andhrabharati commented 8 months ago

Here are another two portions from pwkvn part--

pwkvn_4 differences-2 (dhAtu and denominative verbs).txt

pwkvn_4 differences-3 (splits and miscellaneous).txt

Andhrabharati commented 8 months ago

And here is how the CDSL file looks with the upasarga splits implemented--

pwkvn_4 (CDSL) -prefix splits.txt

Andhrabharati commented 8 months ago

If the above are carried in to the CDSL version and revised file is posted, I shall post the next comparison parts.

Andhrabharati commented 8 months ago

If I give a brief about the spl. markup introduced for the filled-up portions at the HW level, probably Jim might appreciate my idea and take up necessary action further (as I intended).
