Closed funderburkjim closed 8 months ago
<h>
appears in metalines in pw-main, but not in pw-vn.
<h>N
and embed it into k2 (cf. GRA markup)*
- 29339 matches for "[*].*¦"
. Boehtlingk.
3103 matches for "√{#.*¦"
1294 matches for "!√{#.*¦"
√ and !
do not appear in k2 of metaline
*
appears in k2 of metaline: 29069 matches for "<k2>[*]"
@funderburkjim
It appears that you are analyzing the pw file data, for marking the prospective alt. HWs yourself.
I had mentioned earlier (in #104) that the header portion may be looked at to get these words; but there appear to be more entries that contain alt. HWs (which I had missed before in my posted file that formed the base for the cdsl version) after the broken bar.
Here are a few such L-entries (43 in number), found on a quick search--
16678, 22390, 23281, 26293, 28410, 30597, 34090, 34556, 36315, 37930, 39355, 39831, 39852, 43702, 43761, 44078, 44931, 46971, 55339, 55378, 56092, 59172, 61689, 61801, 74511, 77163, 80864, 87626, 88417, 89051, 94286, 97985, 100189, 102651, 103657, 110117, 110192, 112382, 113429, 115433, 125488, 130735, 133969
There could be more like these; I hope you will try to get those as well.
--------------------------------------------
PS. You had tried to get to the <7k figure that I mentioned at #104, and suggested discarding the hom-tag entries in the process; but there are a little over 200 entries having a hom-tag that contain alt. HWs. [My earlier count was 6986.]
Once you finish your exercise, I might be able to compare your file with my version (not posted so far) for any omissions/changes.
Yes, I am working on those alt-headwords now. Will share my work with you for comparison, perhaps later this week.
@funderburkjim
Now, I've looked for the 'missed' alt. HW entries in the erstwhile pwkvn file, and found 15 such.
<L>200221<pc>1-284-a<k1>ajagAva<k2>ajagAva
<L>203121<pc>2-299-c<k1>KaqIna<k2>KaqIna
<L>203284<pc>3-247-a<k1>aganDi<k2>aganDi
<L>203898<pc>3-253-c<k1>arTasaMhata<k2>arTasaMhata
<L>204099<pc>3-256-a<k1>AmravAwaka<k2>*AmravAwaka
<L>204416<pc>3-259-b<k1>kalvowaka<k2>kalvowaka
<L>204577<pc>3-261-a<k1>gArDapfzwa<k2>gArDapfzwa
<L>205639<pc>4-299-a<k1>cUlha<k2>cUlha
<L>206832<pc>5-249-c<k1>ikawI<k2>*ikawI
<L>207055<pc>5-252-a<k1>kikviwa<k2>kikviwa
<L>207313<pc>5-255-a<k1>jvalana<k2>jva/lana
<L>209005<pc>6-302-b<k1>devaniSrayaRI<k2>devaniSrayaRI
<L>209263<pc>6-305-b<k1>mAtaNgavedi<k2>*mAtaNgavedi
<L>213195<pc>7-312-a<k1>avimanas<k2>avimanas
<L>213697<pc>7-315-c<k1>asevana<k2>asevana
These entries have the alt. HWs after the broken bar, as do the pwk (main) entries listed above.
My approach to determine NEW secondary headwords for an entry is based on analysis of the 'broken-bar' line of the entry.
grep -E '^<L>' ../temp_pw_2.txt | wc -l
grep -E '{#.*?#}.*{#.*?#}.*¦' ../temp_pw_2.txt | wc -l
grep -E '√.*¦' ../temp_pw_2.txt | wc -l
Excluding these two groups, there are approx. 31054 entries which might
have multiple headwords.
grep -E '^[^{√]*{#[^#]*#}[^{]*¦.*{#' ../temp_pw_2.txt | wc -l
But there are many patterns in these 31054 entries which disqualify the
entry from implying extra headwords. For example, in
{#akulI#}¦ <ab>v. l.</ab> für {#aNkulI#}.
the word {#aNkulI#} is not an extra headword.
The pattern is: <ab>v. l.</ab> für {# (RESTRICTED to cases with just one {#X#} after ¦).
After excluding this and many other patterns, there remain 10788 candidates. Certainly some of these should also be excluded; but at this point, the exclusion-by-pattern approach has become unproductive.
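The "v. l. für" exclusion described above can be sketched in Python roughly as follows. This is only an illustration, not the script actually used: the function name is hypothetical, and it assumes the broken-bar line of an entry is available as one string.

```python
import re

# Any SLP1 span of the form {#...#}.
SLP_SPAN = re.compile(r'\{#[^#]*#\}')
# The disqualifying context: "<ab>v. l.</ab> für {#X#}".
VL_FUER = re.compile(r'<ab>v\. l\.</ab> für \{#[^#]*#\}')

def is_vl_only(line):
    """True when the text after ¦ has exactly one {#X#} and it follows 'v. l. für'.

    Hypothetical helper: such a line would be excluded from the candidates.
    """
    if '¦' not in line:
        return False
    body = line.split('¦', 1)[1]
    spans = SLP_SPAN.findall(body)
    return len(spans) == 1 and VL_FUER.search(body) is not None

print(is_vl_only('{#akulI#}¦ <ab>v. l.</ab> für {#aNkulI#}.'))
```

Requiring exactly one span after ¦ keeps the rule conservative: a line with further {#X#} spans is left among the candidates for another pattern to decide.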
Thus, the approach changes course to apply patterns to INCLUDE subsets of the candidates. For such an included entry, there need to be changes to:
<k2> field of metaline.
The file temp_change_3_01.txt shows the changes made for the pattern: {#°tA#} or {#°tva#} is the only {#X#} after ¦.
There are 915 cases here.
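A rough Python sketch of this INCLUDE pattern is given below. It is an illustration under stated assumptions, not the code behind temp_change_3_01.txt: the function name is hypothetical, the sample line is invented, and it assumes that ° simply stands for the headword as printed before ¦ (accents and stem changes are ignored).

```python
import re

SLP_SPAN = re.compile(r'\{#([^#]*)#\}')

def abstract_suffix_headword(line):
    """Return the implied extra headword, or None if the pattern does not apply.

    Applies only when there is one {#X#} before ¦ and the only {#X#}
    after ¦ is {#°tA#} or {#°tva#}.
    """
    head, sep, body = line.partition('¦')
    if not sep:
        return None
    head_spans = SLP_SPAN.findall(head)
    body_spans = SLP_SPAN.findall(body)
    if len(head_spans) != 1 or body_spans not in (['°tA'], ['°tva']):
        return None
    # ASSUMPTION: ° abbreviates the headword exactly as written.
    return head_spans[0] + body_spans[0][1:]

# Invented sample line, for illustration only.
print(abstract_suffix_headword('{#akuwila#}¦, {#°tA#} <lex>f.</lex>'))
```

The returned word would then be a candidate for the <k2> field; whether it is appended there, and in what form, is decided by the change file and not by this sketch.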
Later pw.txt will be changed by applying the changes in this file, and similar changes for other patterns.
Eventually, this approach should get most of the implied extra headwords.
@Andhrabharati Before proceeding much further, I wanted your take on this approach.
I would suggest 'restricting' this phase to mark and bring-out the 'primary HWs' (as I termed them) [single or 'grouped' entries], that occur at the beginning of the entry in the printed lexicon.
We definitely need to bring-out other 'inner' HWs also, that occur in multiple ways, but this could be done in another/next phase - (a) in-line HWs [that are within braces in the running text] (b) implied HWs [mostly with a preceding "also" etc.] (c) indicative HWs [like the -tA & -tva varieties, that mostly do not have any 'objective' body, but just a mention of the word] (d) variant form HWs [with a preceding "written", "v. l.", "w. r." etc.]
And then, we should look for the composite/compound words that occur inside the body portion of the above entries, and suitably bring them out.
BTW, I see 158375 <L> entries, not 158370 as mentioned above, in the combined pw.txt.
My personal opinion is that we should mark these 'secondary' HWs in the <div n="x"> style [as done in GRA], "x" covering the various forms that we come across in the particular lexicon.
And then, list those various groups within the 'main' entry somewhere (like the separate althw file seen in some cdsl works), to come under the "search" criterion.
This approach would retain the digital text in a form closer to the printed work. [BTW, this is the approach that I took in revising the MW-dev data; my ultimate goal being to bring all the cdsl works in the similar format, making it a 'theme' all across the works.]
BTW, I see 158375 <L> entries, not 158370
I merged 5 entries:
parvan L=96646 merge into 69945 pora
pravAla L=73144 merge into 73143, pravAqa
AzwakIya L=16300 merge into 16299 Azwaka
Dru L= 55950 merge into Dru 55949
peSI L=69764 merge into peSI 69763
475 matches for "^!√{#[^#]*y#}¦, {#°yat[ie]#}" already marked
102 matches for "^{#[^#]*y#}¦, {#°yat[ie]#}" add !√ markup
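The "add !√ markup" step for the second search could look something like the Python sketch below. The regex is the one quoted above with the braces escaped; the function name and the sample lines are only illustrative.

```python
import re

# Unmarked denominative: {#...y#}¦, {#°yati#} or {#°yate#} at line start.
DENOM = re.compile(r'^(\{#[^#]*y#\}¦, \{#°yat[ie]#\})')

def add_denominative_mark(line):
    """Prefix !√ to a line matching the unmarked pattern; other lines are returned unchanged."""
    return DENOM.sub(r'!√\1', line, count=1)

print(add_denominative_mark('{#tilakay#}¦, {#°yati#}'))
```

Lines that already begin with !√ do not match, because the pattern requires {# directly at the start of the line, so running the step twice does not double the markup.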
I merged 5 entries:
Looked for other entries with the similar pattern ¦\n<LEND, and found that 5 entries (44904, 49112, 53788, 58874 and 113078) are missing the body portion.
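A search for the ¦\n<LEND pattern can be sketched in Python as below. This assumes the usual layout seen in this thread (a metaline starting with <L>, then the entry lines, then <LEND>); the function name and the sample records are made up for illustration.

```python
import re

META = re.compile(r'^<L>([^<]+)<pc>')

def entries_missing_body(lines):
    """Return L-numbers whose line ending in ¦ is directly followed by <LEND>."""
    missing = []
    current_l = None
    for i, line in enumerate(lines):
        m = META.match(line)
        if m:
            current_l = m.group(1)
        elif (line.rstrip().endswith('¦')
              and i + 1 < len(lines)
              and lines[i + 1].startswith('<LEND>')):
            missing.append(current_l)
    return missing

# Invented sample records, for illustration only.
sample = [
    '<L>1<pc>1-001-a<k1>a<k2>a', '{#a#}¦ some text', '<LEND>',
    '<L>2<pc>1-001-a<k1>b<k2>b', '{#b#}¦', '<LEND>',
]
print(entries_missing_body(sample))
```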
102 matches for "^{#[^#]*y#}¦, {#°yat[ie]#}" add !√ markup
All these occur in the erstwhile pwkvn portion; I seem to have skipped marking them. Probably there could be some places missing the √ mark as well in this portion.
BTW, found one entry L-45920, which has √ mark, but should be with the !√ mark; and it has a typo tilakaya for tilakay
other entries with similar pattern ¦\n<LEND
Those five had missing text in cdsl pw. I have added the text. temp_missing_body_AB.txt
Note: corrected the tilakaya entry.
Probably there could be some places missing the √ mark
I generated a list of possible missing √ mark entries by
search for headword pattern consonant-root-consonant
regex without hom (python syntax)
r'^[*]?{#[kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|][aAiIuUfFxXeEoOMH][kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|]#}¦'
regex with hom:
r'^<hom>[^<]*</hom> [*]?{#[kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|][aAiIuUfFxXeEoOMH][kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|]#}¦'
513 found.
Manually examined all.
details: temp_possible_roots_edit.txt
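For reference, the two regexes above can be applied in Python along these lines. The character classes are copied from the patterns quoted above; the function name is hypothetical, and the result is only a list of candidates, since every hit was examined manually.

```python
import re

CONS = 'kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|'
VOWEL = 'aAiIuUfFxXeEoOMH'
# consonant-vowel-consonant headword, optionally starred, followed by ¦
ROOT = r'[*]?\{#[' + CONS + '][' + VOWEL + '][' + CONS + r']#\}¦'

NO_HOM = re.compile('^' + ROOT)
WITH_HOM = re.compile('^<hom>[^<]*</hom> ' + ROOT)

def looks_like_unmarked_root(line):
    """True if the line starts with a three-letter C-V-C headword lacking the √ mark."""
    return bool(NO_HOM.match(line) or WITH_HOM.match(line))

print(looks_like_unmarked_root('{#gam#}¦ text'))
```

A line that already starts with √ or !√ is not matched, so only entries without the root mark are reported.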
@Andhrabharati Do you agree that √ mark should be added (before ¦) for these 358?
Yes pl., somehow I had skipped these markings!!
Now, I've looked for the 'missed' alt. HW entries in the erstwhile pwkvn file, and found 15 such.
Found another 8 entries in pwkvn portion, that come under this-- L-200048, L-200334, L-201819, L-206328, L-208193, L-209653, L-214672, L-221290
@Andhrabharati Before proceeding much further, I wanted your take on this approach.
Just curious to know your conclusion on how to proceed further on the task, @funderburkjim !!
I've marked the additional 358 roots.
'primary HWs' (as I termed them) [single or 'grouped' entries], that occur at the beginning of the entry in the printed lexicon
'close' althws
I like this idea and am proceeding to see if I find any more in addition to those you have mentioned in above comments.
Glad to hear this, @funderburkjim !
Working with a 'common' way of thinking/process definitely makes the collaborative effort easier, and makes the comparison (between the two works) quicker and more fruitful.
I have many more entries that come under the alt.HW type & the 'root' type now.
This file contains changes for alternate headwords from 2 sources:
@Andhrabharati Please review and provide corrections as needed.
Note: I have used the GRA model for deriving extra entries for PW (pw_hwextra.txt) from <k2>.
Currently: 2516 extra headwords from 1589 <k2>s.
many more entries that come under the alt.HW type & the 'root' type now.
If you provide these, I'll make changes for them.
@funderburkjim
My file now contains 713 (main) and 682 (vn) lines differing with the CDSL (combined) file, ignoring the meta-lines (as I did not populate the k2-field yet).
; 05: 93 entries - alternate headwords
If you post your full file (containing all the changes in your 5 steps), I can do a diff with my file and list out the differing lines.
temp_pw_2.zip my current version temp_change_pw_0_2.txt Changes from the current csl-orig pw.txt. All changes thus far were done while keeping the number of lines the same. This file shows the line-by-line diff.
Further details in the pwkissues/issue106 folder.
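Since the number of lines is kept the same, a line-by-line comparison of two versions is straightforward. The Python sketch below shows the idea only; the function name is hypothetical, and it does not reproduce the actual layout of temp_change_pw_0_2.txt.

```python
def differing_lines(old_lines, new_lines):
    """Return (line number, old text, new text) for every line that differs.

    Both versions must have the same number of lines.
    """
    if len(old_lines) != len(new_lines):
        raise ValueError('line counts differ: %d vs %d'
                         % (len(old_lines), len(new_lines)))
    return [(n, old, new)
            for n, (old, new) in enumerate(zip(old_lines, new_lines), start=1)
            if old != new]

# Invented sample data, for illustration only.
old = ['<L>1<pc>1-001-a<k1>a<k2>a', '{#gam#}¦ text', '<LEND>']
new = ['<L>1<pc>1-001-a<k1>a<k2>a', '√{#gam#}¦ text', '<LEND>']
for n, before, after in differing_lines(old, new):
    print(n, before, '->', after)
```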
Thank you @funderburkjim for the files.
Seen that just over 900 lines (616: main and 315: vn) are differing between our files.
Will go through them tomorrow and after necessary corrections (if any) in my file, shall post the differing lines for your perusal and further action.
Here are the files that I had made--
And the corresponding file from my side: temp_pwkvn_2 (AB).txt
[Pl. note that my file does not contain the trailing <info(.*)> tag.]
After "incorporating" necessary corrections in my file, there are 450 differing lines in the VN portion with the CDSL file
I hope @funderburkjim won't have much trouble using my file. [I can generate (and post) the diff. file, in case he finds any difficulty with the above AB file.]
Now, coming to the pw main data, here are 206 header portions with dhAtu (√) markup-- dhAtu header lines.txt
Hope, this is convenient enough to be "used" by Jim.
-----------------------------
There still remain about 390 diff. lines, out of which 34 lines contain the _ (underscore) character.
Though most of those could be removed as done by Jim (for slp1 has no scope for confusion of vowel-hiatus [but I wonder if these would all "pass" the round-trip test of conversion to another script like Devanagari or IAST and back!!]), I feel some of them need to be retained as they denote a 'space' character within the Devanagari string.
Another 30 lines have the <ls n="Chr."> markup by Jim, but do not point to Boehtlingk's Chrestomathie at all; I had marked them with the 〔...〕 markup, so that they would be easily traceable for properly tagging to their resp. works.
BTW, there are quite a few such places in the pwkvn portion as well, which I had already posted above (with the same markup).
Note: I have used the GRA model for deriving extra entries for PW (pw_hwextra.txt) from <k2>. Currently: 2516 extra headwords from 1589 <k2>s.
@funderburkjim
Would you mind explaining about this 1589 number? I see a huge number of entries that count to nearly 5-6 times of this!
temp_pwkvn_2.AB.txt
temp_change_2_3_04.txt In these (11) instances, AB removes the √ markup. However, Jim thinks these ARE roots (cf. the German translation), hence is retaining the √ markup. @Andhrabharati Agree?
The remaining 439 (450 - 11) AB changes are agreed by Jim. Details in change_2_3.txt
They are of 3 types:
The 205 in dhatu.header.lines remain to be processed by Jim.
30 lines have the <ls n="Chr."> markup by Jim
@Andhrabharati Please provide details along with sample of your markup so cdsl can remove inappropriate markup.
- L=207313 ? Diff in accent only. This impacts interpretation by'althws' program in csl-orig
There are more such occurrences throughout the CDSL, incl. the pwk.
- L=208905 ? two identical {#X#} before ¦. Similarly impacts althws interpretation
The 2nd word has a typo, that I had missed before!
However, pwkvn-7 & SCH have the entries correctly rendered.
---------------------------------
Pl. have a look at the change_2_3 (AB).txt having 'updated' the corrections in Jim's file.
30 lines have the <ls n="Chr."> markup by Jim
@Andhrabharati Please provide details along with sample of your markup so cdsl can remove inappropriate markup.
There are 60 places in pwkvn portion (and 16 places in the CDSL pwk itself) that have the 〔...〕, which are to be associated with their proper sources.
And, there are another 27 (vn) + 59 (main) cases of <ls>[0-9]+ that need to be handled similarly.
-----------------------------
Here is how I had worked out the Chr. taggings (at the very initial days)--
Boehtlingk's 2nd ed. of Chrestomathie (1877) has been cited in pwk [it runs to 329 pp. (main text) and has no more than 34 lines on any page], unless Benfey's ed. is specifically mentioned.
Chr.">([0-9]+),[0-9]{4} : 12 cases
Chr.">([0-9]+),[0-9]{3} : 57 cases
Finally, here are the 30 remaining instances that I had untagged as non-Chr. citations-- non-Chr citations.txt
Here are the changed lines-- non-Chr. citation lines.txt
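The page/line limits mentioned above (329 pages, at most 34 lines on a page) give a simple plausibility check for Chr. tags. The Python sketch below is an illustration only: the function name is hypothetical, it looks only at the first page,line pair in each tag, and a hit is merely a candidate for manual review.

```python
import re

CHR_REF = re.compile(r'<ls n="Chr\.">([0-9]+),([0-9]+)')
MAX_PAGE, MAX_LINE = 329, 34  # limits of the 2nd ed., as stated above

def suspicious_chr_refs(text):
    """Return Chr. tags whose first page,line pair exceeds the edition's limits."""
    bad = []
    for m in CHR_REF.finditer(text):
        page, line = int(m.group(1)), int(m.group(2))
        if page > MAX_PAGE or line > MAX_LINE:
            bad.append(m.group(0))
    return bad

print(suspicious_chr_refs('<ls n="Chr.">2,1234</ls> and <ls n="Chr.">1,10</ls>'))
```

This is broader than the two regexes above, which look only for 3- and 4-digit numbers after the comma; it would also flag, for instance, a two-digit line number above 34.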
BTW, it is noted that ~1000 Chr. instances having multiple citations together are still not expanded as individual (separate) citations.
For example, the entry aMhas has <ls n="Chr.">1,10. 6,18</ls>, which does not lead to the 6,18 link.
This should probably be taken up soon, before it slips the mind.
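One possible way to expand such grouped citations is sketched below in Python. The output convention (one <ls n="Chr."> element per page,line pair, separated by a space) is only an assumption for illustration and may not be what cdsl finally adopts; the function name is hypothetical. Tags containing anything other than plain page,line pairs are left untouched.

```python
import re

CHR_LS = re.compile(r'<ls n="Chr\.">([^<]*)</ls>')
PAGE_LINE = re.compile(r'[0-9]+,[0-9]+\.?')

def expand_chr(text):
    """Split a Chr. tag holding several page,line pairs into separate tags."""
    def repl(m):
        refs = PAGE_LINE.findall(m.group(1))
        # Expand only when the tag holds nothing but two or more pairs.
        if len(refs) < 2 or ' '.join(refs) != m.group(1).strip():
            return m.group(0)
        return ' '.join('<ls n="Chr.">%s</ls>' % r for r in refs)
    return CHR_LS.sub(repl, text)

print(expand_chr('<ls n="Chr.">1,10. 6,18</ls>'))
```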
While still at the B.'s Chrestomathie, seen that the entries aMhas and aparvan have 1,10 as a Chr. citation; but the aparvan citation should point to RĀJAN. (previous ls-work) and not to Chr.
Probably there would be more such instances that need untagging <ls n="Chr.">
temp_change_2_3_04.txt In these (11) instances, AB removes the √ markup. However, Jim thinks these ARE roots (cf. the German translation), hence is retaining the √ markup. @Andhrabharati Agree?
Here is my response to each of the 11 entries-- temp_change_2_3_04 (AB).txt
As such, I stand by my earlier markup in these entries.
Just by accident, I have seen a few (<20) instances of 'ks' where 'kz' should've been there within the slp1 strings {#...#}. There might be such cases in other CDSL works as well, that need to be identified and corrected.
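A search of this kind can be sketched in Python as follows (the function name and the sample line are only illustrative). It merely lists candidates; each hit still has to be checked by hand, since not every 'ks' is necessarily an error for 'kz'.

```python
import re

SLP_SPAN = re.compile(r'\{#([^#]*)#\}')

def spans_with_ks(line):
    """Return the contents of every {#...#} span on the line that contains 'ks'."""
    return [s for s in SLP_SPAN.findall(line) if 'ks' in s]

# Invented sample line, for illustration only.
print(spans_with_ks('{#aksa#}¦ text {#akza#} ks outside a span'))
```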
Here is the final diff. file (353 differences), ignoring the 34 '_' instances (which need to be attended to depending on Jim's response) [and 30 non-Chr. tags and 206 dhAtu tags which are posted above]-- diff_pw.txt
Same conditions as at my earlier post apply to this.
The 205 in dhatu.header.lines remain to be processed by Jim.
@funderburkjim
After you process these and other corrections as listed in the above posts (probably leaving the Chr. expansions), request you to post the full file(s) again [so that I can redo the comparison with my file(s) and go to the next step].
Just by accident, seen few (<20) instances of 'ks' where 'kz' should've been there within the slp1 strings {#...#}. There might be such cases in other CDSL works as well, that need to be identified and corrected.
Thought of looking at MW for such instances and found 15 such!! [While some of these are to be changed to 'kz', others are with typo errors.] mw- bad slp words having 'ks'.txt
@drdhaval2785 Pl. take a note of this and do the needful, as I do not want Jim to divert from the pwk work for now.
Corrected in csl-orig repository for MW as per above comment and above commit.
(as I did not populate the k2-field yet).
After a long gap, I have got into some mood to take up big works; the first task I did is to populate the k2-field with comma-separated lists (as applicable).
Now my file has 1567 entries in pwk-vn portion (with 2440 extra words) and 7288 entries in pwk-main portion (with 9782 extra words).
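Counts of this kind can be obtained from the metalines with a few lines of Python. The sketch below is an illustration under assumptions: the function name is hypothetical, the k2 value is taken to end at the next < (e.g. a following <h>), and "extra words" is read as the number of k2 items beyond the first.

```python
import re

K2 = re.compile(r'^<L>.*?<k2>([^<]*)')

def k2_stats(lines):
    """Return (entries with a multi-word k2, number of extra words in them)."""
    entries = 0
    extra = 0
    for line in lines:
        m = K2.match(line)
        if not m:
            continue  # not a metaline
        words = [w.strip() for w in m.group(1).split(',') if w.strip()]
        if len(words) > 1:
            entries += 1
            extra += len(words) - 1
    return entries, extra

# Invented sample records, for illustration only.
sample_meta = [
    '<L>1<pc>1-001-a<k1>a<k2>a',
    '<L>2<pc>1-001-a<k1>b<k2>b, c, d<h>1',
    '{#b#}¦ text, more text',
]
print(k2_stats(sample_meta))
```

If the two files are counted under different readings (extra words versus total k2 headwords), the totals will differ for that reason alone, so the same definition should be used on both sides before comparing.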
That's just what I am working on today! Not sure how to proceed.
What do you suggest?
Option 2.
@funderburkjim how deep are you?
I did the said work in two days; so estimate Jim to take anywhere between 7-10 days (if he is on this work alone), @gasyoun !!
temp_pw_4.zip has my latest version.
relative to my previous post:
[H] type multi-headwords
I think these changes take into account the various AB mentions, with the exception of the ~1000 Chr. instances having multiple citations together, which are still not expanded as individual (separate) citations.
pw body part: AB has 250+ more than Jim
7004 entries with k2 multiple headwords (AB 7288)
9536 total k2 headwords (AB 9782)
pw vn part (AB and Jim about the same)
1569 entries with k2 multiple headwords (AB 1567)
2447 total k2 headwords (AB 2440)
Will be interested to see source of the diffs, esp. in pw body part.
I did another round of revision in past two days and the current statistics are--
pwk-vn portion
pwk-main portion
Shall look at the differences now with Jim's latest version and post the details.
Starting with the smaller portions--
pwkvn_4 differences-1 (metalines and header lines).txt
Here is the supporting document for the L-205237
It may be noted that I had introduced a new kind of markup of the filled-up portions at the entries.
Also I had separated the portions where an upasarga is clubbed together with a derived word (of the prefixed dhAtu), and the multiple entries given together were also separated as individual entries within {#...#}.
And then, I had marked the prefix portions under the dhAtu entries with a <div n="p"> tag, and pushed them to new lines (such 'new' lines count to 433); this is done in the pwkvn portion to bring it into the same style as in the pwk main portion.
Many miscellaneous changes (that seemed appropriate) were also done on-the-fly.
Here are another two portions from pwkvn part--
And here is how the CDSL file looks with the upasarga splits implemented--
If the above are carried in to the CDSL version and revised file is posted, I shall post the next comparison parts.
If I give a brief about the spl. markup introduced for the filled-up portions at the HW level, probably Jim might appreciate my idea and take up necessary action further (as I intended).
We tackle the task of generating alternate headwords for pw dictionary.
Preliminary outline of the approach:
Note: no attempt to generate alternate headwords from upasargas of verb entries.