Alternate headwords for pw

funderburkjim commented 5 months ago

We tackle the task of generating alternate headwords for the pw dictionary.

Preliminary outline of the approach:

Filter entries based on the first line of data (the line after the metaline).
Parse the implied headwords (based on the broken bar in that first line of the entry).
- also recognize hom and (?) roots
Use this parse to construct the k2 field of the metaline. When there is more than one headword, k2 becomes a comma-separated list.
Construct a parallel list of k1 values from the list of k2 values.
Construct pw_hwextra.txt (for csl-orig) from the k1-k2 list.
- this will generate essentially duplicate entries in pw.xml for the extra 'alternate' headwords.
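The parsing steps of this outline can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for pw: it assumes headwords appear as `{#...#}` before the broken bar `¦`, and the helper names (`parse_headwords`, `k2_to_k1`, `alternate_headwords`) and the k1 normalization rule are hypothetical. The sample line is made up for the demonstration.

```python
import re

# Assumption: each headword in the first line is wrapped as {#...#}.
HEADWORD = re.compile(r"\{#(.+?)#\}")

def parse_headwords(first_line):
    """Return the headwords found before the broken bar, or [] if there is no bar."""
    head, bar, _body = first_line.partition("¦")
    if not bar:
        return []
    return HEADWORD.findall(head)

def k2_to_k1(k2):
    """Hypothetical normalization: keep ASCII letters only (drops accents, hyphens, etc.)."""
    return re.sub(r"[^a-zA-Z]", "", k2)

def alternate_headwords(first_line):
    """Build the k2 field, the parallel k1 list, and the extra (k1, k2) pairs."""
    k2_list = parse_headwords(first_line)
    k1_list = [k2_to_k1(k2) for k2 in k2_list]
    return {
        "k2": ",".join(k2_list),                   # comma-separated when several headwords
        "k1_list": k1_list,                        # parallel to the k2 list
        "extra": list(zip(k1_list, k2_list))[1:],  # candidates for the hwextra file
    }

# Made-up sample line, for illustration only.
result = alternate_headwords("{#a/MSa#}, {#aMSaka#}¦ m. ...")
print(result)
```

The first headword stays with the original entry; only the later ones would feed the list of 'alternate' headwords. The actual layout of pw_hwextra.txt is not shown here.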

Note: no attempt is made to generate alternate headwords from upasargas of verb entries.

funderburkjim commented 3 months ago

Regarding '_'

You are definitely right that a round-trip transcoding of X (slp1 -> hk -> slp1) does not result in X when X has certain properties (such as an 'ai' or 'au' hiatus, also 'bh', 'gh', and maybe a few other cases).
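A minimal sketch of the round-trip failure, using a deliberately partial mapping table (only the four letters mentioned here). It is not the cdsl transcoder; the function names are hypothetical.

```python
# Partial SLP1 -> HK table: just the letters needed for the demonstration.
SLP1_TO_HK = {"E": "ai", "O": "au", "B": "bh", "G": "gh"}
HK_TO_SLP1 = {hk: slp1 for slp1, hk in SLP1_TO_HK.items()}

def slp1_to_hk(text):
    """Replace each mapped SLP1 letter by its HK digraph; copy everything else."""
    return "".join(SLP1_TO_HK.get(ch, ch) for ch in text)

def hk_to_slp1(text):
    """Greedy scan: a known two-letter HK sequence always becomes one SLP1 letter."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in HK_TO_SLP1:
            out.append(HK_TO_SLP1[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

def round_trip(slp1):
    return hk_to_slp1(slp1_to_hk(slp1))

print(round_trip("prOga"))   # diphthong: survives the round trip
print(round_trip("prauga"))  # hiatus a+u: comes back as 'prOga'
```

The hiatus form and the diphthong form both become 'prauga' in hk, so the return trip cannot tell them apart.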

A similar comment applies to IAST in place of hk.

My view has been that iast and hk should be viewed as faulty and/or incomplete transcoding schemes for Devanagari. cdsl could take upon itself the task of extending hk and iast to 'remedy' such problems. But I have not thought the user benefit of such a task great enough to justify the effort, since such anomalies are rare.

While thinking about this, I noticed that the 'simple-search (input=simple)' display needs to be revised so that 'prauga' (MW) yields not only 'prOga' (slp1) but also 'prauga' (slp1).
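One way such a revision might generate both readings is to treat every 'ai' or 'au' in the input as either a diphthong or a hiatus. The sketch below covers only that one ambiguity and ignores everything else the simple search does; `slp1_candidates` is a hypothetical name.

```python
from itertools import product

# 'ai'/'au' in the simple input may be an SLP1 diphthong or a two-vowel hiatus.
DIPHTHONGS = {"ai": "E", "au": "O"}

def slp1_candidates(simple):
    """Return every SLP1 spelling obtained by reading each ai/au both ways."""
    pieces = []
    i = 0
    while i < len(simple):
        pair = simple[i:i + 2]
        if pair in DIPHTHONGS:
            pieces.append((DIPHTHONGS[pair], pair))  # diphthong first, then hiatus
            i += 2
        else:
            pieces.append((simple[i],))
            i += 1
    return ["".join(combo) for combo in product(*pieces)]

print(slp1_candidates("prauga"))
```

The number of candidates doubles with each 'ai'/'au' in the input, which stays small for ordinary headwords.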

Andhrabharati commented 3 months ago

> My view has been that iast and hk should be viewed as faulty and/or incomplete transcoding schemes for Devanagari.

I've seen that slp1 itself also has the drawback of failing in the round-trip conversion, deva - slp1 - deva (or slp1 - deva - slp1) at such places!!

funderburkjim commented 3 months ago

temp_pw_9b.txt

temp_pw_9b.zip

This incorporates almost all of AB's latest batch of changes. See also change_8b_9.txt, change_9_9a.txt and change_9a_9b.txt for how I analyzed the many different kinds of changes proposed by AB. See diff_9b_ab_2.txt for the differences between temp_pw_9b.txt and AB's final file pw.integrated.AB.v1.for.CDSL.txt.

The changes are also integrated into the displays (locally).

@Andhrabharati When you sign off on temp_pw_9b.txt, I'll install it at Cologne.

funderburkjim commented 3 months ago

> I've seen that slp1 itself also has the drawback of failing in the round-trip conversion, deva - slp1 - deva (or slp1 - deva - slp1) at such places!!

I'll believe it when I see it!

I doubt that the Ralph Bunker/Peter Scharf implementation of slp1-deva transcoding has an invertibility problem, but it may be that my implementation is imperfect.

When (if) you encounter such an instance, open a new issue and provide full details, so I can reproduce the problem, and hopefully correct any such imperfections.

Andhrabharati commented 3 months ago

> @Andhrabharati When you sign off on temp_pw_9b.txt, I'll install it at Cologne.

Great to see that practically no differences exist between the two versions.

Here are the final changes--

At two entries (L-17562 and L-73947) the hiatus was removed in the header portion but remained in the metaline.
The final form concluded at L-124385 prompted me to look for other places having "(besser", and I found 3 entries-- diff_9b-1.txt
The SUrpa°RaKI at L-113882 prompted me to look for other places having "[a-z]°[a-z]", and I found 8 lines, of which 5 are typos or print errors

~~232306~~ 212306 . {#daSa°Sata°#} -> , {#daSa°#}, {#Sata°#}
294400 {#nizAda°tva#} -> {#nizAdatva#} ;; print change
565127 SUrpa°RaKI -> SUrpaRaKI
597908 {#sarvaM°yam#} -> {#sarvaM °yam#}
645392 ,%} {#pa°da#} -> %}. {#pada°#}

and the remaining 3 lines are the only 'rare' cases having the ° mark within the string (in the digital text; a few more, i.e. typos, would probably come out if and when a full proofreading takes place to match the file data with the print) [should we make these changes? if so, what's the best way to do so?]

68425 {#A°nipuRe — dEve#} ;; {#A⁅parvaBaNga⁆nipuRe — dEve#}
~~202051~~ 212051 {#tri°jyotizmatI#} ;; {#tri⁅zwub⁆jyotizmatI#}
306317 {#mAMsaM Sva°nipAtitam#} ;; {#mAMsaM Sva⁅daSanAnaNge⁆nipAtitam#}
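A search of this kind can be sketched as below. The pattern is the one quoted above, widened to both letter cases because SLP1 uses uppercase letters as well; that widening is an assumption about how the search was run. `find_inner_degree` is a hypothetical helper, and the sample lines are taken from the lists above.

```python
import re

# '°' with a letter on both sides, i.e. the mark sits inside a string.
INNER_DEGREE = re.compile(r"[A-Za-z]°[A-Za-z]")

def find_inner_degree(lines):
    """Return (line number, line) for every line with '°' between two letters."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if INNER_DEGREE.search(line)
    ]

sample = [
    "{#nizAda°tva#}",    # flagged: '°' between letters
    "{#nizAdatva#}",     # not flagged: no '°'
    "{#sarvaM °yam#}",   # not flagged: space before '°'
    "{#daSa°Sata°#}",    # flagged because of 'a°S'
    "{#pada°#}",         # not flagged: '°' at the end
]
for number, line in find_inner_degree(sample):
    print(number, line)
```

Each hit still needs a manual check against the print, since the pattern cannot tell a typo from a genuine in-string abbreviation.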

Andhrabharati commented 3 months ago

This has been one of the longest sessions so far-- many times going beyond the "subject matter" (due to my 'uncontrolled' way of making corrections!)-- but it has now brought the text into good form.

I would like Jim to think of opening two more issues:

one to tackle the long-pending "resolution" of ls-entities (pending for over three years now, since I had "promised" to give out my results if the corrections were made in the cdsl data as per my proposal); this exercise should now also include making the simplistic ls-tooltips; and
another to "integrate" the vn portion into the main pwk portion, in the same way as done in GRA. [This would eliminate the 'pure index entries' (without any "body matter") in the pwk7 vn, retain the 'proper' vn entries having some objective "body matter", and bring the total vn entry count from the present 22611 close to what SCH mentioned (14450).] I had touched on this point (removing the 'index entries') as point 3b in the very first days after the pwkvn was typed by Thomas and added as a new repo, but it did not get Jim's nod for some reason [point 3a got corrected in the present session!!].

I shall take responsibility for these two tasks. The first does not need much time (and only I can do it, as of now), but the second might take a week or so [Jim could also try it out first, as in GRA, where I then jumped in and we gave the finishing touches jointly].

I look forward to knowing what Jim decides on this.

Andhrabharati commented 3 months ago

Finally, here is my concluding post on this issue--

If I give a brief account of the special markup introduced for the filled-up portions at the HW level, Jim might appreciate my idea and take the necessary further action (as I intended).

While in the vast majority of cases the "padding" is done at the front of the compound word (as ⁅X⁆°Y), in just 91 cases it is done at the end (as X°⁅Y⁆).

I had presumed that we should somehow mark the difference, and thus used the special markers '⁅ ⁆'. The regular '[ ]' could have been used, but as it is already used for other purposes in the print, I thought a separate mark would avoid ambiguity.

Jim is requested to recall his opinion on the topic [note 2 in L-12291.AB.revised_JF.txt, written while I was last working on MD], wrt the status in MW.

Now, what use did I have in mind for this marker in practice?

If (and when) we decide to change the main text (i.e. the header portion) itself to the "padded strings" [as done in the case of MW], the markers will come in handy.
We can programmatically match the strings in the k2-field (with marker) and in the following header portion (without marker) and change the header strings easily, and then remove the markers in the k2-field.
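The matching step might look like the sketch below. It rests on assumptions not settled in this thread: where the marked string has no '°' of its own, the header is taken to show '°' in place of the marked filler (as in the three in-string cases above), and where a '°' is already present the filler is simply absent from the header. All function names are hypothetical.

```python
import re

MARKED = re.compile(r"⁅([^⁆]*)⁆")  # a filled-up portion, e.g. ⁅zwub⁆

def abbreviated(k2):
    """Header-style form of a marked string (assumed rule, see lead-in)."""
    if "°" in k2:
        return MARKED.sub("", k2)   # ⁅X⁆°Y -> °Y
    return MARKED.sub("°", k2)      # tri⁅zwub⁆jyotizmatI -> tri°jyotizmatI

def padded(k2):
    """Fully written-out form: keep the filler, drop markers and '°'."""
    return MARKED.sub(r"\1", k2).replace("°", "")

def strip_markers(k2):
    """k2 after the header has been updated: markers removed, text kept."""
    return MARKED.sub(r"\1", k2)

def pad_header(header, k2):
    """Replace the abbreviated form in the header by the padded form, if found."""
    short = abbreviated(k2)
    if short not in header:
        return header               # no match: leave the header untouched
    return header.replace(short, padded(k2), 1)

print(pad_header("{#tri°jyotizmatI#}", "tri⁅zwub⁆jyotizmatI"))
```

Headers that do not contain the expected abbreviated form are left alone, so mismatches can be listed and reviewed by hand rather than silently changed.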

funderburkjim commented 3 months ago

temp_pw_9c.zip has the few changes mentioned by AB above.

change_9b_9c.txt has the changes.

This version is now installed at Cologne.

Additional revisions of repositories csl-corrections, csl-apidev, hwnorm1 (see commit links above).

The final version changes about 42000 lines out of 764942, or about 5% of lines. There are now about 12000 'alternate' headwords for pw. This work has taken about 6 weeks.

Now closing this issue. Will make a 'placeholder' issue for some additional TODOs.

sanskrit-lexicon / PWK

Alternate headwords for pw #106
