Alternate headwords for pw

funderburkjim commented 4 months ago

We tackle the task of generating alternate headwords for the pw dictionary.

Preliminary outline of the approach:

Filter entries based on the first line of data (the line after the metaline)
Parse the implied headwords (based on the broken bar in that first line of entry)
- recognize also hom and (?) roots
Use this parse to construct k2 of the metaline base. When there is more than one headword, this will result in a comma-separated list in k2
construct parallel list of k1 from the list of k2.
construct pw_hwextra.txt (for csl-orig) from the k1-k2 list
- this will generate essentially duplicate entries in pw.xml for the extra 'alternate' headwords.

Note: no attempt to generate alternate headwords from upasargas of verb entries.
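The parsing and k1/k2 steps of the outline could be sketched roughly as below. This is only an illustration, not the actual script: the function names are hypothetical, and the rule used for k1 (keep only the ASCII letters of k2) is an assumption. Hom. numbers, roots, and bracketed or contracted forms are not handled.

```python
import re

# {#...#} marks Sanskrit (slp1) text in the entry lines.
SANSKRIT_SPAN = re.compile(r"\{#(.*?)#\}")

def headwords_from_bbline(bbline):
    """Return the {#...#} spans that occur before the broken bar."""
    head, bar, _rest = bbline.partition("¦")
    if not bar:  # no broken bar in this line: nothing to parse
        return []
    return SANSKRIT_SPAN.findall(head)

def k1_from_k2(k2):
    """Assumed rule: keep only ASCII letters (drops accents, spaces, '_')."""
    return re.sub(r"[^A-Za-z]", "", k2)

def metaline_fields(bbline):
    """Return (list of k1, comma-separated k2) implied by the first line."""
    k2_list = headwords_from_bbline(bbline)
    k1_list = [k1_from_k2(k2) for k2 in k2_list]
    return k1_list, ", ".join(k2_list)
```

Only text to the left of the broken bar is considered, so forms such as {#mro/cati#} that follow the bar are not picked up as headwords.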

funderburkjim commented 4 months ago

pwkvn_4.CDSL.-prefix.splits.txt

I see that this differs from the current pwkvn portion of temp_pw_4.txt only in that it contains 433 new lines which begin with <div n="p">.
This revision is certainly acceptable. I could post the next version temp_pw_X.txt:

Option 1: with just these div differences.
Option 2: with these div differences and also with changes indicated in files pwkvn_4.differences-1. -2. and -3, before posting temp_pw_4a.txt.

Which option are you expecting for my next posting?
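A count like the 433 new div lines can be reproduced with a plain line diff. The sketch below uses Python's difflib; the function name is hypothetical, and it assumes both versions have already been read into lists of lines without trailing newlines.

```python
import difflib

def count_added_div_lines(old_lines, new_lines, prefix='<div n="p">'):
    """Count lines added in new_lines that begin with the given prefix."""
    added = 0
    for d in difflib.unified_diff(old_lines, new_lines, lineterm="", n=0):
        # Added lines start with '+'; the '+++' header never matches the prefix.
        if d.startswith("+" + prefix):
            added += 1
    return added
```

difflib can be slow on files of this size, so this is more a way of stating the check than a recommendation for the full pw.txt.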

Andhrabharati commented 4 months ago

Option 2.

funderburkjim commented 4 months ago

pwkvn_4_differences_1_notes.txt

@Andhrabharati Please review these 11, where I either have a question, or have deviated from your suggestion.

Andhrabharati commented 4 months ago

pwkvn_4_differences_1_notes.txt

I agree with all the others except these three items below--

20941 aBiDarmamahAviBAzA print change in bb-line?

Yes, as Eitel's Chinese Buddhism is given as the source and it has it thus.

35757 Why 'QuRQikA'? it is feminine of QuRQika

[Both QuRQika and QuRQikA are feminines.]

66824 NOT DONE: gajaSAstra appears in ACC. The pwk2-299-c has thus

And as Oppert Catalogue is given as the source, it has thus

Though ACC does have gajaSAstra, it does not "happen" to be related to Opp. Cat. 1!!

Andhrabharati commented 4 months ago

BTW, I draw your attention to the formatting that I have in my working files (as seen in the snippets above) that hugely helps 'visibly' identifying many markup errors in the data.

This feature of the EmEditor is quite useful, which is not present in the other editors (Notepad++ and Textpad) that I use for other purposes (where they outbeat the EmEditor).

funderburkjim commented 4 months ago

gajavEdyaSAstra

Your explanation is convincing. Excellent that you have such a comprehensive collection of resources to apply to such questions. Similarly your Sanskrit-Chinese dictionary resource.

Both QuRQika and QuRQikA are feminines.

Surely QuRQika is not feminine! I raised this question because it seems that inclusion of QuRQikA in k2 field is inconsistent with your exclusion of feminine forms in other cases (such as tripurARikA). While I went ahead with your exclusion in k2 of such feminine forms as tripurARikA, I think the 'rule' regarding inclusion/exclusion in k2 of such feminine forms is unclear, and its justification remains dubious. However, within the scope of our current objectives, this issue is relatively minor, so I'm not going to worry about it further.

Andhrabharati commented 4 months ago

Surely QuRQika is not feminine!

I do agree that it is not feminine, as per Skt. Grammar.

But here, pwk is citing Hemacandra's Prakrit Grammar, which does deviate from Skt. Grammar in many points!!

And the pwk entry (in expanded form) itself clearly gave the two forms--

ढुण्ढिक //in व्याकरणढुण्ढिक (probably read °ढुण्ढिका)// and ढुण्ढिका f. //in हैमप्राकृतढुण्ढिका.//

[Let me see, if I can locate this in the Pischel's edition (1877)]

I think the 'rule' regarding inclusion/exclusion in k2 of such feminine forms is unclear

I have opted to ignore the feminine forms (full or contracted) inside braces after adjectives, but not if they occur preceded by und/oder [eng. and/or] in the main text itself.

Andhrabharati commented 4 months ago

But here, pwk is citing Hemacandra's Prakrit Grammar, which does deviate from Skt. Grammar in many points!!

And the pwk entry (in expanded form) itself clearly gave the two forms--

ढुण्ढिक //in व्याकरणढुण्ढिक (probably read °ढुण्ढिका)// and ढुण्ढिका f. //in हैमप्राकृतढुण्ढिका.//

[Let me see, if I can locate this in the Pischel's edition (1877)]

After seeing that Pischel's edition (Hemacandra's Grammatik der Prâkritsprachen) has NO mention of this, I've looked at ACC, which has led me to Bühler's Report (1877).

And, it has two entries, as under--

As Boethlingk has mentioned (in pwk), this could be a print error in the first entry in Bühler's Report!! Wonder why Boethlingk chose not to mention the 'source' at this entry.

[This has nothing to do with Prakrit Grammar (as I thought earlier); and is just a Skt. name of a work.]

funderburkjim commented 4 months ago

Option 2 work done. Result is temp_pw_5.zip

Cf. pw_5_work/readme.txt starting at Constructed from ../temp_pw_4a.txt in 3 steps.

Also cf. change_4a_5.txt.

Andhrabharati commented 4 months ago

AB missed these
<L>204132<pc>3-256-b<k1>i<k2>3. i ?
<L>205418<pc>4-296-b<k1>i<k2>3. i ?
<L>206831<pc>5-249-c<k1>i<k2>3. i ?
<L>208655<pc>6-298-b<k1>i<k2>3. i ?
<L>214506<pc>7-320-d<k1>i<k2>3. i ?

Yes, I did miss these; though marked the same at L-201625!

Note: <L>222590<pc>7-389-d<k1>riktI Question !√{#riktI#}. cf. <L>94190<pc>5-189-b<k1>riktI

Yes, this is a wrong markup; and there is another such one in pwkvn portion, L-207311. Both these should be without √.

Andhrabharati commented 4 months ago

Here is the last set of differences in pwkvn portion--

pwkvn_5 differences.txt

Andhrabharati commented 4 months ago

Now coming to the pw-main portion.

Noted a total of 983 difference lines between CDSL and AB files.

Andhrabharati commented 4 months ago

Here is the pw5 diff. file--

pw_5 differences latest.txt

This time, I thought of giving the whole diff. file [without splitting into (3) different portions as done above in case of pw_4].

And hope, Jim would face no issues in "using" the same.

Andhrabharati commented 4 months ago

And request Jim to post the FULL file after necessary corrections are done in the CDSL file, for me to check again.

Andhrabharati commented 4 months ago

Noted a total of 983 difference lines between CDSL and AB files.

Here is the pw5 diff. file--

pw_5 differences latest.txt

Identified that due to some error in "filtering", about 500+ lines got skipped from being listed as difference lines.

To avoid confusion, taking that Jim might've already started looking at my pw_5.differences file, I am not posting a new file.

Many of these differences are in the header lines following the listed corrections in the metalines [mostly placement of the '¦']; and I hope Jim will identify those and make the required corrections.

Other differences can be "handled" once Jim finishes working with the present pw_5.differences file.

funderburkjim commented 4 months ago

mostly placement of the '¦'

OK, I'll examine the '¦' placement while going through the pw_5_diff file.

funderburkjim commented 3 months ago

pw_6

temp_pw_6.zip the revised pw.txt, version 6

Refer:

change_5_6.txt the changes from version 5
pw_6_work/readme.txt

pw5 differences

AB's pw_5.differences.latest.txt file had 3 kinds of 'records'

metabb (119) change to both metaline and broken-bar line
meta (520) change to metaline only
- I examined the bbline
- 309 matches for "change to bbline yes" changes made to bbline
- most of these involve placement of ¦ in the bbline. AB may disagree with some.
other (225) no change to metaline
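The three kinds of records can be told apart mechanically. The sketch below assumes a hypothetical representation in which each record is a pair of entries (CDSL and AB), each entry being a list of lines whose first element is the metaline and whose second is the broken-bar line; the actual layout of AB's differences file is not reproduced here.

```python
from collections import Counter

def classify_record(cdsl_entry, ab_entry):
    """Return 'metabb', 'meta' or 'other' for one difference record."""
    meta_changed = cdsl_entry[0] != ab_entry[0]
    bb_changed = cdsl_entry[1] != ab_entry[1]
    if meta_changed and bb_changed:
        return "metabb"   # change to both metaline and broken-bar line
    if meta_changed:
        return "meta"     # change to metaline only
    return "other"        # no change to metaline

def tally(records):
    """Count records of each kind; records is an iterable of (cdsl, ab) pairs."""
    return Counter(classify_record(c, a) for c, a in records)
```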

In pw_6_work/readme.txt, roughly 90 cases noted where Jim disagrees with AB. There may be a few more such instances that are not noted in the readme.

Is the finish line of this marathon in sight?

Andhrabharati commented 3 months ago

Is the finish line of this marathon in sight?

Certainly yes, I think another three (or four) rounds would finish this (other than expanding the ~1000 Chr. groups); one: resolving the present 90 differences by AB, two: new differences to be posted by AB as mentioned above, three: Jim's adoption of the second one and four: resolving the final differences (if any) from the third one by AB.

Andhrabharati commented 3 months ago

But I can take up the first task only after two days, as I am away from my data and computer now.

Jim may look at other issues in these two days, if he doesn't mind taking a short break from this task/issue.

funderburkjim commented 3 months ago

Sure. A break is good. Plenty of other things to do.

Andhrabharati commented 3 months ago

<L>124654<pc>7-124-c<k1>sAvairisole<k2>sAva_irisole
sAva_irisole -> sAvairisole slp1 does not need _ for hiatus
Note: mw: sAvaisirole MW print change?

;; AB note: MW has erred!!

Andhrabharati commented 3 months ago

Now, coming to the hiatus in slp1.

In normal conditions, it may not be required, as Jim mentioned. But, when people like AB come-in (a rare case indeed), to do a full reading (in a different transliteration, say iast or devanagari) things take a different aspect, when it is converted back to slp1.
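A toy round trip shows the effect. The mapping below is a deliberately tiny subset of slp1/iast (only the vowels needed here) and the function names are hypothetical; it is not the transcoder actually used for CDSL data.

```python
# slp1 writes the diphthongs as single letters (E, O); iast writes them ai, au.
S2I = {"A": "ā", "I": "ī", "U": "ū", "E": "ai", "O": "au"}
I2S = {v: k for k, v in S2I.items()}

def slp1_to_iast(s):
    return "".join(S2I.get(c, c) for c in s)

def iast_to_slp1(s):
    out, i = [], 0
    while i < len(s):
        if s[i:i + 2] in I2S:          # digraphs 'ai', 'au' are tried first
            out.append(I2S[s[i:i + 2]])
            i += 2
        else:
            out.append(I2S.get(s[i], s[i]))
            i += 1
    return "".join(out)

without_mark = iast_to_slp1(slp1_to_iast("sAvairisole"))   # hiatus a+i unmarked
with_mark = iast_to_slp1(slp1_to_iast("sAva_irisole"))     # hiatus marked with _
```

Here without_mark comes back as sAvErisole (the a+i has turned into the diphthong), while with_mark comes back unchanged as sAva_irisole.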

Andhrabharati commented 3 months ago

Here are the ~90 entries with AB comments--

readme (AB remarks).txt

A major portion of these [70+] are such basic errors (which were left in AB's file) that Jim wondered at one place-- "odd for AB"!

The reason is AB was working with an intention of taking up a full proofing of pwk data (which is recommended by Thomas also) once this phase is over, and did a mechanical work mostly in "padding" the alt. HWs (but not looking at the actual contracted forms having errors).

funderburkjim commented 3 months ago

temp_pw_6a.zip

Based on review of readme.AB.remarks.txt. Work is in pw_6a_work.

Only 7 lines of temp_pw_6a.txt differ from temp_pw_6.txt.

@Andhrabharati Are your next steps:

revise your version based on comparisons with temp_pw_6a
Do a diff between your revised version and temp_pw_6a

Andhrabharati commented 3 months ago

Work is in pw_6a_work.

NOTES to AB ---
;; And do we revert the few places where the sequence is recently changed to put the items in order?
JIM: NO, do not make print change to restore alph. order ---

;; AB note: I was referring here to reverting of an entry in L-205237 (aBiDarma°mahAviBAzA) earlier relocated to be as per Eitel's (and alphabetical) order.

<L>124385<pc>7-121-b<k1>sAradIya {#nAmamAlA#} -> {#°nAmamAlA#} in bbline (the ° is needed, but missing in print ---

;; AB note: It is not '°', but '_' that is needed here!
<L>124385<pc>7-121-b<k1>sAradIya<k2>sAradIya_nAmamAlA, (SAradIya_nAmamAlA)

Buhler (780) has it as

<L>124169<pc>7-118-a<k1>sAmAnyABAvagranTa<k2>sAmAnyABAvagranTa, ⁅#sAmAnyA⁆°BAvawippanI, ⁅#sAmAnyA⁆°BAvarahasya
jim: The '#' is not needed in metaline

;; AB note: I had missed the # earlier and corrected now.

Andhrabharati commented 3 months ago

@Andhrabharati Are your next steps:

* revise your version based on comparisons with temp_pw_6a

* Do a diff between your revised version and  temp_pw_6a

@funderburkjim

Noted that you have just done the 'expansion' of the Chr. entities [though you mentioned many other things 'to do', your mind seems not to have left pwk yet!!]; I would suggest that you integrate (and post) it, so that I can compare and we can continue from there (in a step-by-step manner).

Andhrabharati commented 3 months ago

<L>63546<pc>4-034-c<k1>parAmarSavAda<k2>parAmarSavAda, parAmarSavAdahetuvicAra, parAmarSavAdArTa

AB: Probably we could use ⁅parAmarSa⁆°hetutAvicAra, and mark it as a print error.
JIM: Don't make print change now. If OPP. CAT. 1. provides proof, then make a change

Oppert catalogue (south India) doesn't contain this item at all. It is in the Oudh manuscripts catalogue (north-west India), as mentioned in ACC.

And pwk clearly showed the Opp. cat. (as the source) for the two entities on either side, but not for this!!

Andhrabharati commented 3 months ago

Here is another (hopefully the final) set of differences in pwkvn portion-- diff_pwkvn_6a.txt

Andhrabharati commented 3 months ago

A new entry (in pwk main) that got merged into the prev. entry--

439521 (CDSL): <div n="p">— Mit {#vi#} {%zerfallen, zerbröckeln%}. {#mruc#} {#mro/cati#} {#gatyarTa#}.

(AB): <div n="p">— Mit {#vi#} {%zerfallen, zerbröckeln%}.
(AB): <LEND>
(AB):
(AB): <L>89274.1<pc>5-113-a<k1>mruc<k2>mruc
(AB): √{#mruc#}¦, {#mro/cati#} ({#gatyarTa#}).

This changes the no. of lines.

Andhrabharati commented 3 months ago

Here are the first two parts from the differences in pwk main-- diff_pwk v6a (metalines).txt

diff_pwk v6a (addl. upasarga split lines).txt This changes the no. of lines.

Andhrabharati commented 3 months ago

I would like to wait for the implementation of the above posted differences and also integrating the Chr. expansions by Jim into pwk_v6a, to post other differences (~1500 incl. the Chr. expansions, which would be reduced to about ~500 with Chr. expansion integration by Jim) from my file.

Andhrabharati commented 3 months ago

I would also like to suggest removing a few vestigial lines, as we've reached a stage to get rid of them now--

extra blank lines before L-16301, L-55951, L-69765, L-69947, L-73145 (4 each)
extra blank line before <LEND> at L-29488, L-29489 (1 each)
line no.s 2 and 3 (at the beginning)

funderburkjim commented 3 months ago

versions 7a and 7b

temp_pw_7a.zip Aims to handle all AB suggested changes since version 6a.
temp_pw_7b.zip applies the 'Chr.' ls expansions to 7a.

Notes are in usual place: pw_7_work.

gasyoun commented 3 months ago

To watch this duo is mesmerising. @Andhrabharati @funderburkjim https://youtu.be/AEaA7hhGjCI

Andhrabharati commented 3 months ago

Now, we both have the pwkvn portion almost the same.

Started looking at the pwk_main differences without any filters.

Here are some corrections in the cdsl metalines, with Jim's principle of 'not having' the characters ⁆ and ° (at filling)--

L-2553 (aDyA°lohaka/rRa) -> (aDyAlohaka/rRa)
L-2697 (aDaHprAkSA°yin) -> (aDaHprAkSAyin)
L-5734 anuva°rtita/r⁆ -> anuvartita/r
L-9317 (arGya°pAtra) -> (arGyapAtra)
L-17001 (indu°puzpikA) -> (indupuzpikA)
L-28460 °kuRqamaRqapasaMgraha -> kuRqamaRqapasaMgraha
L-30793 (kISa°romA) -> (kISaromA)
L-36650 AcAryaBadanta°gopadatta -> AcAryaBadantagopadatta
L-37116 gOtamI°nandana⁆ -> gOtamInandana
L-40099 citraSAkApUpaBakza°vikArakriyA -> citraSAkApUpaBakzavikArakriyA
L-43613 (jyO°tsnI) -> (jyOtsnI)
L-50701 dIrGa°varCikA -> dIrGavarCikA
L-53543 (drO°Rakajihvi) -> (drORakajihvi)
L-64055 °pariBAzopaskAra -> pariBAzopaskAra
L-64848 (paryu°zaRAzwAhnikA) -> (paryuzaRAzwAhnikA)
L-70023 pOtraM°jIvika⁆ -> pOtraMjIvika
L-70112 (pOrRamAsyADi°karaRa) -> (pOrRamAsyADikaraRa)
L-75465 °PalguRI -> PalguRI
L-75658 (badarI°vanamAhAtmya) -> (badarIvanamAhAtmya)
L-76593 °bahvfcabrAhmaRapaYcikABAzya -> bahvfcabrAhmaRapaYcikABAzya
L-82705 maDyAntaviBAga°SAstra -> maDyAntaviBAgaSAstra
L-89573 yajYaM_vo°Qave⁆ -> yajYaM_voQave
L-95910 (lavana°sADikA) -> (lavanasADikA)
L-97438 vajracCedikA_pra°jYApAramitA -> vajracCedikA_prajYApAramitA
L-113109 (SE°rzAyaRa) -> (SErzAyaRa)
L-124366 cintAmaRiH_sA°raRikA -> cintAmaRiH_sAraRikA
L-125542 (sImAnta°dfSvan) -> (sImAntadfSvan)
L-132776 su/a°vas, su/a°vaMs -> su/avas, su/avaMs
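All of these corrections have the same shape, so they could be found (or applied) mechanically. A minimal sketch, with hypothetical function names; including the opening bracket ⁅ alongside ⁆ and ° is an assumption on my part:

```python
import re

# Characters that, per the principle above, should not remain in a filled k2.
FILL_MARKS = re.compile("[⁅⁆°]")

def strip_fill_marks(k2):
    """Remove the contraction marks from a k2 value, leaving all else as is."""
    return FILL_MARKS.sub("", k2)

def metalines_with_fill_marks(lines):
    """Yield (line number, line) for metalines still containing such marks."""
    for n, line in enumerate(lines, start=1):
        if line.startswith("<L>") and FILL_MARKS.search(line):
            yield n, line
```

The parentheses, accents and '_' are deliberately left untouched, matching the right-hand sides in the list.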

Also noted that while AB version consistently has the (...) entities in the metalines [almost 800 no.s] as per the print, Jim has just about 110 (...) entities.

Andhrabharati commented 3 months ago

Here are the complete 'unfiltered' difference files in the pwk_main portion--

diff pwk7b (metalines).txt [count: 684]

diff pwk7b (non-metalines).txt [count: 1466]

Now, we both have the pwkvn portion almost the same.

And, here are the complete 'unfiltered' difference files in the pwkvn portion--

diff pwkvn (metalines).txt [count: 26]

funderburkjim commented 3 months ago

versions 8 and 8a

Work is in pw_8_work directory.

temp_pw_8.zip all changes from AB's 3 files except those in diff_multiline1.txt
- change_7b_8.txt has the changes from previous 7b version.
temp_pw_8a.zip includes the additional diff_multiline1 changes.

Request @Andhrabharati to apply the changes in the two BEGIN Jim disagrees with AB for sections of the pw_8_work/readme.txt file. If you accept these, then I expect the 8a version will agree with your version.

Andhrabharati commented 3 months ago

@funderburkjim

Would you pl. have a look at this post, while I look at the pw_8a file?

Andhrabharati commented 3 months ago

Here are the 3 places where AB likes to debate with Jim's opinion.

AB: <L>124385<pc>7-121-b<k1>sAradIyanAmamAlA<k2>sAradIyanAmamAlA, (SAradIyanAmamAlA)
Jim: <L>124385<pc>7-121-b<k1>sAradIya<k2>sAradIya, (SAradIya), SAradIyanAmamAlA
Note: Also bbline change {#nAmamAlA#} -> {#°nAmamAlA#} PRINT CHANGE

;; AB remark: There is no "sAradIya" word that occurs in the literature; the word "SAradIya" has already been mentioned at L-111682, and as such there is no need to repeat the same again here.
;; AB remark: It is clearly the suggestion of Boethlingk to consider SAradIya for sAradIya (as a print error in BÜHLER Report) in sAradIyanAmamAlA which is a single word.
;; AB remark: "Adj. (f. {#A#})" is deleted here in this session, as it appeared redundant.
;; AB remark: And there is no need to put the ° mark, taking the entry as a single word.
----------------------------

{#aBizwipA/si#}¦ <ls>ṚV. 2,20,2</ls> nach <ls>GRASSMANN.</ls> für {#aBi/zwI pAsi#}.
aBizwipA/si -> aBizwipA/(si) by print

;; AB remark: With the adopted norm that the entities having the in-text (...) and [...] be expanded with and without the brackets, this should've been made as an alt. HW group [aBizwipA/, aBizwipA/si].
;; AB remark: However, this seems not the intent here; either it should go as just the "aBizwipA/" as taken by MW, or as "aBizwipA/si" as seen in the ṚV. citation and 'matching' with the GRA emendation.
;; AB remark: In either case, this would go as a "print change".
----------------------------

{#ISvarItantra#} <lex>n.</lex> und {#ISvare (<ab>Loc.</ab>) nityasuKAvasTApanam#}¦ Titel von Werken.
{#ISvare (Loc.) nityasuKAvasTApanam#} -> {#ISvare#} (Loc.) {#nityasuKAvasTApanam#}

;; AB remark: I had felt that the two words (forming the name of the work) need not be separated as individual words, and as such marked thus.

Andhrabharati commented 3 months ago

Now, about the slp1 hiatus places-- If Jim feels no need for these, in spite of my above post, I have no issues in having the hiatus removed at such places.

BTW, there is another place where it is not required, "it does exist" [at L-2991]!

Andhrabharati commented 3 months ago

My present version data has additional differences [in non-metalines] in pwk_main (few: ~150) and pwkvn (lot many: ~15k) portions; but the comparison could probably be stopped here.

Andhrabharati commented 3 months ago

On a 2nd thought, I have 'modified' both CDSL and AB files a bit; now, the difference line count is just over 700.

And, here are the modified files-- pw (CDSL) 8a.zip [This has few blank lines inserted]

pw integrated (AB) v1 (for CDSL).zip [This now has pwk main and vn portions integrated]

funderburkjim commented 3 months ago

I am unclear on your pw(CDSL)8a version
- how is it related to the temp_pw_8a version that I uploaded?
- What use should I make of it?
I suspect the pw.integrated version is your latest candidate for final version. Right?
- how different from your pw(CDSL)8a version ?

Each of these versions has 764942 lines, and my uploaded temp_pw_8a.txt has 764934 lines -- where do the extra 8 lines in your versions come from?

Also, in the vn section of both your versions, you omit the <info n="sup_X"/> field. This is needed for the displays to show the [supplement volume X] note .

Andhrabharati commented 3 months ago

I am unclear on your pw(CDSL)8a version

how is it related to the temp_pw_8a version that I uploaded?

What use should I make of it?

Yes, it is the same file with some changes done inside. It can be used to get the diff.s wrt the pw.integrated version; of course your original temp_8a file could also be used, but it will give more (500+) differences.

I suspect the pw.integrated version is your latest candidate for final version. Right?

how different from your pw(CDSL)8a version ?

Yes, for time being. [And I thought of not doing any more 'independent' updates in it from my side.] As mentioned above, it has some 700 differences wrt the pw(CDSL)8a

Each of these versions has 764942 lines, and my uploaded temp_pw_8a.txt has 764934 lines -- where do the extra 8 lines in your versions come from?

I have added extra blank lines after the <H> lines as were at the earlier versions of the pwkvn file, that were removed in your recent file(s).
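The 8 extra lines (764942 - 764934) could be located with a check along these lines. It is only a sketch: the function name is hypothetical, and it assumes the file has been read into a list of lines and that the <H> lines begin with the literal string <H>.

```python
def blank_lines_after_h(lines):
    """Return 1-based line numbers of blank lines directly after an <H> line."""
    found = []
    for i in range(len(lines) - 1):
        if lines[i].startswith("<H>") and lines[i + 1].strip() == "":
            found.append(i + 2)  # i is 0-based; the blank line is the next one
    return found
```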

Also, in the vn section of both your versions, you omit the <info n="sup_X"/> field. This is needed for the displays to show the [supplement volume X] note .

Do you want me to upload the files with the info tags retained as is?

funderburkjim commented 3 months ago

Do you want me to upload the files with the info tags retained as is?

If you can do that readily, then yes. Otherwise I can find a way to do it.

Andhrabharati commented 3 months ago

They are not immediately available; I need to spend a little time to make them.

Probably, it might be better if you do it yourself.

funderburkjim commented 3 months ago

Re 'L=124384' -- in your files, you have {#°nAmamAlA#} but you mention there is no need to put the ° mark.

re {#ISvare (<ab>Loc.</ab>) nityasuKAvasTApanam#}

This is not proper -- since <ab>Loc.</ab> is not Sanskrit. -- The abbreviation needs to be outside of {#...#}
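This kind of problem can be searched for across the whole file. A minimal sketch (hypothetical function name), which treats any '<' inside a {#...#} span as a sign of embedded markup:

```python
import re

SANSKRIT_SPAN = re.compile(r"\{#(.*?)#\}")

def spans_with_markup(line):
    """Return the {#...#} spans of a line that contain a '<' character."""
    return [m.group(0) for m in SANSKRIT_SPAN.finditer(line) if "<" in m.group(1)]

bad = "{#ISvarItantra#} <lex>n.</lex> und {#ISvare (<ab>Loc.</ab>) nityasuKAvasTApanam#}¦ Titel von Werken."
good = "{#ISvarItantra#} <lex>n.</lex> und {#ISvare#} (<ab>Loc.</ab>) {#nityasuKAvasTApanam#}¦ Titel von Werken."
```

spans_with_markup(bad) returns the one offending span, and spans_with_markup(good) returns an empty list.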

funderburkjim commented 3 months ago

be better if you do it yourself.

OK, I'll do that.

Andhrabharati commented 3 months ago

Re 'L=124384' -- in your files, you have {#°nAmamAlA#} but you mention there is no need to put the ° mark.

My mistake; initially I had reverted my file line as in yours; but later posted the comments, but not corrected in my file accordingly.

This is how I wanted it to be-- <L>124385<pc>7-121-b<k1>sAradIyanAmamAlA<k2>⁅sAradIya⁆nAmamAlA, (⁅SAradIya⁆nAmamAlA) {#sAradIyanAmamAlA#} (besser {#SA°#})¦ <lex>f.</lex> Titel eines Werkes <ls>BÜHLER, Rep. No. 780</ls>. <LEND>

re {#ISvare (<ab>Loc.</ab>) nityasuKAvasTApanam#}

This is not proper -- since <ab>Loc.</ab> is not Sanskrit. -- The abbreviation needs to be outside of {#...#}

So, do we go with the two words separately marked as {#ISvare#} (Loc.) {#nityasuKAvasTApanam#}? [This is not a big point for me to debate upon.]

funderburkjim commented 3 months ago

do we go with the two words separately marked as {#ISvare#} (Loc.) {#nityasuKAvasTApanam#}?

Yes - I can't think of a better solution at the moment.

I've found the extra lines.

That's all my questions for now -- will proceed with analysis/implementation of your changes.
