Dictionaries with prefixed forms under the root headword

funderburkjim commented 9 years ago

This issue was prompted by an observation of @zaaf2 in #150:

OBS: the entries of verbal roots in PWG have many subentries containing the various verbs derived with prefixes. I think it is indispensable to make such prefix-root-verbs searchable headwords, as in all the other dictionaries.

Some first thoughts:

This would be an excellent enhancement to PWG
There are other dictionaries which follow the convention of PWG, for instance WIL. I am not sure of exactly which dictionaries follow which conventions.
A similar issue relates to compound words. For instance, STC (and other dictionaries, which?) under 'DUma' puts many compound forms (DUma-ketana, etc.) whereas MW (and other dictionaries, which?) have the compounds as separate headwords.
Contemplation of solving this problem from the point of view of an idiot (which a computer program is, alas) leads me to believe that the problem is computationally difficult. Not impossible, but substantial.

If someone wants to take on this problem for one of the dictionaries, we can put our heads together to develop practical programming tactics.

funderburkjim commented 9 years ago

I thought it would be good to add the label 'research' to this issue. Although I looked at this github documentation, I wasn't able to create a new label 'research' for this repository. Maybe someone who knows how to create a new label, and who thinks 'research' would be a useful label, could do this.

gasyoun commented 9 years ago

@funderburkjim out of hundreds of issues, I would say this is number one. I will document which dictionaries use which conventions. That is itself a huge task. I can say that KCH, which is not at Cologne, follows the same convention as MW. Conversely, PWG and PWK use the "nest" style.

A similar issue relates to compound words

Indeed.

@drdhaval2785 is the only other person around, but I wish he would continue work on dictionary conventions: https://github.com/sanskrit-lexicon/CORRECTIONS/issues/43 So maybe it's just time for @parjanya to show his Linux skills.

drdhaval2785 commented 8 years ago

Now we have a 'Research' label. @funderburkjim If you want to create a new label, click here: [screenshot]

And then: [screenshot]

funderburkjim commented 8 years ago

@drdhaval2785 Re adding new label. Got it. I see where to click now. Thanks for the tip.

drdhaval2785 commented 8 years ago

@funderburkjim and @gasyoun A modest beginning made in https://github.com/sanskrit-lexicon/alternateheadwords/tree/master/data/PWG.

#Step 4. Analysing ehw2.txt for correction codes.
Total 0 entries with code 0
Total 4607 entries with code 1
Total 0 entries with code 2
Total 3 entries with code 3
Total 50 entries with code 4
Total 6 entries with code 5
Total 69 entries with code 8
Total 2456 entries with code 9
Total 305 entries with code 10
Total 0 entries with code 99

So a total of 4607 entries (code 1) already match a known headword in sanhw1.txt.
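A tally like the one above can be reproduced with a few lines of Python. This is only a sketch: the field names (`root`, `prefix`, `combined`, `line`, `code`) are assumptions inferred from the `root@prefix@combined@line@code` records quoted in this thread, and the only code whose meaning is stated here is code 1.

```python
from collections import Counter
from typing import NamedTuple


class Record(NamedTuple):
    root: str      # root headword, e.g. "han" (assumed field meaning)
    prefix: str    # prefix as found in the subentry, e.g. "anu"
    combined: str  # proposed prefixed headword, e.g. "anuhan"
    line: int      # line number, taken here to refer to pwg.txt
    code: int      # correction code


def parse_record(text: str) -> Record:
    """Split one 'root@prefix@combined@line@code' record into typed fields."""
    root, prefix, combined, line, code = text.strip().split("@")
    return Record(root, prefix, combined, int(line), int(code))


def tally_codes(lines):
    """Count how many records carry each correction code, skipping blank lines."""
    return Counter(parse_record(l).code for l in lines if l.strip())


# Sample records copied from this thread.
sample = [
    "han@anu@anuhan@264802@1",
    "han@antar@hantar@264804@2",
    "han@upod@upodhan@264808@9",
    "aNg@(pari)@(pari)aNg@1375@10",
]
for code, count in sorted(tally_codes(sample).items()):
    print(f"Total {count} entries with code {code}")
```

To run it on a real file, pass the open file object (for example ehw2.txt) to `tally_codes` instead of `sample`.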

gasyoun commented 8 years ago

Thank you.

han@antar@hantar@264804@2

antar

hantar? Was it not an upasarga?

gasyoun commented 8 years ago

Let's review one entry, han.

As per now you have 4 upasargas:

han@anu@anuhan@264802@1
han@antar@hantar@264804@1
han@A@Ahan@264806@1
han@upod@upodhan@264808@9

There is, for example, no ati in your list. But even if there were, there is an issue. The entry reads "— ati, partic. °hata", which means (because of °hata) that the form can be atihata, but not as in the case of apajahi. Now I understand: you have taken L=122726 instead of L=115989, that is, the supplement instead of the main entry. I remember that there was a list of pages which are the supplements. They might have additional upasargas, but mostly additional meanings.

gasyoun commented 8 years ago

aNg@(pari)@(pari)aNg@1375@10

(pari) -> pari

— pali (pari) caus. herumgehen lassen

gasyoun commented 8 years ago

pU@ati@atipU@100430@1
pU@anu@anupU@100432@1
pU@aBi@aBipU@100434@1
pU@A@ApU@100436@1
pU@samA@samApU@100438@1
pU@ud@utpU@100440@1
pU@ni@nipU@100442@9
pU@nis@nizpU@100444@1
pU@pratinis@pratinizpU@100446@1
pU@parA@parApU@100448@1
pU@pari@paripU@100450@1
pU@vi@vipU@100452@1
pU@sam@saMpU@100454@1
pU@aBisam@aBisaMpU@100456@1

Can we code that if ', partic.' comes right next to an upasarga, we note down that this form is only possible in participles? Not sure about L=78836, but I guess "mit pra vgl. prapavaṇa fg." means that pra is attested as well, which means that supplements might indeed add upasargas.
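A rough sketch of the check asked for here, in Python. The subentry shape (a dash, the prefix, then optionally ", partic." directly after it) is an assumption based on the single example "— ati, partic. °hata" quoted in this thread, not on the actual pwg.txt markup, and the names `SUBENTRY` and `prefix_and_partic` are made up for illustration.

```python
import re

# Assumed shape of a subentry opening: dash, prefix, optional ", partic."
# immediately after the prefix. Inferred from one quoted example only.
SUBENTRY = re.compile(r"^—\s*(?P<prefix>[^\s,]+)(?P<partic>,\s*partic\.)?")


def prefix_and_partic(subentry: str):
    """Return (prefix, participle_only), or None if the line has another shape."""
    m = SUBENTRY.match(subentry.strip())
    if m is None:
        return None
    # participle_only is True only when ", partic." directly follows the prefix.
    return m.group("prefix"), m.group("partic") is not None


print(prefix_and_partic("— ati, partic. °hata"))
print(prefix_and_partic("— pali (pari) caus. herumgehen lassen"))
```

The flag could then be stored as an extra field next to the correction code, so that a form like atihan is marked as attested only in participles.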

gasyoun commented 8 years ago

jYA@ati@atijYA@159936@9
jYA@anu@anujYA@159938@1
jYA@pratyaByanu@pratyaByanujYA@159940@1
jYA@apa@apajYA@159942@1
jYA@aBi@aBijYA@159944@1
jYA@pratyaBi@pratyaBijYA@159946@1
jYA@ava@avajYA@159948@1
jYA@mAvajYa@mAvajYajYA@159948@10
jYA@mAvajAnIhi@mAvajAnIhijYA@159948@10
jYA@A@AjYA@159950@1
jYA@upa@upajYA@159952@1
jYA@nis@nirjYA@159954@9
jYA@nirjYAtamadgati@nirjYAtamadgati@159954@8
jYA@pari@parijYA@159956@1
jYA@pratipra@pratiprajYA@159958@1
jYA@a\nye@a\nyejYA@159958@10
jYA@vA\@vA\jYA@159958@10
jYA@vE@vEjYA@159958@10
jYA@ni\Dimagu^ptaM@ni\Dimagu^ptaMjYA@159958@10
jYA@vi\ndanti\@vi\ndanti\jYA@159958@10
jYA@na@jYAna@159958@1
jYA@vA\@vA\jYA@159958@10
jYA@prati\@prati\jYA@159958@10
jYA@prajA^nanti@prajA^nantijYA@159958@10
jYA@prati@pratijYA@159960@1
jYA@vi@vijYA@159962@1
jYA@prativi@prativijYA@159964@1
jYA@saMvi@saMvijYA@159966@1
jYA@jYAta@jYAta@159966@1
jYA@sam@saMjYA@159968@1

After a manual check I think anu is the first; where is ati from? I did not locate it. After anu comes abhyanu, but it is not in Dhaval's list. Next comes pratyabhyanu, both in the printed book and in Dhaval's extraction. "sam caus." means sam is attested only in the caus. Can we extract these caus., desid.?

drdhaval2785 commented 8 years ago

hantar? Was it not an upasarga?

It still isn't. I don't know any book which says antar is an upasarga.

gasyoun commented 8 years ago

I don't know any book which says antar is an upasarga.

Ha, ok, I agree, but some parts of speech tend to acquire such functions over time and are used in much the same way.

drdhaval2785 commented 8 years ago

@gasyoun A friendly piece of advice: use ehw3.txt and not 2.txt. Some further refinements for sandhi cases are made in 3.txt.

drdhaval2785 commented 8 years ago

Let's review one entry, han. As per now you have 4 upasargas:

You seem to have half-read it.

There are two blobs of 'han'.

Main entry

han@ati@atihan@250030@1
han@vyati@vyatihan@250032@1
han@anu@anuhan@250034@1
han@antar@antarhan@250036@1
han@apa@apahan@250038@1
han@vyapa@vyapahan@250040@1
han@api@apihan@250042@1
han@aBi@aBihan@250044@1
han@ava@avahan@250046@1
han@aDyava@aDyavahan@250048@1
han@anvava@anvavahan@250050@1
han@pratyava@pratyavahan@250052@1
han@A@Ahan@250054@1
han@apA@apAhan@250056@1
han@aByA@aByAhan@250058@1
han@udA@udAhan@250060@1
han@upA@upAhan@250062@1
han@pratyA@pratyAhan@250064@1
han@vyA@vyAhan@250066@1
han@prativyA@prativyAhan@250068@9
han@samA@samAhan@250070@1
han@ud@udhan@250072@9
han@upod@upodhan@250074@9
han@samud@samudhan@250076@9
han@upa@upahan@250078@1
han@samupa@samupahan@250080@9
han@ni@nihan@250082@1
han@aBini@aBinihan@250084@1
han@upani@upanihan@250086@1
han@pariRi@pariRihan@250088@1
han@praRi@praRihan@250090@1
han@pratini@pratinihan@250092@1
han@vini@vinihan@250094@1
han@saMni@saMnihan@250096@1
han@nis@nirhan@250098@1
han@atinis@atinirhan@250100@1
han@aDinis@aDinirhan@250102@1
han@parinis@parinirhan@250104@1
han@vinis@vinirhan@250106@9
han@parA@parAhan@250108@1
han@pari@parihan@250110@1
han@aBipari@aBiparihan@250112@1
han@pra@prahaR@250114@1
han@aBipra@aBiprahaR@250116@1
han@nipra@niprahaR@250118@1
han@vipra@viprahaR@250120@9
han@prati@pratihan@250122@1
han@saMprati@saMpratihan@250124@9
han@vi@vihan@250126@1
han@anuvi@anuvihan@250128@1
han@Avi@Avihan@250130@1
han@pravi@pravihaR@250132@9
han@prativi@prativihan@250134@9
han@sam@saMhan@250136@1
han@aBisam@aBisaMhan@250138@1
han@pratisam@pratisaMhan@250140@9
han@visam@visaMhan@250142@9

Supplement entry

han@anu@anuhan@264802@1
han@antar@antarhan@264804@1
han@A@Ahan@264806@1
han@upod@upodhan@264808@9

gasyoun commented 8 years ago

Use ehw3.txt and not 2.txt

Oh, understood. The main issues still remain.

gasyoun commented 8 years ago

There are two blobs of 'han'.

Great, maybe we should combine or interlink them? Should we keep them in the original order that PWG has, or reorder them?

drdhaval2785 commented 8 years ago

Great, maybe we should combine or interlink them? Should we keep them in the original order that PWG has, or reorder them?

That's for Jim to do. I have kept the line numbers intact (lines from pwg.txt), so he should be able to do whatever magic he wants to.
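As an illustration of what the preserved line numbers make possible, the sketch below separates the records of one root into blobs and finds the forms the blobs share. It is a heuristic, not an existing script: it assumes that a main entry and its supplement are far apart in pwg.txt, and the threshold `max_gap=100` and the name `split_blobs` are arbitrary choices made here.

```python
def split_blobs(records, max_gap=100):
    """Group (combined, line) pairs into runs of nearby line numbers.

    A new blob starts whenever the gap to the previous record exceeds
    max_gap. The threshold is a guess, not a documented property of pwg.txt.
    """
    blobs = []
    for combined, line in sorted(records, key=lambda r: r[1]):
        if blobs and line - blobs[-1][-1][1] <= max_gap:
            blobs[-1].append((combined, line))
        else:
            blobs.append([(combined, line)])
    return blobs


# A few 'han' records from this thread: main entry, then supplement.
records = [
    ("atihan", 250030), ("vyatihan", 250032), ("anuhan", 250034),
    ("anuhan", 264802), ("antarhan", 264804), ("Ahan", 264806),
]
main, supplement = split_blobs(records)

# Forms present in both blobs are candidates for interlinking.
shared = {c for c, _ in main} & {c for c, _ in supplement}
print(shared)
```

Because each blob keeps its records sorted by line number, the original PWG order within a blob is preserved.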

drdhaval2785 commented 8 years ago

TODO after preliminary scrutiny by @gasyoun

Can we code that if ', partic.' comes right next to an upasarga, we note down that this form is only possible in participles?
Can we extract these caus., desid.?

drdhaval2785 commented 8 years ago

Re jYA: you saw only the main entry and not the supplement entry. The supplement entry has ati very next. http://www.sanskrit-lexicon.uni-koeln.de/scans/PWGScan/2013/web/webtc/servepdf.php?page=5-1449

gasyoun commented 8 years ago

You saw only main entry and not the supplement entry. Supplement entry has ati very next.

Now I see, thanks.

In addition to ', partic.' I would note down act. = parasmaipada and med. = atmanepada.

caus. desid. intens.

Are of interest as well.
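A minimal sketch of how these abbreviations could be picked out of a subentry text. The mapping follows the equivalences given in this thread (act. = parasmaipada, med. = atmanepada; caus., desid., intens. for the causative, desiderative, and intensive); the names `LABELS` and `find_labels` are hypothetical, and real PWG text may need more context than a plain regular expression provides.

```python
import re

# Abbreviation -> label, as discussed in this thread.
LABELS = {
    "partic.": "participle",
    "act.": "parasmaipada",
    "med.": "atmanepada",
    "caus.": "causative",
    "desid.": "desiderative",
    "intens.": "intensive",
}

# The lookbehind keeps "act." from matching inside a longer word.
LABEL_RE = re.compile(
    r"(?<![A-Za-z])(partic|act|med|caus|desid|intens)\.", re.IGNORECASE
)


def find_labels(text: str):
    """Return the labels found in text, in order of appearance."""
    return [LABELS[m.group(1).lower() + "."] for m in LABEL_RE.finditer(text)]


print(find_labels("— pali (pari) caus. herumgehen lassen"))
```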

funderburkjim commented 8 years ago

I agree that caus., desid., etc. would be good to mine. But just as well the individual verb forms.

The same goes even for MW, which has thousands of verb forms. The difficulty is dealing with parsing all that information, including dealing with the abbreviated forms, e.g. -te to give a common specific example.

Currently, such parsing is beyond us.

gasyoun commented 8 years ago

Currently, such parsing is beyond us.

Exactly, that is why I do not speak about

just as well the individual verb forms.

All I ask is

caus. desid. intens.

Because now it is inconsistent.

MW, which has thousands of verb forms.

Let it be. Till 2030 we do not care.

funderburkjim commented 8 years ago

@gasyoun Please explain in detail how 'caus. desid. intens.' is inconsistent, and what you suggest to resolve the inconsistency.

gasyoun commented 8 years ago

caus.

Is bold now (in middle of text).

Caus.

Is not bold (in beginning of new line). Make it bold.

desid. intens.

Nowhere bold now. Should be everywhere.

funderburkjim commented 8 years ago

@gasyoun I need to see a snip of scan to understand the point you are making with respect to caus, desid and intens being bold or not bold. Please also indicate the headword and other details so I can find the example in the digitization.

gasyoun commented 8 years ago

I need to see a snip of scan to understand the point you are making with respect to caus, desid and intens being bold or not bold.

The scan has nothing. It is added metadata. There is no need for a scan, because there is nothing in the book.

funderburkjim commented 8 years ago

I'm still trying to understand the point you are making.

Is the point that we should add metadata so that we can pick out, from verb records, the causal, intensive, and desiderative forms from PWG?

gasyoun commented 8 years ago

we should add metadata so that we can pick out, from verb records, the causal, intensive, and desiderative forms from PWG ?

Yes, that as well, but picking them out is not the most important part. We should make it easy to browse by eye. And since it is partly implemented right now, it makes no sense to leave it halfway. These are obvious; the other verb forms are harder to RegEx. I would grab the low-hanging fruit and forget the rest.
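The low-hanging fruit could look roughly like this. The `<b>…</b>` tags are a stand-in: the markup actually used for bold in the PWG digitization is not shown in this thread, so treat the tag name and the function `bold_labels` as assumptions.

```python
import re

# Match caus./Caus., desid., intens. unless already preceded by "<b>".
# "<b>" is a hypothetical bold tag, not necessarily the real markup.
LABEL = re.compile(r"(?<!<b>)\b(?:[Cc]aus|[Dd]esid|[Ii]ntens)\.")


def bold_labels(text: str) -> str:
    """Wrap every not-yet-bold label in the assumed bold tags."""
    return LABEL.sub(lambda m: f"<b>{m.group(0)}</b>", text)


before = "— sam <b>caus.</b> x. Caus. y desid. z"
print(bold_labels(before))
```

Labels that are already bold are left alone, so running the function twice gives the same result as running it once.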

gasyoun commented 6 years ago

@funderburkjim after MW, it's always good to get back to PWG, right?

sanskrit-lexicon / CORRECTIONS, issue #161