Open funderburkjim opened 9 years ago
I thought it would be good to add the label 'research' to this issue. Although I looked at this github documentation, I wasn't able to create a new label 'research' for this repository. Maybe someone who knows how to create a new label, and who thinks 'research' would be a useful label, could do this.
@funderburkjim Out of hundreds of issues, I would say this is number one. I will document which dictionaries use which conventions; that is itself a huge task. I can say that KCH, which is not at Cologne, uses the same convention as MW. Conversely, PWG and PWK use the "nest" style.
A similar issue relates to compound words
Indeed.
@drdhaval2785 is the only other person around, but I would like him to continue working on dictionary conventions https://github.com/sanskrit-lexicon/CORRECTIONS/issues/43 So maybe it's just time for @parjanya to show his Linux skills.
Now we have a 'Research' label. @funderburkjim If you want to create a new label, click here
And then
@drdhaval2785 Re adding new label. Got it. I see where to click now. Thanks for the tip.
@funderburkjim and @gasyoun A modest beginning made in https://github.com/sanskrit-lexicon/alternateheadwords/tree/master/data/PWG.
#Step 4. Analysing ehw2.txt for correction codes.
Total 0 entries with code 0
Total 4607 entries with code 1
Total 0 entries with code 2
Total 3 entries with code 3
Total 50 entries with code 4
Total 6 entries with code 5
Total 69 entries with code 8
Total 2456 entries with code 9
Total 305 entries with code 10
Total 0 entries with code 99
So total 4607 entries (code 1) are those which already match a known headword in sanhw1.txt
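The tally above could be reproduced with a short script along these lines. This is only a sketch: it assumes each record in ehw2.txt has exactly five '@'-separated fields (root, prefix, headword, a line number, correction code), as in the sample records quoted in this thread. The names `EhwRecord`, `parse_record` and `tally_codes` are made up for illustration and are not taken from the repository.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class EhwRecord(NamedTuple):
    root: str       # verbal root, e.g. "han"
    prefix: str     # extracted prefix, e.g. "anu"
    headword: str   # proposed headword, e.g. "anuhan"
    line_no: int    # assumed to be a line number in pwg.txt
    code: int       # correction code reported in Step 4


def parse_record(text: str) -> EhwRecord:
    """Split one '@'-delimited record; raises ValueError unless it has 5 fields."""
    root, prefix, headword, line_no, code = text.strip().split("@")
    return EhwRecord(root, prefix, headword, int(line_no), int(code))


def tally_codes(lines: Iterable[str]) -> Counter:
    """Count records per correction code, skipping blank lines."""
    return Counter(parse_record(line).code for line in lines if line.strip())


# Sample records copied from this thread (not the full file).
sample = [
    "han@anu@anuhan@264802@1",
    "han@antar@hantar@264804@1",
    "han@A@Ahan@264806@1",
    "han@upod@upodhan@264808@9",
]

for code, count in sorted(tally_codes(sample).items()):
    print(f"Total {count} entries with code {code}")
```

To run it over the real file, one would pass the open file object instead of `sample`, for example `tally_codes(open("ehw2.txt", encoding="utf-8"))`.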
धन्यवादः (thank you)
han@antar@hantar@264804@2
hantar? Was it not an upasarga?
Let's review one entry, han.
As of now you have 4 upasargas:
han@anu@anuhan@264802@1
han@antar@hantar@264804@1
han@A@Ahan@264806@1
han@upod@upodhan@264808@9
There is, for example, no ati in your list. But even if there were, there is an issue. The entry reads "— ati, partic. °hata", which means (because of °hata) that it can be atihata, but not as in the case of apajahi.
Now I understand: you have taken L=122726 instead of L=115989, that is, the supplement instead of the main entry. I remember that there was a list of pages which are supplements. They might have additional upasargas, but mostly additional meanings.
aNg@(pari)@(pari)aNg@1375@10
(pari) -> pari
— pali (pari) caus. herumgehen lassen
pU@ati@atipU@100430@1
pU@anu@anupU@100432@1
pU@aBi@aBipU@100434@1
pU@A@ApU@100436@1
pU@samA@samApU@100438@1
pU@ud@utpU@100440@1
pU@ni@nipU@100442@9
pU@nis@nizpU@100444@1
pU@pratinis@pratinizpU@100446@1
pU@parA@parApU@100448@1
pU@pari@paripU@100450@1
pU@vi@vipU@100452@1
pU@sam@saMpU@100454@1
pU@aBisam@aBisaMpU@100456@1
Can we code it so that, if ", partic." comes right after an upasarga, we note that this form is attested only in participles? Not sure about L=78836, but I guess "mit pra vgl. prapavaṇa fg." means that pra occurs as well, which means that supplements might indeed add upasargas.
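A rough sketch of how the ", partic." check might be coded is given below. It assumes that a sub-entry line starts with a dash followed by the prefix, as in the "— ati, partic. °hata" example quoted above; the actual markup in pwg.txt may differ, and `participle_only_prefix` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed line shape: dash, prefix, comma, then the abbreviation "partic."
PARTIC_RE = re.compile(r"^\s*[—-]\s*(?P<prefix>[^\s,]+),\s*partic\.")


def participle_only_prefix(line: str) -> Optional[str]:
    """Return the prefix if the line marks it as participle-only, else None."""
    match = PARTIC_RE.match(line)
    return match.group("prefix") if match else None


print(participle_only_prefix("— ati, partic. °hata"))
print(participle_only_prefix("— pali (pari) caus. herumgehen lassen"))
```

The first call returns the prefix and the second returns `None`, because "caus." rather than ", partic." follows the prefix there. Such a flag could then be stored next to the extracted record instead of changing the headword itself.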
jYA@ati@atijYA@159936@9
jYA@anu@anujYA@159938@1
jYA@pratyaByanu@pratyaByanujYA@159940@1
jYA@apa@apajYA@159942@1
jYA@aBi@aBijYA@159944@1
jYA@pratyaBi@pratyaBijYA@159946@1
jYA@ava@avajYA@159948@1
jYA@mAvajYa@mAvajYajYA@159948@10
jYA@mAvajAnIhi@mAvajAnIhijYA@159948@10
jYA@A@AjYA@159950@1
jYA@upa@upajYA@159952@1
jYA@nis@nirjYA@159954@9
jYA@nirjYAtamadgati@nirjYAtamadgati@159954@8
jYA@pari@parijYA@159956@1
jYA@pratipra@pratiprajYA@159958@1
jYA@a\nye@a\nyejYA@159958@10
jYA@vA\@vA\jYA@159958@10
jYA@vE@vEjYA@159958@10
jYA@ni\Dimagu^ptaM@ni\Dimagu^ptaMjYA@159958@10
jYA@vi\ndanti\@vi\ndanti\jYA@159958@10
jYA@na@jYAna@159958@1
jYA@vA\@vA\jYA@159958@10
jYA@prati\@prati\jYA@159958@10
jYA@prajA^nanti@prajA^nantijYA@159958@10
jYA@prati@pratijYA@159960@1
jYA@vi@vijYA@159962@1
jYA@prativi@prativijYA@159964@1
jYA@saMvi@saMvijYA@159966@1
jYA@jYAta@jYAta@159966@1
jYA@sam@saMjYA@159968@1
After a manual check I think anu is the first; where is ati from? I did not locate it. After anu comes abhyanu, but it is not in Dhaval's list. Next is pratyabhyanu, both in the printed book and in Dhaval's extraction.
"sam caus." means that sam is recorded only in caus.
Can we extract these caus., desid.?
hantar? Was it not an upasarga?
It still isn't. I don't know any book which says antar is an upasarga.
I don't know any book which says antar is an upasarga.
Ha, ok, I agree, but some parts of speech tend to acquire new functions over time and are used rather similarly.
@gasyoun A friendly piece of advice: Use ehw3.txt and not 2.txt. Some further refinements for sandhi cases are made in 3.txt.
Let's review one entry, han. As per now you have 4 upasargas:
You seem to have half-read it.
There are two blobs of 'han'. Main entry
han@ati@atihan@250030@1
han@vyati@vyatihan@250032@1
han@anu@anuhan@250034@1
han@antar@antarhan@250036@1
han@apa@apahan@250038@1
han@vyapa@vyapahan@250040@1
han@api@apihan@250042@1
han@aBi@aBihan@250044@1
han@ava@avahan@250046@1
han@aDyava@aDyavahan@250048@1
han@anvava@anvavahan@250050@1
han@pratyava@pratyavahan@250052@1
han@A@Ahan@250054@1
han@apA@apAhan@250056@1
han@aByA@aByAhan@250058@1
han@udA@udAhan@250060@1
han@upA@upAhan@250062@1
han@pratyA@pratyAhan@250064@1
han@vyA@vyAhan@250066@1
han@prativyA@prativyAhan@250068@9
han@samA@samAhan@250070@1
han@ud@udhan@250072@9
han@upod@upodhan@250074@9
han@samud@samudhan@250076@9
han@upa@upahan@250078@1
han@samupa@samupahan@250080@9
han@ni@nihan@250082@1
han@aBini@aBinihan@250084@1
han@upani@upanihan@250086@1
han@pariRi@pariRihan@250088@1
han@praRi@praRihan@250090@1
han@pratini@pratinihan@250092@1
han@vini@vinihan@250094@1
han@saMni@saMnihan@250096@1
han@nis@nirhan@250098@1
han@atinis@atinirhan@250100@1
han@aDinis@aDinirhan@250102@1
han@parinis@parinirhan@250104@1
han@vinis@vinirhan@250106@9
han@parA@parAhan@250108@1
han@pari@parihan@250110@1
han@aBipari@aBiparihan@250112@1
han@pra@prahaR@250114@1
han@aBipra@aBiprahaR@250116@1
han@nipra@niprahaR@250118@1
han@vipra@viprahaR@250120@9
han@prati@pratihan@250122@1
han@saMprati@saMpratihan@250124@9
han@vi@vihan@250126@1
han@anuvi@anuvihan@250128@1
han@Avi@Avihan@250130@1
han@pravi@pravihaR@250132@9
han@prativi@prativihan@250134@9
han@sam@saMhan@250136@1
han@aBisam@aBisaMhan@250138@1
han@pratisam@pratisaMhan@250140@9
han@visam@visaMhan@250142@9
Supplement entry
han@anu@anuhan@264802@1
han@antar@antarhan@264804@1
han@A@Ahan@264806@1
han@upod@upodhan@264808@9
Use ehw3.txt and not 2.txt
Oh, understood. The main issues still remain.
There are two blobs of 'han'.
Great, maybe we should combine or interlink them? Should we keep them in the original order that PWG has, or reorder them?
Great, maybe we should combine or interlink them? Should we keep them in the original order that PWG has, or reorder them?
That's for Jim to do. I have kept the line numbers intact (lines from pwg.txt), so he should be able to do whatever magic he wants.
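One possible way to interlink the main and supplement blobs without reordering anything is to index the records by (root, prefix) and keep the line numbers as pointers back into pwg.txt. The sketch below assumes the fourth field of each record is that line number; `interlink` is an illustrative name, not an existing script.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def interlink(lines: Iterable[str]) -> Dict[Tuple[str, str], List[int]]:
    """Map (root, prefix) to every line number where that pair occurs."""
    index = defaultdict(list)
    for line in lines:
        root, prefix, _headword, line_no, _code = line.strip().split("@")
        index[(root, prefix)].append(int(line_no))
    # Sorting keeps the original pwg.txt order within each pair.
    return {key: sorted(nums) for key, nums in index.items()}


# A few records from the main entry and the supplement entry of han.
records = [
    "han@anu@anuhan@250034@1",
    "han@antar@antarhan@250036@1",
    "han@upod@upodhan@250074@9",
    "han@anu@anuhan@264802@1",
    "han@antar@antarhan@264804@1",
    "han@upod@upodhan@264808@9",
]

links = interlink(records)
print(links[("han", "anu")])
```

A pair with more than one line number then marks a prefix that appears in both the main entry and a supplement, and the display code can decide whether to show the two together or separately.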
TODO after preliminary scrutiny by @gasyoun
re jYA You saw only main entry and not the supplement entry. Supplement entry has ati very next. http://www.sanskrit-lexicon.uni-koeln.de/scans/PWGScan/2013/web/webtc/servepdf.php?page=5-1449
You saw only main entry and not the supplement entry. Supplement entry has ati very next.
Now I see, thanks.
In addition to ", partic.", I would note down act. = parasmaipada and med. = atmanepada.
caus., desid. and intens. are of interest as well.
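As a starting point for noting these labels down, the abbreviations could be matched with a small table and a regular expression. This is a sketch only: the glosses for act. and med. come from the comment above, the others are the usual expansions, and the pattern will also match these strings where they are not grammatical labels, so results would need manual review. `LABELS` and `find_labels` are hypothetical names.

```python
import re
from typing import List

# Abbreviation -> gloss. act./med. follow the comment above.
LABELS = {
    "act.": "parasmaipada",
    "med.": "atmanepada",
    "caus.": "causative",
    "desid.": "desiderative",
    "intens.": "intensive",
    "partic.": "participle",
}

# Match a whole abbreviation that is not preceded by another word character.
LABEL_RE = re.compile(r"(?<!\w)(act|med|caus|desid|intens|partic)\.", re.IGNORECASE)


def find_labels(text: str) -> List[str]:
    """Return the abbreviations found in text, lowercased, in order of appearance."""
    return [match.group(0).lower() for match in LABEL_RE.finditer(text)]


for line in ["— pali (pari) caus. herumgehen lassen", "— ati, partic. °hata"]:
    print(line, "->", [LABELS[label] for label in find_labels(line)])
```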
I agree that caus., desid., etc. would be good to mine. But just as well the individual verb forms.
The same goes even for MW, which has thousands of verb forms. The difficulty is parsing all that information, including the abbreviated forms, e.g. -te, to give a common specific example.
Currently, such parsing is beyond us.
Currently, such parsing is beyond us.
Exactly, that is why I do not speak about
just as well the individual verb forms.
All I ask is
caus. desid. intens.
Because now it is inconsistent.
MW, which has thousands of verb forms.
Let it be. Till 2030 we do not care.
@gasyoun Please explain in detail how 'caus. desid. intens.' is inconsistent, and what you suggest to resolve the inconsistency.
caus. is bold now (in the middle of text).
"- Caus." is not bold (at the beginning of a new line). Make it bold.
desid. and intens. are nowhere bold now. They should be bold everywhere.
@gasyoun I need to see a snip of scan to understand the point you are making with respect to caus, desid and intens being bold or not bold. Please also indicate the headword and other details so I can find the example in the digitization.
I need to see a snip of scan to understand the point you are making with respect to caus, desid and intens being bold or not bold.
The scan shows nothing; the bolding is added metadata. There is no need for a scan snip, because there is nothing in the book.
I'm still trying to understand the point you are making.
Is the point that we should add metadata so that we can pick out, from verb records, the causal, intensive, and desiderative forms from PWG ?
we should add metadata so that we can pick out, from verb records, the causal, intensive, and desiderative forms from PWG ?
Yes, that as well, but picking them out is not the most important part. We should make it easy to browse by eye. And since it is partly implemented right now, it makes no sense to leave it half done. These labels are the obvious ones; the other verb forms are harder to match with a RegEx. I would grab the low-hanging fruit and forget the rest.
@funderburkjim after MW, it's always good to get back to PWG, right?
This issue was prompted by an observation of @zaaf2 in #150:
OBS: the entries of verbal roots in PWG have many subentries containing the various verbs derived with prefixes. I think it is indispensable to make such prefix-root-verbs searchable headwords, as in all the other dictionaries.
Some first thoughts:
If someone wants to take on this problem for one of the dictionaries, we can put our heads together to develop practical programming tactics.
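As one possible tactic for making prefix-root verbs searchable, the extracted records could be turned into a lookup table from headword to the place where the sub-entry sits. The sketch below keeps only code 1 records by default, on the reasoning that those already match a known headword in sanhw1.txt; that filter, the five-field record layout and the name `build_headword_index` are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple


def build_headword_index(
    lines: Iterable[str],
    accepted_codes: FrozenSet[int] = frozenset({1}),
) -> Dict[str, List[Tuple[str, str, int]]]:
    """Map each accepted headword to its (root, prefix, line number) occurrences."""
    index = defaultdict(list)
    for line in lines:
        root, prefix, headword, line_no, code = line.strip().split("@")
        if int(code) in accepted_codes:
            index[headword].append((root, prefix, int(line_no)))
    return dict(index)


records = [
    "han@anu@anuhan@250034@1",
    "han@ud@udhan@250072@9",
    "han@sam@saMhan@250136@1",
    "han@anu@anuhan@264802@1",
]

index = build_headword_index(records)
print(index.get("saMhan"))
print(index.get("udhan"))
```

A search for saMhan would then lead to the han record, while udhan is left out until its code 9 status has been reviewed.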