verbs01 - Githubissues

funderburkjim commented 4 years ago

The verbs01 directory aims

to identify the entries in the Cappeller Sanskrit Wörterbuch which are verbs, and
to provide a correspondence between the headwords of these entries and verb entries of the Monier-Williams dictionary.
to identify the verb entries which further have upasargas, and to provide a correspondence between these upasargas and the prefixed verb entries of MW.

The comments here will focus on the ccs_preverb1 report.
ccs_preverb1_deva is a Devanagari version of the report.

Currently, 1009 of the 29986 entries of CCS are identified as verbs. 484 of these verbs have upasargas, and a total of 2115 upasargas are identified.

All but 9 of the verbs are found to correspond with MW verbs. All but 129 of the upasargas are found to correspond with MW prefixed verbs.

funderburkjim commented 4 years ago

The report is organized according to the CCS entries identified as verbs; each such entry is considered a 'case':

;; Case 0001: L=191, k1=aNkay, k2=aNkay, code=V, #upasargas=0, mw=aNk (diff)

This record provides

L = the Cologne ID
k1 = the primary headword,
k2 = the full headword (usually same as k1)
a code, here always V
the number of upasargas identified within the ccs entry
the MW headword believed to correspond to this entry
- There are 9 cases (mw=?) where no correspondence is currently identified.
a 'flag' comparing k1 to mw:
- (same) means the ccs headword spelling is the same as the spelling of the MW entry believed to correspond to the ccs entry (650 cases)
- (diff) means the k1 and mw spellings differ (350 cases)
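
A minimal sketch of how such a case header could be read into its fields. The pattern is inferred only from the sample header lines shown in this thread, so the function name `parse_case_header` and the field keys are assumptions, not part of the verbs01 code; in particular, how the flag appears for the mw=? cases is not shown here, so the flag is treated as optional.

```python
import re

# Pattern inferred from the sample header lines in this thread;
# the real report may contain variations this sketch does not handle.
CASE_RE = re.compile(
    r'^;; Case (?P<case>\d+): L=(?P<L>[^,]+), k1=(?P<k1>[^,]+), '
    r'k2=(?P<k2>[^,]+), code=(?P<code>\w+), '
    r'#upasargas=(?P<n>\d+)(?: \((?P<yes>\d+)/(?P<no>\d+)\))?, '
    r'mw=(?P<mw>\S+)(?: \((?P<flag>\w+)\))?$'
)

def parse_case_header(line):
    """Return the fields of a ';; Case ...' line as a dict, or None."""
    m = CASE_RE.match(line.strip())
    if m is None:
        return None
    d = m.groupdict()
    return {
        'case': d['case'],
        'L': d['L'],                    # Cologne ID, kept as a string
        'k1': d['k1'],
        'k2': d['k2'],
        'code': d['code'],
        'n_upasargas': int(d['n']),
        # (matched/unmatched) counts appear only when upasargas exist
        'n_matched': int(d['yes']) if d['yes'] is not None else None,
        'n_unmatched': int(d['no']) if d['no'] is not None else None,
        'mw': d['mw'],                  # '?' when no correspondence is known
        'flag': d['flag'],              # 'same' or 'diff' in the samples
    }

header = parse_case_header(
    ';; Case 0001: L=191, k1=aNkay, k2=aNkay, code=V, #upasargas=0, mw=aNk (diff)'
)
```

Keeping L as a string avoids guessing whether every Cologne ID is a plain integer.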

funderburkjim commented 4 years ago

preverb

When there are upasargas for a CCS entry, these are grouped below the case. Consider the verb 'tar' (über etwas setzen) (to cross over):

;; Case 0306: L=8281, k1=tar, k2=tar, code=V, #upasargas=11 (10/1), mw=tF (diff)
01        ava        tar               avatar                avatF yes ava+tF
02          A        tar                 Atar                  AtF yes A+tF
03         ud        tar                uttar                 uttF yes ud+tF
04       prod        tar              prottar               prottF yes pra+ud+tF
05      samud        tar             samuttar              samuttF yes sam+ud+tF
06         ni        tar                nitar                 nitF yes ni+tF
07        nis        tar               nistar                nistF yes nis+tF
08        pra        tar               pratar                pratF yes pra+tF
09      vipra        tar             vipratar              vipratF no 
10         vi        tar                vitar                 vitF yes vi+tF
11        sam        tar               saMtar                saMtF yes sam+tF

Note that 'tar' in CCS is said to correspond to 'tF' in MW. There are 11 upasargas found; 10 have been matched to MW prefixed verbs and one (vipra) has not been matched. That is, CCS has 'vipra' as upasarga for 'tF', but MW does not have a prefixed verb for 'tF' with prefix vipra; i.e., vipratF is not a prefixed verb in MW.

The listing for upasargas shows:

xx a sequence number for the upasargas for the verb
the upasarga
the verb
a likely spelling of the prefixed verb obtained by joining the upasarga with k1
a likely spelling of the prefixed verb obtained by joining the upasarga with the mw root spelling
yes/no indicating whether the prefixed verb is found as an entry in MW dictionary
When the prefixed verb is in MW, a parse of the MW prefixed verb spelling is given.

Currently, 1986 of the upasargas are identified with MW prefixed verb entries (search ' yes') and 129 are not identified with MW prefixed verb entries (search ' no').
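
The row layout described above can be read by splitting on whitespace. The following is a sketch under that assumption; `parse_upasarga_row` and `count_matches` are hypothetical helper names, and the sample rows are copied from the 'tar' case above. Counting the yes/no column this way is one alternative to searching the report for ' yes' and ' no'.

```python
from collections import Counter

def parse_upasarga_row(line):
    """Parse one upasarga row of the report; return None for other lines.

    Assumes whitespace-separated columns, as in the sample rows above.
    """
    parts = line.split()
    if len(parts) < 6 or not parts[0].isdigit():
        return None
    seq, upasarga, verb, ccs_form, mw_form, found = parts[:6]
    if found not in ('yes', 'no'):
        return None
    return {
        'seq': seq,
        'upasarga': upasarga,
        'verb': verb,
        'ccs_form': ccs_form,   # upasarga joined with k1
        'mw_form': mw_form,     # upasarga joined with the MW root spelling
        'in_mw': found == 'yes',
        # the parse (e.g. 'pra+ud+tF') is present only for 'yes' rows
        'parse': parts[6] if len(parts) > 6 else None,
    }

def count_matches(lines):
    """Count rows by their yes/no column, skipping non-row lines."""
    counts = Counter()
    for line in lines:
        row = parse_upasarga_row(line)
        if row is not None:
            counts['yes' if row['in_mw'] else 'no'] += 1
    return counts

sample = [
    ';; Case 0306: L=8281, k1=tar, k2=tar, code=V, #upasargas=11 (10/1), mw=tF (diff)',
    '04       prod        tar              prottar               prottF yes pra+ud+tF',
    '09      vipra        tar             vipratar              vipratF no ',
    '10         vi        tar                vitar                 vitF yes vi+tF',
]
counts = count_matches(sample)
```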

funderburkjim commented 4 years ago

identification of verbs

In contrast to CAE, where previous verb identification markup was present, in CCS verbs must be identified by some other patterns. The basic pattern used is that, within the Devanagari text of an entry, there should appear a present tense 3rd person singular verb ending 'ti' or 'te'. The regex used is u'¦.*t[ie][,) ]*#}'.
The first line of the text should also NOT include a pattern indicating a noun or adverb. Also, several false-positive entries are excluded, in the ccs_verb_exclude.txt file.
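
As an illustration, the regex quoted above can be applied as follows. This is a sketch, not the verbs01 code: the function name `is_verb_candidate`, the one-headword-per-entry form of the exclusion set, and the two sample entry strings are assumptions made for the example. The noun/adverb check on the first line is omitted because its pattern is not given here.

```python
import re

# The regex quoted above, written as a Python 3 string.
VERB_RE = re.compile('¦.*t[ie][,) ]*#}')

def is_verb_candidate(k1, entry_text, excluded=frozenset()):
    """Return True if the entry text matches the 'ti'/'te' ending pattern.

    `excluded` stands in for the headwords listed in ccs_verb_exclude.txt;
    the actual format of that file is not described in this thread.
    Note that '.' does not cross newlines, so the match must occur
    within a single line of the entry text.
    """
    if k1 in excluded:
        return False
    return VERB_RE.search(entry_text) is not None

# Hypothetical entry lines, made up for illustration only.
verb_like = 'aNkay¦ {#aNkayati#} brandmarken'
noun_like = 'aNka¦ m. Haken'

results = (
    is_verb_candidate('aNkay', verb_like),
    is_verb_candidate('aNka', noun_like),
    is_verb_candidate('aNkay', verb_like, excluded={'aNkay'}),
)
```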

It is possible that there are some verb entries in CCS that have been missed by the above pattern matching. However, a percentage comparison between CCS and CAE suggests that there are not many, if any, CCS verbs that have been missed by the pattern-matching method.

CCS 1009 / 29986 = 3.4% of entries identified as verbs
CAE 1078 / 40067 = 2.7% of entries identified as verbs

Still, it would be good to do an exclusion analysis for CCS (and for CAE also, for that matter) to more directly address the completeness of the verb identification.

funderburkjim commented 4 years ago

upasarga identification - the problem

There is no clear identification of upasargas within verb entries of CCS. Rather, upasargas appear only as Devanagari text, and there is also much other Devanagari text (such as different verb forms, participles, etc.). In this scan snippet (from CCS verb 'tar'), we see several Devanagari text instances, some being upasargas (or compound upasargas) and some being related non-upasarga Sanskrit words.

funderburkjim commented 4 years ago

upasarga identification - a solution

So the approach taken to identify upasargas within verb entries makes use of the list of upasargas that appear within the CAE dictionary. This list, in cae_upasargas.txt, contains 142 upasargas (the base upasargas along with various compound upasargas) that were previously identified as occurring within the verb entries of Cappeller's Sanskrit-English dictionary. The file also contains 8 further compound upasargas that were noticed to occur within one or another CCS entry.

Then, for a given verb entry of CCS, all the Devanagari words of the entry were examined, and those words appearing in the upasarga list were considered to be the upasargas for that verb entry of CCS.

Further, this computed list of upasargas for each entry was manually compared with the underlying text of the CCS entry to confirm the list. The resulting list appears in the ccs_preverb0 file; this file is the basis of the upasargas of the ccs_preverb1 report.
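
The automatic step of this approach (before the manual check) can be sketched as below. Several things here are assumptions: that Sanskrit text in the digitization is delimited as {#...#}, which is inferred from the '#}' in the verb regex earlier in this thread; the function name `find_upasargas`; and the sample entry text and small upasarga set, which are made up for illustration rather than taken from cae_upasargas.txt.

```python
import re

# Assumed markup: Sanskrit text enclosed as {#...#}.
SANSKRIT_SPAN_RE = re.compile(r'\{#(.*?)#\}')

def find_upasargas(entry_text, known_upasargas):
    """Return the known upasargas found among the Sanskrit words of an
    entry, in order of first appearance and without duplicates."""
    found = []
    for span in SANSKRIT_SPAN_RE.findall(entry_text):
        # A span may hold several words separated by spaces or punctuation.
        for word in re.split(r'[\s,;]+', span):
            if word in known_upasargas and word not in found:
                found.append(word)
    return found

# Hypothetical data for illustration only.
known = {'ava', 'A', 'ud', 'vipra'}
text = ('tar¦ {#tarati#} übersetzen. {#ava#} herabsteigen. '
        '{#A#} überschreiten. p.p. {#tIrRa#}. {#ava#} nochmals.')
upasargas = find_upasargas(text, known)
```

Words such as verb forms and participles fall through because they are not in the known set, which is why the result still needs the manual comparison against the entry text described above.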

gasyoun commented 4 years ago

There are 11 upasargas found; 10 have been matched to MW prefixed verbs and one (vipra) has not been matched (that is, CCS has 'vipra' as upasarga for 'tF', but MW does not have a prefixed verb for 'tF' with prefix vipra; i.e., vipratF is not a prefixed verb in MW.

Perfect explanation.

But still it would be good to do an exclusion analysis for CCS (and CAE also, for that matter) to be more directly address the completeness of the verb identification.

Let it be. We will get there one day.

contains 142 upasargas (the base upasargas along with various compound upasargas)

142+8, interesting. In 2015 @drdhaval2785 wrote I have a readymade list, made out of the upasargArthasiddhAntacandrikA: $upasarga_combinations = array("ati,atinis,atipra,ativi,ativyA,atisam,atyati,atyaBi,atyA,atyud,atyupa,aDi,aDini,aDinis,aDivi,aDyava,aDyA,aDyupa,anu,anuni,anunis,anuparA,anupari,anuparyA,anupra,anuprati,anuvi,anuvyava,anuvyA,anusam,anusampra,anUd,anvapa,anvava,anvA,apa,apani,apanis,apaparA,apaparyA,apapra,apavyA,apA,apAti,api,apipari,apod,apyati,aBi,aBini,aBinis,aBiparA,aBipari,aBiparyA,aBipra,aBivi,aBivyA,aBisamA,aBisam,aByati,aByaDi,aByanu,aByapa,aByava,aByA,aByudA,aByud,aByupa,aByupA,aByupAva,ava,avani,avA,A,utpra,udava,udA,ud,udvi,unni,upa,upani,upanis,upanyA,upapari,upaparyA,upapra,upavi,upavyA,upasaṁni,upasamA,upasam,upA,upAti,upAva,upodA,upod,upopa,duHsam,duranu,durava,durA,durud,durupa,durni,duzpari,duzpra,dus,ni,nipra,nirati,niraDi,niranu,nirapa,niraBi,niraBi,nirava,nirupA,nirvi,nivyA,nizpra,nisu,nis,nyA,parA,pari,parini,parinis,paripra,parivi,parivyA,parisam,paryaDi,paryanu,paryava,paryA,paryud,paryupa,pra,praNi,prati,pratini,pratinis,pratiparA,pratipari,pratipra,prativi,prativyA,pratisam,pratyaDi,pratyanu,pratyapa,pratyapi,pratyaBi,pratyava,pratyA,pratyudA,pratyud,pratyupa,pratyupA,pravi,pravyA,prasam,prA,prADi,prod,vi,vini,vinis,viparA,vipari,viparyA,vipra,viprati,visam,vyati,vyanu,vyanvA,vyapa,vyapA,vyaBi,vyava,vyA,vyud,vyupa,saṁvi,saṁvyava,saṁvyA,sanni,samati,samaDi,samanu,samanuvi,samanvA,samapa,samapi,samaBi,samaBivyA,samaBisam,samaBisampra,samaByava,samaByA,samaByud,samava,samava,samavA,samA,samudA,samud,samupa,samupA,sam,samparA,sampari,sampra,samprati,samprA,samprod,sampvari,su,supari,suvi,susamA,svanu,svaBi,svaByA,");

sanskrit-lexicon / CCS

verbs01 #1
