sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

simple input: misc problems #171

Open funderburkjim opened 7 years ago

funderburkjim commented 7 years ago

shringara problem

No results. Should get SfNgAra.

gasyoun commented 7 years ago

No results

hardly understand why.

funderburkjim commented 7 years ago

First test suite

In solving the shringara problem, I made a separate version ('v1.0d', 'd' for development). Before making this development version the standard version, I thought we should compare results on some test cases (drawn from #156). This comment shows a comparison between the two versions.
@gasyoun and others who may be interested: take a look at the differences and give your opinion on whether the dev version should be installed. My opinion is that the dev version is probably better, but I want to take another close look and get the opinions of others before pulling the trigger.

; Compare test1 for v1.0  and v1.0d
; 35 Cases . 23 cases are same for v1.0  and v1.0d
; Case 1:  vishnu mw (EQ)
v1.0  = vizRu
v1.0d = vizRu
; Case 2:  vishnuh ap90 (EQ)
v1.0  = 
v1.0d = 
; Case 3:  deva skd (EQ)
v1.0  = devaH,devaM,deva,devA
v1.0d = devaH,devaM,deva,devA
; Case 4:  shringara mw (NEQ)
v1.0  = 
v1.0d = SfNgAra,SfNgArA    <<< note this is a good solution  in the dev version
; Case 5:  Sankara mw (NEQ)
v1.0  = Sakra,SAkra,sakara,saMkram,SAMkara,saMkara,SaMkarA,SaNkara,SaMkara,Sakara
v1.0d = Sakra,SAkra,sakAra,SakAra,sAkAra,sakara,saMkAra,saMkara,SAMkara,SaNkara,SaMkara,SaMkarA,Sakara
; Case 6:  matri mw (NEQ)
v1.0  = mAtf
v1.0d = mAtra,mAtf,mAWara,maTara,maTra,natra,nAtra
; Case 7:  krishna mw (EQ)
v1.0  = kfzRa,kfzRA
v1.0d = kfzRa,kfzRA
; Case 8:  krushna mw (EQ)
v1.0  = kfzRa,kfzRA
v1.0d = kfzRa,kfzRA
; Case 9:  punya mw (EQ)
v1.0  = puRya,puRyA
v1.0d = puRya,puRyA
; Case 10:  kapa mw (EQ)
v1.0  = kaPa,kapa,kApA
v1.0d = kaPa,kapa,kApA
; Case 11:  kafa mw (NEQ)
v1.0  = kara,Kara,kaPa,kAra,kArA,KarA,kapa,Karam,kAram,kApA,KAra
v1.0d = kara,Kara,kaPa,kAra,kArA,KarA,kapa,KAra,kApA
; Case 12:  sanskrit mw (EQ)
v1.0  = saMskfta
v1.0d = saMskfta
; Case 13:  acarya mw (NEQ)
v1.0  = Acarya
v1.0d = AcArya,AcAryA,Acarya
; Case 14:  acarya ap90 (NEQ)
v1.0  = 
v1.0d = AcAryaH
; Case 15:  kut mw (EQ)
v1.0  = kUwa,kuTa,kuwa,kuw,kuWa,kuta,kuT,kUw,kut
v1.0d = kUwa,kuTa,kuwa,kuw,kuWa,kuta,kuT,kUw,kut
; Case 16:  dukha mw (EQ)
v1.0  = DUka,DukA,Duka
v1.0d = DUka,DukA,Duka
; Case 17:  dhaval mw (EQ)
v1.0  = Davala
v1.0d = Davala
; Case 18:  ashvah mw (EQ)
v1.0  = aSva,asva,ASva
v1.0d = aSva,asva,ASva
; Case 19:  hari mw (NEQ)
v1.0  = hari,hrI,harI
v1.0d = hara,hari,hAra,hrI,harI,hAri
; Case 20:  karmman mw (NEQ)
v1.0  = karman,karuRam
v1.0d = karman,kArmaRa,karuRam
; Case 21:  sangama mw (EQ)
v1.0  = sagaRa,sAMgama,sAgama,saMgama,saGana
v1.0d = sagaRa,sAMgama,sAgama,saMgama,saGana
; Case 22:  aja mw (EQ)
v1.0  = aja,ajA,Aja,Ajan,AjA
v1.0d = aja,ajA,Aja,Ajan,AjA
; Case 23:  manduka mw (NEQ)
v1.0  = maRqUka,maDuka,maDUka,mADUka,mADuka,mARqUka,maDukA,nAnduka,mAduka,maRquka
v1.0d = maRqUka,maDuka,maDUka,mADUka,mADuka,mARqUka,maDukA,maRquka,nAnduka,mAduka
; Case 24:  go mw (EQ)
v1.0  = go
v1.0d = go
; Case 25:  gai mw (EQ)
v1.0  = gE
v1.0d = gE
; Case 26:  pook mw (EQ)
v1.0  = Puka,puka
v1.0d = Puka,puka
; Case 27:  danda mw (EQ)
v1.0  = daRqa,dada,DanDa,DAnDA,daDan,dARqA,daRqA,dadA,daDa,dARqa
v1.0d = daRqa,dada,DanDa,DAnDA,daDan,dARqA,daRqA,dadA,daDa,dARqa
; Case 28:  karma mw (NEQ)
v1.0  = karman,karRa,karuRA,karuRa,karuRam,karuma,Karma,karma
v1.0d = karman,karRa,karuRA,karuRa,kArma,Karma,karuma,karma,kArRa
; Case 29:  kartri skd (NEQ)
v1.0  = kartrI
v1.0d = kAritA,karttA,kartra
; Case 30:  atman skd (EQ)
v1.0  = AtmA
v1.0d = AtmA
; Case 31:  ushas skd (NEQ)
v1.0  = uzA,Uza,UzaM,UzaH,UzA,uzaH,uza,uzaM    <<< This may be case where v1.0 is better?
v1.0d = uzA,UzA
; Case 32:  gunavat skd (EQ)
v1.0  = guRavAn
v1.0d = guRavAn
; Case 33:  hanumat skd (EQ)
v1.0  = hanUmAn,hanumAn
v1.0d = hanUmAn,hanumAn
; Case 34:  hanumat mw (EQ)
v1.0  = hanUmat,hanumat
v1.0d = hanUmat,hanumat
; Case 35:  hanuman mw (EQ)
v1.0  = hanuman
v1.0d = hanuman
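
A minimal sketch of how a comparison report like the one above could be produced. The function name `compare_versions` and the dictionary-based inputs are assumptions for illustration, not the script actually used. Results are compared as ordered lists, so a pure reordering (as in Case 23) counts as NEQ.

```python
def compare_versions(cases, old_results, new_results,
                     old_name="v1.0", new_name="v1.0d"):
    """Return report lines comparing two result sets case by case.

    cases       -- list of (input_word, dictionary_code) pairs
    old_results -- dict mapping each case to its ordered list of results
    new_results -- same, for the other version
    """
    # Pad the labels to the same width so the '=' signs line up.
    old_label = f"{old_name:<5}"
    new_label = f"{new_name:<5}"
    same = sum(1 for c in cases if old_results[c] == new_results[c])
    lines = [f"; {len(cases)} Cases . {same} cases are same "
             f"for {old_label} and {new_label}"]
    for n, case in enumerate(cases, start=1):
        word, dictcode = case
        old, new = old_results[case], new_results[case]
        flag = "EQ" if old == new else "NEQ"  # list equality is order-sensitive
        lines.append(f"; Case {n}:  {word} {dictcode} ({flag})")
        lines.append(f"{old_label} = {','.join(old)}")
        lines.append(f"{new_label} = {','.join(new)}")
    return lines


# Two cases taken from the listing above.
cases = [("vishnu", "mw"), ("shringara", "mw")]
old = {("vishnu", "mw"): ["vizRu"], ("shringara", "mw"): []}
new = {("vishnu", "mw"): ["vizRu"],
       ("shringara", "mw"): ["SfNgAra", "SfNgArA"]}
report = compare_versions(cases, old, new)
print("\n".join(report))
```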
gasyoun commented 7 years ago

Jim, let's go for v1.0d.

funderburkjim commented 7 years ago

v1.0d installed

Now v1.0 has the changes shown above.

funderburkjim commented 7 years ago

word_frequency improvement: normalizing spelling

The improvement is in the ordering of results. When comparing the improved version to the previous version for the 35 test suite cases shown above, 3 of the cases changed, and the changes are desirable.

Sankara mw
OLD: Sakra,SAkra,sakAra,SakAra,sAkAra,sakara,saMkAra,saMkara,SAMkara,
        SaNkara,SaMkara,SaMkarA,Sakara
NEW: SaMkara,SaNkara,Sakra,saMkara,SAMkara,SAkra,sakAra,SakAra,saMkAra,
          SaMkarA,sakara,sAkAra,Sakara

sangama mw
OLD:  sagaRa,sAMgama,sAgama,saMgama,saGana
NEW: saMgama,sagaRa,sAMgama,sAgama,saGana

hanumat mw
OLD: hanUmat,hanumat
NEW: hanumat,hanUmat

What was changed?

In a word, normalized spelling. To elaborate: the variant selection process generates words with normalized spelling. I'll review the details of this normalization below. But the point is that when results are ordered, we are looking up the word frequencies of words with normalized spelling. However, the prior version of the word frequency data had about 7% of the words with un-normalized spellings, so we couldn't find the word frequencies for these words. The solution is to alter the word frequency data so that the words have normalized spellings. Out of 67050 word-frequency words, 4120 of them had a change in spelling; the list is in word_frequency_diff.txt.

The reason the 3 examples above have an improvement in ordering is that now their normalized spellings can be found in the word frequency file. For instance, in the original word frequency file we find 'hanumant'; the normalized spelling is 'hanumat', which is now found in the new word frequency file.
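
A minimal sketch of why the spelling mismatch affected ordering. It assumes results are sorted by descending frequency, with a frequency of 0 for any word missing from the frequency data. The function name `order_by_frequency` and the count 50 are hypothetical; the actual ordering code may differ.

```python
def order_by_frequency(variants, word_freq):
    """Sort variants by descending frequency; unknown words count as 0.

    sorted() is stable, so variants with equal frequency keep
    their original relative order.
    """
    return sorted(variants, key=lambda w: -word_freq.get(w, 0))


variants = ["hanUmat", "hanumat"]

# Old frequency data: the un-normalized spelling 'hanumant' never matches
# the normalized variant 'hanumat', so both variants tie at 0.
old_freq = {"hanumant": 50}   # hypothetical count
print(order_by_frequency(variants, old_freq))

# New frequency data: the entry is stored under the normalized spelling,
# so 'hanumat' is found and moves to the front.
new_freq = {"hanumat": 50}    # hypothetical count
print(order_by_frequency(variants, new_freq))
```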

funderburkjim commented 7 years ago

Review of normalized spelling rules

The rules are embedded in function normalize_key of hwnorm1c.py. Here is a paraphrase of the rules:

purpose of hwnorm1c database

Different dictionaries use various spelling conventions in presenting headwords. For instance, in the AP90 dictionary we find the headword matiH, while in MW the corresponding headword is spelled mati.

The hwnorm1c database keeps track of all the original spellings in various dictionaries of words which have the same normalized spelling.

For example, we find this among the 385,000 records of hwnorm1c:

mati:  <NORMALIZED SPELLING>

  Dictionaries whose headword is spelled 'mati'
 BEN,BHS,BOP,BUR,CAE,CCS,GRA,INM,MD,MW,MW72,PE,PUI,PW,PWG,SHS,STC,VCP,WIL,YAT

  Dictionaries whose headword is spelled 'matiH'
 AP,AP90,SKD
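
A minimal sketch of a record like the one above as a nested mapping: normalized spelling, then original spelling, then the dictionaries using it. `toy_normalize` is a hypothetical stand-in that handles only the final-visarga case from the mati/matiH example. It is not the real normalize_key of hwnorm1c.py, whose rules are more extensive.

```python
from collections import defaultdict


def toy_normalize(headword):
    """Hypothetical one-rule normalizer: drop a final visarga 'H' (SLP1)."""
    return headword[:-1] if headword.endswith("H") else headword


def build_index(headwords):
    """Group (dictionary_code, headword) pairs by normalized spelling.

    Returns {normalized: {original_spelling: [dictionary_code, ...]}}.
    """
    index = defaultdict(lambda: defaultdict(list))
    for dictcode, spelling in headwords:
        index[toy_normalize(spelling)][spelling].append(dictcode)
    return index


# A few of the dictionaries listed in the record above.
sample = [("MW", "mati"), ("AP90", "matiH"), ("SKD", "matiH"), ("PW", "mati")]
index = build_index(sample)
for spelling, dictcodes in index["mati"].items():
    print(spelling, ",".join(dictcodes))
```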
funderburkjim commented 7 years ago

Simple search normalization bug corrected

In the generation of possible spelling variants with the simple search, one of the last steps is to generate normalized spellings of the generated variants. Then, using these normalized spellings, we search for the cases that appear in some dictionary, and finally we filter on the cases that occur in the particular dictionary the user has chosen in the UI.
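
A minimal sketch of the last two steps just described, assuming the normalized variants have already been generated. The function name `filter_variants` and the layout of `index` ({normalized: {original spelling: [dictionary codes]}}) are assumptions for illustration, not the actual php code.

```python
def filter_variants(normalized_variants, index, chosen_dict):
    """Keep variants found in some dictionary, then filter to one dictionary.

    Returns the original headword spellings used by chosen_dict.
    """
    results = []
    for key in normalized_variants:
        spellings = index.get(key)
        if spellings is None:
            continue  # this variant occurs in no dictionary
        for spelling, dictcodes in spellings.items():
            if chosen_dict in dictcodes:
                results.append(spelling)
    return results


# Toy index with a single normalized key, based on the mati example.
index = {"mati": {"mati": ["MW", "PW"], "matiH": ["AP90", "SKD"]}}
candidates = ["mati", "mAti"]  # 'mAti' is not in the toy index

print(filter_variants(candidates, index, "SKD"))
print(filter_variants(candidates, index, "MW"))
```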

One of the main steps in generating the normalized spelling is: anusvara + consonant -> homorganic nasal + consonant. For example aMga -> aNga (using SLP1 spelling), aMca -> aYca, kaMqa -> kaRqa, etc.

However, there was a bug in the php program used in simple search for doing this step of the normalization. What it was doing was replacing the anusvara with an empty string. For example aMga -> aga, etc.
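
A minimal Python sketch of the intended step and of the buggy behavior, using SLP1. The function names are hypothetical, and the sketch covers only anusvara before the 20 non-nasal stops; the actual php code and the full normalization rules may handle more cases.

```python
import re

# SLP1 stops grouped by place of articulation, with the matching nasal.
NASAL_FOR_CLASS = {
    "kKgG": "N",  # velar
    "cCjJ": "Y",  # palatal
    "wWqQ": "R",  # retroflex
    "tTdD": "n",  # dental
    "pPbB": "m",  # labial
}
HOMORGANIC = {stop: nasal
              for stops, nasal in NASAL_FOR_CLASS.items()
              for stop in stops}

# Anusvara 'M' immediately followed by one of the stops above.
ANUSVARA_BEFORE_STOP = re.compile(r"M(?=[kKgGcCjJwWqQtTdDpPbB])")


def homorganic_nasal(word):
    """Intended rule: anusvara + stop -> homorganic nasal + stop."""
    return ANUSVARA_BEFORE_STOP.sub(
        lambda m: HOMORGANIC[word[m.end()]], word)


def buggy_drop_anusvara(word):
    """The bug as described: the anusvara is replaced by an empty string."""
    return ANUSVARA_BEFORE_STOP.sub("", word)


for w in ["aMga", "aMca", "kaMqa"]:
    print(w, "->", homorganic_nasal(w), "| buggy:", buggy_drop_anusvara(w))
```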

As a result of this error, the final simple search result would often contain extra variants with a missing nasal. Using the test suite, here are cases where this difference manifested:

Sankara mw 
OLD: Sakra,SAkra,sakAra,SakAra,sAkAra,sakara,saMkAra,saMkara,
         SAMkara,SaNkara,SaMkara,SaMkarA,Sakara
NEW: SaMkara,SaNkara,saMkara,SAMkara,saMkAra,SaMkarA

sangama mw 
OLD: sagaRa,sAMgama,sAgama,saMgama,saGana
NEW: saMgama,sAMgama

manduka mw 
OLD: maRqUka,maDuka,maDUka,mADUka,mADuka,mARqUka,maDukA,maRquka,nAnduka,mAduka
NEW: maRqUka,mARqUka,maRquka,nAnduka

danda mw 
OLD: daRqa,dada,DanDa,DAnDA,daDan,dARqA,daRqA,dadA,daDa,dARqa
NEW: daRqa,DAnDA,DanDa,dARqa,daRqA,dARqA
gasyoun commented 7 years ago

One of the main steps

Longest to calculate? Most variants?

As a result of this error, the final simple search result would often contain extra variants with a missing nasal.

Understood, would be interesting if @drdhaval2785 would give us more interesting words to add to the test suite.

gasyoun commented 6 years ago

Goksheera 0 no results found [gokzIra:CAE,MW,PW,PWG]

Ignoring first capital letters still remains an issue.

gasyoun commented 6 years ago

vrtt

6 results: vṛt vṛta bhṛt varta bhṛta vṛtta

vriti

5 results: vṛti vartī bhṛti vṛtti varti

@SergeA should vrtt give the same variations as vriti as well?

gasyoun commented 1 year ago

@SergeA and @funderburkjim seem not to have opened it in 5 years ))

gasyoun commented 1 year ago

error

alamkara gives an error. I have never seen errors before, @funderburkjim.

funderburkjim commented 1 year ago

I do NOT find this error!

[image]

Are there any messages in the developer console??