Open funderburkjim opened 7 years ago
No results
hardly understand why.
In solving the shringara problem, I made a separate version ('v1.0d', 'd' for development).
Before making this development version the standard version, I thought we should compare
results on some test cases (drawn from #156). This comment shows a comparison between the
two versions.
@gasyoun and others who may be interested: take a look at the differences and give your
opinion on whether the dev version should be installed. My opinion is that the dev version is
probably better, but I want to take another close look and get the opinions of others before pulling the
trigger.
; Compare test1 for v1.0 and v1.0d
; 35 Cases . 23 cases are same for v1.0 and v1.0d
; Case 1: vishnu mw (EQ)
v1.0 = vizRu
v1.0d = vizRu
; Case 2: vishnuh ap90 (EQ)
v1.0 =
v1.0d =
; Case 3: deva skd (EQ)
v1.0 = devaH,devaM,deva,devA
v1.0d = devaH,devaM,deva,devA
; Case 4: shringara mw (NEQ)
v1.0 =
v1.0d = SfNgAra,SfNgArA <<< note this is a good solution in the dev version
; Case 5: Sankara mw (NEQ)
v1.0 = Sakra,SAkra,sakara,saMkram,SAMkara,saMkara,SaMkarA,SaNkara,SaMkara,Sakara
v1.0d = Sakra,SAkra,sakAra,SakAra,sAkAra,sakara,saMkAra,saMkara,SAMkara,SaNkara,SaMkara,SaMkarA,Sakara
; Case 6: matri mw (NEQ)
v1.0 = mAtf
v1.0d = mAtra,mAtf,mAWara,maTara,maTra,natra,nAtra
; Case 7: krishna mw (EQ)
v1.0 = kfzRa,kfzRA
v1.0d = kfzRa,kfzRA
; Case 8: krushna mw (EQ)
v1.0 = kfzRa,kfzRA
v1.0d = kfzRa,kfzRA
; Case 9: punya mw (EQ)
v1.0 = puRya,puRyA
v1.0d = puRya,puRyA
; Case 10: kapa mw (EQ)
v1.0 = kaPa,kapa,kApA
v1.0d = kaPa,kapa,kApA
; Case 11: kafa mw (NEQ)
v1.0 = kara,Kara,kaPa,kAra,kArA,KarA,kapa,Karam,kAram,kApA,KAra
v1.0d = kara,Kara,kaPa,kAra,kArA,KarA,kapa,KAra,kApA
; Case 12: sanskrit mw (EQ)
v1.0 = saMskfta
v1.0d = saMskfta
; Case 13: acarya mw (NEQ)
v1.0 = Acarya
v1.0d = AcArya,AcAryA,Acarya
; Case 14: acarya ap90 (NEQ)
v1.0 =
v1.0d = AcAryaH
; Case 15: kut mw (EQ)
v1.0 = kUwa,kuTa,kuwa,kuw,kuWa,kuta,kuT,kUw,kut
v1.0d = kUwa,kuTa,kuwa,kuw,kuWa,kuta,kuT,kUw,kut
; Case 16: dukha mw (EQ)
v1.0 = DUka,DukA,Duka
v1.0d = DUka,DukA,Duka
; Case 17: dhaval mw (EQ)
v1.0 = Davala
v1.0d = Davala
; Case 18: ashvah mw (EQ)
v1.0 = aSva,asva,ASva
v1.0d = aSva,asva,ASva
; Case 19: hari mw (NEQ)
v1.0 = hari,hrI,harI
v1.0d = hara,hari,hAra,hrI,harI,hAri
; Case 20: karmman mw (NEQ)
v1.0 = karman,karuRam
v1.0d = karman,kArmaRa,karuRam
; Case 21: sangama mw (EQ)
v1.0 = sagaRa,sAMgama,sAgama,saMgama,saGana
v1.0d = sagaRa,sAMgama,sAgama,saMgama,saGana
; Case 22: aja mw (EQ)
v1.0 = aja,ajA,Aja,Ajan,AjA
v1.0d = aja,ajA,Aja,Ajan,AjA
; Case 23: manduka mw (NEQ)
v1.0 = maRqUka,maDuka,maDUka,mADUka,mADuka,mARqUka,maDukA,nAnduka,mAduka,maRquka
v1.0d = maRqUka,maDuka,maDUka,mADUka,mADuka,mARqUka,maDukA,maRquka,nAnduka,mAduka
; Case 24: go mw (EQ)
v1.0 = go
v1.0d = go
; Case 25: gai mw (EQ)
v1.0 = gE
v1.0d = gE
; Case 26: pook mw (EQ)
v1.0 = Puka,puka
v1.0d = Puka,puka
; Case 27: danda mw (EQ)
v1.0 = daRqa,dada,DanDa,DAnDA,daDan,dARqA,daRqA,dadA,daDa,dARqa
v1.0d = daRqa,dada,DanDa,DAnDA,daDan,dARqA,daRqA,dadA,daDa,dARqa
; Case 28: karma mw (NEQ)
v1.0 = karman,karRa,karuRA,karuRa,karuRam,karuma,Karma,karma
v1.0d = karman,karRa,karuRA,karuRa,kArma,Karma,karuma,karma,kArRa
; Case 29: kartri skd (NEQ)
v1.0 = kartrI
v1.0d = kAritA,karttA,kartra
; Case 30: atman skd (EQ)
v1.0 = AtmA
v1.0d = AtmA
; Case 31: ushas skd (NEQ)
v1.0 = uzA,Uza,UzaM,UzaH,UzA,uzaH,uza,uzaM <<< This may be case where v1.0 is better?
v1.0d = uzA,UzA
; Case 32: gunavat skd (EQ)
v1.0 = guRavAn
v1.0d = guRavAn
; Case 33: hanumat skd (EQ)
v1.0 = hanUmAn,hanumAn
v1.0d = hanUmAn,hanumAn
; Case 34: hanumat mw (EQ)
v1.0 = hanUmat,hanumat
v1.0d = hanUmat,hanumat
; Case 35: hanuman mw (EQ)
v1.0 = hanuman
v1.0d = hanuman
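A comparison like the one above can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, and it assumes the EQ/NEQ label is order-sensitive, which is what Case 23 (manduka) suggests, since both versions return the same variants there in a different order.

```python
def parse_variants(text):
    """Split a comma-separated result line into an ordered list of variants."""
    return [v.strip() for v in text.split(",") if v.strip()]


def compare_case(old, new):
    """Label a case EQ only if both versions return the same variants in the same order."""
    return "EQ" if parse_variants(old) == parse_variants(new) else "NEQ"


def reordered_only(old, new):
    """True when the two results contain the same variants but in a different order."""
    a, b = parse_variants(old), parse_variants(new)
    return a != b and sorted(a) == sorted(b)


# Result lines copied from the comparison above (v1.0, v1.0d).
CASES = {
    "vishnu mw": ("vizRu", "vizRu"),
    "shringara mw": ("", "SfNgAra,SfNgArA"),
    "manduka mw": (
        "maRqUka,maDuka,maDUka,mADUka,mADuka,mARqUka,maDukA,nAnduka,mAduka,maRquka",
        "maRqUka,maDuka,maDUka,mADUka,mADuka,mARqUka,maDukA,maRquka,nAnduka,mAduka",
    ),
}

for name, (old, new) in CASES.items():
    note = " (order only)" if reordered_only(old, new) else ""
    print(f"{name}: {compare_case(old, new)}{note}")
```

Flagging order-only differences separately makes it easier to see which NEQ cases actually gain or lose variants.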
Jim, let's go for v1.0d.
Now v1.0 has the changes shown above.
The improvement is in the ordering of results. When comparing the improved version to the previous version for the 35 test suite cases shown above, 3 of the cases changed, and the changes are desirable.
Sankara mw
OLD: Sakra,SAkra,sakAra,SakAra,sAkAra,sakara,saMkAra,saMkara,SAMkara,
SaNkara,SaMkara,SaMkarA,Sakara
NEW: SaMkara,SaNkara,Sakra,saMkara,SAMkara,SAkra,sakAra,SakAra,saMkAra,
SaMkarA,sakara,sAkAra,Sakara
sangama mw
OLD: sagaRa,sAMgama,sAgama,saMgama,saGana
NEW: saMgama,sagaRa,sAMgama,sAgama,saGana
hanumat mw
OLD: hanUmat,hanumat
NEW: hanumat,hanUmat
In a word, normalized spelling. To elaborate: the variant selection process generates words with normalized spelling. I'll review the details of this normalization below. The point is that when results are ordered, we look up the word frequencies of words with normalized spelling. However, the prior version of the word frequency data had about 7% of the words with un-normalized spellings, so we couldn't find the word frequencies for these words. The solution is to alter the word frequency data so that the words have normalized spellings. Out of 67050 word-frequency words, 4120 of them had a change in spelling; the list is in word_frequency_diff.txt.
The reason the 3 examples above have an improvement in ordering is that now their normalized spellings can be found in the word frequency file. For instance, in the original word frequency file we find 'hanumant'; the normalized spelling is 'hanumat', which is now found in the new word frequency file.
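The effect on ordering can be illustrated with a small Python sketch. Everything here is hypothetical except the spellings: the counts are made up, `normalize_stub` covers only the 'hanumant' -> 'hanumat' case, and summing counts when two spellings collapse to one key is an assumption, not necessarily what the real word frequency file does.

```python
def normalize_stub(word):
    """Hypothetical stand-in for normalization: only the final 'ant' -> 'at' case."""
    return word[:-3] + "at" if word.endswith("ant") else word


def rekey(freq, normalize):
    """Rebuild a frequency table so that its keys are normalized spellings."""
    merged = {}
    for word, count in freq.items():
        key = normalize(word)
        merged[key] = merged.get(key, 0) + count  # assumption: collapsed spellings add up
    return merged


def order_by_frequency(candidates, freq):
    """Most frequent first; words missing from the table count as 0."""
    return sorted(candidates, key=lambda w: freq.get(w, 0), reverse=True)


# Made-up counts, for illustration only.
RAW_FREQ = {"hanumant": 120, "hanUmat": 15}
NORMALIZED_FREQ = rekey(RAW_FREQ, normalize_stub)

candidates = ["hanUmat", "hanumat"]
print("OLD:", order_by_frequency(candidates, RAW_FREQ))
print("NEW:", order_by_frequency(candidates, NORMALIZED_FREQ))
```

With the un-normalized table, 'hanumat' is not found and falls to the end; once the table is re-keyed, it can be ranked by its own frequency.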
The rules are embedded in the function normalize_key of hwnorm1c.py.
Here is a paraphrase of the rules:
Mg -> Ng, because 'N' is the nasal in the 'k K g G N' varga (all SLP1 spellings)
karmman -> karman
vfta is the 'kta' of root 'vf'. Our rule confounds these two words
vanaM -> vana
aSvaH -> aSva
guruH -> guru
matiH -> mati
pattra -> patra
hanumant -> hanumat
pracC -> praC
Different dictionaries use various spelling conventions in presenting headwords. For instance, in the AP90 dictionary we find the headword matiH, while in MW the corresponding headword is spelled mati.
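The paraphrased rules can be approximated in Python as follows. This is not the actual normalize_key of hwnorm1c.py: the function name `normalize_sketch` is hypothetical, and each pattern is a guess generalized from a single example in the list above. The rule behind the vfta remark is not recoverable from the paraphrase, so it is left out.

```python
import re

# SLP1 stops of each varga mapped to that varga's nasal.
NASAL_FOR = {
    stop: nasal
    for stops, nasal in (("kKgG", "N"), ("cCjJ", "Y"), ("wWqQ", "R"),
                         ("tTdD", "n"), ("pPbB", "m"))
    for stop in stops
}


def normalize_sketch(word):
    """Apply guessed versions of the paraphrased rules, in a fixed order."""
    # Mg -> Ng: anusvara before a stop becomes the homorganic nasal.
    word = re.sub(r"M([kKgGcCjJwWqQtTdDpPbB])",
                  lambda m: NASAL_FOR[m.group(1)] + m.group(1), word)
    # karmman -> karman: a consonant doubled after 'r' is written once (assumed scope).
    word = re.sub(r"r([kKgGcCjJwWqQtTdDpPbBmnyvl])\1", r"r\1", word)
    # pattra -> patra.
    word = word.replace("ttr", "tr")
    # pracC -> praC.
    word = word.replace("cC", "C")
    # vanaM -> vana, aSvaH -> aSva: drop a final anusvara or visarga.
    word = re.sub(r"[MH]$", "", word)
    # hanumant -> hanumat (assumed to apply to any final 'ant').
    word = re.sub(r"ant$", "at", word)
    return word


for raw in ("karmman", "vanaM", "aSvaH", "guruH", "matiH",
            "pattra", "hanumant", "pracC"):
    print(raw, "->", normalize_sketch(raw))
```

Applying the rules in one fixed order keeps the result deterministic; the real function may order or scope them differently.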
The hwnorm1c database keeps track of all the original spellings in various dictionaries of words which have the same normalized spelling.
For example, we find this among the 385,000 records of hwnorm1c:
mati: <NORMALIZED SPELLING>
Dictionaries whose headword is spelled 'mati'
BEN,BHS,BOP,BUR,CAE,CCS,GRA,INM,MD,MW,MW72,PE,PUI,PW,PWG,SHS,STC,VCP,WIL,YAT
Dictionaries whose headword is spelled 'matiH'
AP,AP90,SKD
In the generation of possible spelling variants with the simple search, one of the last steps is to generate normalized spellings of the generated variants. Then, using these normalized spellings, we search for the cases that appear in some dictionary, and finally we filter on the cases that occur in the particular dictionary the user has chosen in the UI.
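The lookup and filter steps can be pictured with a small in-memory index in Python. The nested-dict layout and the function name are assumptions made for illustration (the real hwnorm1c storage format is not described here); the 'mati' record is the one quoted above.

```python
# normalized spelling -> {original headword spelling: dictionaries using it}
HWNORM_SAMPLE = {
    "mati": {
        "mati": ["BEN", "BHS", "BOP", "BUR", "CAE", "CCS", "GRA", "INM", "MD", "MW",
                 "MW72", "PE", "PUI", "PW", "PWG", "SHS", "STC", "VCP", "WIL", "YAT"],
        "matiH": ["AP", "AP90", "SKD"],
    },
}


def headwords_in(dictionary, normalized_variants, index=HWNORM_SAMPLE):
    """Return the original spellings that the chosen dictionary has for these variants."""
    code = dictionary.upper()
    found = []
    for key in normalized_variants:
        # Variants that appear in no dictionary are simply skipped.
        for spelling, dictionaries in index.get(key, {}).items():
            if code in dictionaries:
                found.append(spelling)
    return found


# 'maTi' stands for a generated variant that is not in the sample index.
print(headwords_in("mw", ["mati", "maTi"]))
print(headwords_in("ap90", ["mati", "maTi"]))
```

The same normalized key thus leads to 'mati' for MW and to 'matiH' for AP90.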
One of the main steps in generating the normalized spelling is: anusvara + consonant -> homorganic nasal + consonant. For example, aMga -> aNga (using SLP1 spelling), aMca -> aYca, kaMqa -> kaRqa, etc.
However, there was a bug in the PHP program used in simple search for this step of the normalization: it replaced the anusvara with an empty string. For example, aMga -> aga, etc.
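The difference between the intended step and the bug can be shown side by side. The sketch below is in Python rather than PHP and does not reproduce the original code; both function names are hypothetical, and it assumes that anusvara is left unchanged before anything other than the twenty varga stops.

```python
import re

# Nasal of each varga and the stops it goes with (SLP1).
VARGAS = {"N": "kKgG", "Y": "cCjJ", "R": "wWqQ", "n": "tTdD", "m": "pPbB"}
STOP_TO_NASAL = {stop: nasal for nasal, stops in VARGAS.items() for stop in stops}
ANUSVARA_STOP = re.compile("M([" + "".join(STOP_TO_NASAL) + "])")


def homorganic(word):
    """Intended behaviour: anusvara + stop -> homorganic nasal + stop."""
    return ANUSVARA_STOP.sub(lambda m: STOP_TO_NASAL[m.group(1)] + m.group(1), word)


def buggy(word):
    """Behaviour described for the bug: the anusvara is replaced by an empty string."""
    return ANUSVARA_STOP.sub(lambda m: m.group(1), word)


for word in ("aMga", "aMca", "kaMqa"):
    print(word, "-> correct:", homorganic(word), "| buggy:", buggy(word))
```

Because the buggy form loses the nasal altogether, it can coincide with unrelated headwords that never had one.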
As a result of this error, the final simple search result would often contain extra variants with a missing nasal. Using the test suite, here are cases where this difference manifested:
Sankara mw
OLD: Sakra,SAkra,sakAra,SakAra,sAkAra,sakara,saMkAra,saMkara,
SAMkara,SaNkara,SaMkara,SaMkarA,Sakara
NEW: SaMkara,SaNkara,saMkara,SAMkara,saMkAra,SaMkarA
sangama mw
OLD: sagaRa,sAMgama,sAgama,saMgama,saGana
NEW: saMgama,sAMgama
manduka mw
OLD: maRqUka,maDuka,maDUka,mADUka,mADuka,mARqUka,maDukA,maRquka,nAnduka,mAduka
NEW: maRqUka,mARqUka,maRquka,nAnduka
danda mw
OLD: daRqa,dada,DanDa,DAnDA,daDan,dARqA,daRqA,dadA,daDa,dARqa
NEW: daRqa,DAnDA,DanDa,dARqa,daRqA,dARqA
"One of the main steps"
Longest to calculate? Most variants?
"As a result of this error, the final simple search result would often contain extra variants with a missing nasal."
Understood. It would be interesting if @drdhaval2785 would give us more interesting words to add to the test suite.
Goksheera 0 no results found [gokzIra:CAE,MW,PW,PWG]
Ignoring first capital letters still remains an issue.
vrtt
6 results: vṛt vṛta bhṛt varta bhṛta vṛtta
vriti
5 results: vṛti vartī bhṛti vṛtti varti
@SergeA should vrtt give the same variations as vriti as well?
@SergeA and @funderburkjim seem not to have opened it in 5 years ))
alamkara gives an error. Never seen errors before, @funderburkjim
I do NOT find this error!
Are there any messages in the developer console??
shringara problem
No results. should get SfNgAra