sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

simple-sanskrit search #156

Open funderburkjim opened 7 years ago

funderburkjim commented 7 years ago

This continues the research begun in #8.

The url of the version 0.1 is this.

This takes into account word frequency, and uses the MW dictionary instead of Wilson.

I think this pretty well represents the work that Ilya and Marcis began.

@gasyoun Agree?

If so, we can start tinkering with the algorithm that generates the alternates.

gasyoun commented 7 years ago

When I entered BakzaMkAra, all I got was "Loading...". With just sankara I got 720 variants: wow, quite a few.

zaMkara 200 51
saMkara 200 36
saMkAra 200 1
zaMkarA 200 0
zAMkara 200 0
Sankara 404 -1
SankarA 404 -1
SankaRa 404 -1
zAGkhARRA 404 -1
zAGkhARa 404 -1
zAGkhARRa 404 -1
SankaRA 404 -1
zAGkhARA 404 -1
SankaRRA 404 -1
SankARRA 404 -1
Sankhara 404 -1
SankharA 404 -1
SankhaRa 404 -1
SankARRa 404 -1
SankARA 404 -1
zAGkhArA 404 -1
SankAra 404 -1
SankArA 404 -1
SankARa 404 -1
SankaRRa 404 -1
zAGkhaRRA 404 -1
zAGkaRA 404 -1
zAGkaRRa 404 -1
zAGkaRRA 404 -1
zAGkAra 404 -1
zAGkaRa 404 -1
zAGkarA 404 -1
zAJkhARRa 404 -1
zAJkhARRA 404 -1
zAGkara 404 -1
zAGkArA 404 -1
zAGkARa 404 -1
zAGkhaRa 404 -1
zAGkhaRA 404 -1
zAGkhaRRa 404 -1
SankhaRA 404 -1
zAGkharA 404 -1
zAGkhara 404 -1
zAGkARA 404 -1
zAGkARRa 404 -1
zAGkARRA 404 -1
zAGkhAra 404 -1
SankhAra 404 -1
SaMkhARA 404 -1
SaMkhARRa 404 -1
SaMkhARRA 404 -1
SaNkara 404 -1
SaMkhARa 404 -1
SaMkhArA 404 -1
SaMkhaRA 404 -1
SaMkhaRRa 404 -1
SaMkhaRRA 404 -1
SaMkhAra 404 -1
SaNkarA 404 -1
SaNkaRa 404 -1
SaNkARa 404 -1
SaNkARA 404 -1
SaNkARRa 404 -1
SaNkARRA 404 -1
SaNkArA 404 -1
SaNkAra 404 -1
SaNkaRA 404 -1
SaNkaRRa 404 -1
SaNkaRRA 404 -1
SaMkhaRa 404 -1
SaMkharA 404 -1
SankhARRa 404 -1
SankhARRA 404 -1
SaMkara 404 -1
SaMkarA 404 -1
SankhARA 404 -1
SankhARa 404 -1
SankhaRRA 404 -1
zAJkhARA 404 -1
SankhArA 404 -1
SaMkaRa 404 -1
SaMkaRA 404 -1
SaMkARA 404 -1
SaMkARRa 404 -1
SaMkARRA 404 -1
SaMkhara 404 -1
SaMkARa 404 -1
SaMkArA 404 -1
SaMkaRRa 404 -1
SaMkaRRA 404 -1
SaMkAra 404 -1
SankhaRRa 404 -1
zAJkhAra 404 -1
zAMkARa 404 -1
zAMkARA 404 -1
zAMkARRa 404 -1
zAMkARRA 404 -1
zAMkArA 404 -1
zAMkAra 404 -1
zAMkaRa 404 -1
zAMkaRA 404 -1
zAMkaRRa 404 -1
zAMkaRRA 404 -1
zAMkhara 404 -1
zAMkharA 404 -1
zAMkhArA 404 -1
zAMkhARa 404 -1
zAMkhARA 404 -1
zAMkhARRa 404 -1
zAMkhAra 404 -1
zAMkhaRRA 404 -1
zAMkhaRa 404 -1
zAMkhaRA 404 -1
zAMkhaRRa 404 -1
zAMkarA 404 -1
zAnkhARRA 404 -1
zAnkARa 404 -1
zAnkARA 404 -1
zAnkARRa 404 -1
zAnkARRA 404 -1
zAnkArA 404 -1
zAnkAra 404 -1
zAnkaRA 404 -1
zAnkaRRa 404 -1
zAnkaRRA 404 -1
zAnkhara 404 -1
zAnkharA 404 -1
zAnkhArA 404 -1
zAnkhARa 404 -1
zAnkhARA 404 -1
zAnkhARRa 404 -1
zAnkhAra 404 -1
zAnkhaRRA 404 -1
zAnkhaRa 404 -1
zAnkhaRA 404 -1
zAnkhaRRa 404 -1
zAMkhARRA 404 -1
zANkara 404 -1
zAJkaRRA 404 -1
zAJkAra 404 -1
zAJkArA 404 -1
zAJkARa 404 -1
zAJkaRRa 404 -1
zAJkaRA 404 -1
zANkhARRA 404 -1
zAJkara 404 -1
zAJkarA 404 -1
zAJkaRa 404 -1
zAJkARA 404 -1
zAJkARRa 404 -1
zAJkhaRRa 404 -1
zAJkhaRRA 404 -1
SaNkhara 404 -1
zAJkhArA 404 -1
zAJkhaRA 404 -1
zAJkhaRa 404 -1
zAJkARRA 404 -1
zAJkhara 404 -1
zAJkharA 404 -1
zANkhARRa 404 -1
zANkhARA 404 -1
zANkAra 404 -1
zANkArA 404 -1
zANkARa 404 -1
zANkARA 404 -1
zANkaRRA 404 -1
zANkaRRa 404 -1
zANkarA 404 -1
zANkaRa 404 -1
zANkaRA 404 -1
zANkARRa 404 -1
zANkARRA 404 -1
zANkhaRRA 404 -1
zANkhAra 404 -1
zANkhArA 404 -1
zANkhARa 404 -1
zANkhaRRa 404 -1
zANkhaRA 404 -1
zANkhara 404 -1
zANkharA 404 -1
zANkhaRa 404 -1
zAJkhARa 404 -1
SaNkhaRa 404 -1
SANkhara 404 -1
SANkharA 404 -1
SANkhaRa 404 -1
SANkhaRA 404 -1
SANkARRA 404 -1
SANkARRa 404 -1
SANkAra 404 -1
SANkArA 404 -1
SANkARa 404 -1
SANkARA 404 -1
SANkhaRRa 404 -1
SANkhaRRA 404 -1
SANkhARRA 404 -1
SAJkara 404 -1
SAJkarA 404 -1
SAJkaRa 404 -1
SANkhARRa 404 -1
SANkhARA 404 -1
SANkhAra 404 -1
SANkhArA 404 -1
SANkhARa 404 -1
SANkaRRA 404 -1
SANkaRRa 404 -1
SAMkharA 404 -1
SAMkhaRa 404 -1
SAMkhaRA 404 -1
SAMkhaRRa 404 -1
SAMkhara 404 -1
SAMkARRA 404 -1
SAMkARa 404 -1
SAMkARA 404 -1
SAMkARRa 404 -1
SAMkhaRRA 404 -1
SAMkhAra 404 -1
SANkara 404 -1
SANkarA 404 -1
SANkaRa 404 -1
SANkaRA 404 -1
SAMkhARRA 404 -1
SAMkhARRa 404 -1
SAMkhArA 404 -1
SAMkhARa 404 -1
SAMkhARA 404 -1
SAJkaRA 404 -1
SAJkaRRa 404 -1
SAGkARA 404 -1
SAGkARRa 404 -1
SAGkARRA 404 -1
SAGkhara 404 -1
SAGkARa 404 -1
SAGkArA 404 -1
SAGkaRA 404 -1
SAGkaRRa 404 -1
SAGkaRRA 404 -1
SAGkAra 404 -1
SAGkharA 404 -1
SAGkhaRa 404 -1
SAGkhARa 404 -1
SAGkhARA 404 -1
SAGkhARRa 404 -1
SAGkhARRA 404 -1
SAGkhArA 404 -1
SAGkhAra 404 -1
SAGkhaRA 404 -1
SAGkhaRRa 404 -1
SAGkhaRRA 404 -1
SAGkaRa 404 -1
SAGkarA 404 -1
SAJkARRa 404 -1
SAJkARRA 404 -1
SAJkhara 404 -1
SAJkharA 404 -1
SAJkARA 404 -1
SAJkARa 404 -1
SAJkaRRA 404 -1
SAJkAra 404 -1
SAJkArA 404 -1
SAJkhaRa 404 -1
SAJkhaRA 404 -1
SAJkhARA 404 -1
SAJkhARRa 404 -1
SAJkhARRA 404 -1
SAGkara 404 -1
SAJkhARa 404 -1
SAJkhArA 404 -1
SAJkhaRRa 404 -1
SAJkhaRRA 404 -1
SAJkhAra 404 -1
SAMkArA 404 -1
SAMkAra 404 -1
SaJkhAra 404 -1
SaJkhArA 404 -1
SaJkhARa 404 -1
SaJkhARA 404 -1
SaJkhaRRA 404 -1
SaJkhaRRa 404 -1
SaJkhara 404 -1
SaJkharA 404 -1
SaJkhaRa 404 -1
SaJkhaRA 404 -1
SaJkhARRa 404 -1
SaJkhARRA 404 -1
SaGkaRRA 404 -1
SaGkAra 404 -1
SaGkArA 404 -1
SaGkARa 404 -1
SaGkaRRa 404 -1
SaGkaRA 404 -1
SaGkara 404 -1
SaGkarA 404 -1
SaGkaRa 404 -1
SaJkARRA 404 -1
SaJkARRa 404 -1
SaNkhArA 404 -1
SaNkhARa 404 -1
SaNkhARA 404 -1
SaNkhARRa 404 -1
SaNkhAra 404 -1
SaNkhaRRA 404 -1
zAnkaRa 404 -1
SaNkhaRA 404 -1
SaNkhaRRa 404 -1
SaNkhARRA 404 -1
SaJkara 404 -1
SaJkAra 404 -1
SaJkArA 404 -1
SaJkARa 404 -1
SaJkARA 404 -1
SaJkaRRA 404 -1
SaJkaRRa 404 -1
SaJkarA 404 -1
SaJkaRa 404 -1
SaJkaRA 404 -1
SaGkARA 404 -1
SaGkARRa 404 -1
SAnkhaRA 404 -1
SAnkhaRRa 404 -1
SAnkhaRRA 404 -1
SAnkhAra 404 -1
SAnkhaRa 404 -1
SAnkharA 404 -1
SAnkARA 404 -1
SAnkARRa 404 -1
SAnkARRA 404 -1
SAnkhara 404 -1
SAnkhArA 404 -1
SAnkhARa 404 -1
SAMkaRa 404 -1
SAMkaRA 404 -1
SAMkaRRa 404 -1
SAMkaRRA 404 -1
SAMkarA 404 -1
SAMkara 404 -1
SAnkhARA 404 -1
SAnkhARRa 404 -1
SAnkhARRA 404 -1
SAnkARa 404 -1
SAnkArA 404 -1
SaGkhaRRa 404 -1
SaGkhaRRA 404 -1
SaGkhAra 404 -1
SaGkhArA 404 -1
SaGkhaRA 404 -1
SaGkhaRa 404 -1
SaGkARRA 404 -1
SaGkhara 404 -1
SaGkharA 404 -1
SaGkhARa 404 -1
SaGkhARA 404 -1
SAnkaRA 404 -1
SAnkaRRa 404 -1
SAnkaRRA 404 -1
SAnkAra 404 -1
SAnkaRa 404 -1
SAnkarA 404 -1
SaGkhARRa 404 -1
SaGkhARRA 404 -1
SAnkara 404 -1
SaNkharA 404 -1
zaGkhARRA 404 -1
sAnkara 404 -1
sAnkarA 404 -1
sAnkaRa 404 -1
sAnkaRA 404 -1
saGkhARRA 404 -1
saGkhARRa 404 -1
saGkhAra 404 -1
saGkhArA 404 -1
saGkhARa 404 -1
saGkhARA 404 -1
sAnkaRRa 404 -1
sAnkaRRA 404 -1
sAnkARRA 404 -1
sAnkhara 404 -1
sAnkharA 404 -1
sAnkhaRa 404 -1
sAnkARRa 404 -1
sAnkARA 404 -1
sAnkAra 404 -1
sAnkArA 404 -1
sAnkARa 404 -1
saGkhaRRA 404 -1
saGkhaRRa 404 -1
saGkarA 404 -1
saGkaRa 404 -1
saGkaRA 404 -1
saGkaRRa 404 -1
saGkara 404 -1
saJkhARRA 404 -1
saJkhARa 404 -1
saJkhARA 404 -1
saJkhARRa 404 -1
saGkaRRA 404 -1
saGkAra 404 -1
saGkhara 404 -1
saGkharA 404 -1
saGkhaRa 404 -1
saGkhaRA 404 -1
saGkARRA 404 -1
saGkARRa 404 -1
saGkArA 404 -1
saGkARa 404 -1
saGkARA 404 -1
sAnkhaRA 404 -1
sAnkhaRRa 404 -1
sAMkhARA 404 -1
sAMkhARRa 404 -1
sAMkhARRA 404 -1
sANkara 404 -1
sAMkhARa 404 -1
sAMkhArA 404 -1
sAMkhaRA 404 -1
sAMkhaRRa 404 -1
sAMkhaRRA 404 -1
sAMkhAra 404 -1
sANkarA 404 -1
sANkaRa 404 -1
sANkARa 404 -1
sANkARA 404 -1
sANkARRa 404 -1
sANkARRA 404 -1
sANkArA 404 -1
sANkAra 404 -1
sANkaRA 404 -1
sANkaRRa 404 -1
sANkaRRA 404 -1
sAMkhaRa 404 -1
sAMkharA 404 -1
sAnkhARRa 404 -1
sAnkhARRA 404 -1
sAMkara 404 -1
sAMkarA 404 -1
sAnkhARA 404 -1
sAnkhARa 404 -1
sAnkhaRRA 404 -1
sAnkhAra 404 -1
sAnkhArA 404 -1
sAMkaRa 404 -1
sAMkaRA 404 -1
sAMkARA 404 -1
sAMkARRa 404 -1
sAMkARRA 404 -1
sAMkhara 404 -1
sAMkARa 404 -1
sAMkArA 404 -1
sAMkaRRa 404 -1
sAMkaRRA 404 -1
sAMkAra 404 -1
saJkhArA 404 -1
saJkhAra 404 -1
saMkArA 404 -1
saMkARa 404 -1
saMkARA 404 -1
saMkARRa 404 -1
saMkaRRA 404 -1
saMkaRRa 404 -1
sankhARRA 404 -1
saMkarA 404 -1
saMkaRa 404 -1
saMkaRA 404 -1
saMkARRA 404 -1
saMkhara 404 -1
saMkhAra 404 -1
saMkhArA 404 -1
saMkhARa 404 -1
saMkhARA 404 -1
saMkhaRRA 404 -1
saMkhaRRa 404 -1
saMkharA 404 -1
saMkhaRa 404 -1
saMkhaRA 404 -1
sankhARRa 404 -1
sankhARA 404 -1
sankAra 404 -1
sankArA 404 -1
sankARa 404 -1
sankARA 404 -1
sankaRRA 404 -1
sankaRRa 404 -1
sankarA 404 -1
sankaRa 404 -1
sankaRA 404 -1
sankARRa 404 -1
sankARRA 404 -1
sankhaRRA 404 -1
sankhAra 404 -1
sankhArA 404 -1
sankhARa 404 -1
sankhaRRa 404 -1
sankhaRA 404 -1
sankhara 404 -1
sankharA 404 -1
sankhaRa 404 -1
saMkhARRa 404 -1
saMkhARRA 404 -1
saJkaRA 404 -1
saJkaRRa 404 -1
saJkaRRA 404 -1
saJkAra 404 -1
saJkaRa 404 -1
saJkarA 404 -1
saNkhARRa 404 -1
saNkhARRA 404 -1
saJkara 404 -1
saJkArA 404 -1
saJkARa 404 -1
saJkhaRa 404 -1
saJkhaRA 404 -1
saJkhaRRa 404 -1
saJkhaRRA 404 -1
saJkharA 404 -1
saJkhara 404 -1
saJkARA 404 -1
saJkARRa 404 -1
saJkARRA 404 -1
saNkhARA 404 -1
saNkhARa 404 -1
saNkaRRA 404 -1
saNkAra 404 -1
saNkArA 404 -1
saNkARa 404 -1
saNkaRRa 404 -1
saNkaRA 404 -1
saNkara 404 -1
saNkarA 404 -1
saNkaRa 404 -1
saNkARA 404 -1
saNkARRa 404 -1
saNkhaRRa 404 -1
saNkhaRRA 404 -1
saNkhAra 404 -1
saNkhArA 404 -1
saNkhaRA 404 -1
saNkhaRa 404 -1
saNkARRA 404 -1
saNkhara 404 -1
saNkharA 404 -1
sANkhara 404 -1
sANkharA 404 -1
zaNkharA 404 -1
zaNkhaRa 404 -1
zaNkhaRA 404 -1
zaNkhaRRa 404 -1
zaNkhara 404 -1
zaNkARRA 404 -1
zaNkArA 404 -1
zaNkARa 404 -1
zaNkARA 404 -1
zaNkARRa 404 -1
zaNkhaRRA 404 -1
zaNkhAra 404 -1
zaJkara 404 -1
zaJkarA 404 -1
zaJkaRa 404 -1
zaJkaRA 404 -1
zaNkhARRA 404 -1
zaNkhARRa 404 -1
zaNkhArA 404 -1
zaNkhARa 404 -1
zaNkhARA 404 -1
zaNkAra 404 -1
zaNkaRRA 404 -1
zaMkhaRa 404 -1
zaMkhaRA 404 -1
zaMkhaRRa 404 -1
zaMkhaRRA 404 -1
zaMkharA 404 -1
zaMkhara 404 -1
zaMkARA 404 -1
zaMkARRa 404 -1
zaMkARRA 404 -1
zaMkhAra 404 -1
zaMkhArA 404 -1
zaNkarA 404 -1
zaNkaRa 404 -1
zaNkaRA 404 -1
zaNkaRRa 404 -1
zaNkara 404 -1
zaMkhARRA 404 -1
zaMkhARa 404 -1
zaMkhARA 404 -1
zaMkhARRa 404 -1
zaJkaRRa 404 -1
zaJkaRRA 404 -1
zaGkARRa 404 -1
zaGkARRA 404 -1
zaGkhara 404 -1
zaGkharA 404 -1
zaGkARA 404 -1
zaGkARa 404 -1
zaGkaRRa 404 -1
zaGkaRRA 404 -1
zaGkAra 404 -1
zaGkArA 404 -1
zaGkhaRa 404 -1
zaGkhaRA 404 -1
zaGkhARA 404 -1
zaGkhARRa 404 -1
sankara 404 -1
zAnkara 404 -1
zaGkhARa 404 -1
zaGkhArA 404 -1
zaGkhaRRa 404 -1
zaGkhaRRA 404 -1
zaGkhAra 404 -1
zaGkaRA 404 -1
zaGkaRa 404 -1
zaJkARRA 404 -1
zaJkhara 404 -1
zaJkharA 404 -1
zaJkhaRa 404 -1
zaJkARRa 404 -1
zaJkARA 404 -1
zaJkAra 404 -1
zaJkArA 404 -1
zaJkARa 404 -1
zaJkhaRA 404 -1
zaJkhaRRa 404 -1
zaJkhARRa 404 -1
zaJkhARRA 404 -1
zaGkara 200 -1
zaGkarA 404 -1
zaJkhARA 404 -1
zaJkhARa 404 -1
zaJkhaRRA 404 -1
zaJkhAra 404 -1
zaJkhArA 404 -1
zaMkARa 404 -1
zaMkArA 404 -1
sAJkhAra 404 -1
sAJkhArA 404 -1
sAJkhARa 404 -1
sAJkhARA 404 -1
sAJkhaRRA 404 -1
sAJkhaRRa 404 -1
sAJkhara 404 -1
sAJkharA 404 -1
sAJkhaRa 404 -1
sAJkhaRA 404 -1
sAJkhARRa 404 -1
sAJkhARRA 404 -1
sAGkaRRA 404 -1
sAGkAra 404 -1
sAGkArA 404 -1
sAGkARa 404 -1
sAGkaRRa 404 -1
sAGkaRA 404 -1
sAGkara 404 -1
sAGkarA 404 -1
sAGkaRa 404 -1
sAJkARRA 404 -1
sAJkARRa 404 -1
sANkhArA 404 -1
sANkhARa 404 -1
sANkhARA 404 -1
sANkhARRa 404 -1
sANkhAra 404 -1
sANkhaRRA 404 -1
sANkhaRa 404 -1
sANkhaRA 404 -1
sANkhaRRa 404 -1
sANkhARRA 404 -1
sAJkara 404 -1
sAJkAra 404 -1
sAJkArA 404 -1
sAJkARa 404 -1
sAJkARA 404 -1
sAJkaRRA 404 -1
sAJkaRRa 404 -1
sAJkarA 404 -1
sAJkaRa 404 -1
sAJkaRA 404 -1
sAGkARA 404 -1
sAGkARRa 404 -1
zankhaRa 404 -1
zankhaRA 404 -1
zankhaRRa 404 -1
zankhaRRA 404 -1
zankharA 404 -1
zankhara 404 -1
zankARA 404 -1
zankARRa 404 -1
zankARRA 404 -1
zankhAra 404 -1
zankhArA 404 -1
zaMkaRA 404 -1
zaMkaRRa 404 -1
zaMkaRRA 404 -1
zaMkAra 404 -1
zaMkaRa 404 -1
zankhARRA 404 -1
zankhARa 404 -1
zankhARA 404 -1
zankhARRa 404 -1
zankARa 404 -1
zankArA 404 -1
sAGkhaRRa 404 -1
sAGkhaRRA 404 -1
sAGkhAra 404 -1
sAGkhArA 404 -1
sAGkhaRA 404 -1
sAGkhaRa 404 -1
sAGkARRA 404 -1
sAGkhara 404 -1
sAGkharA 404 -1
sAGkhARa 404 -1
sAGkhARA 404 -1
zankaRA 404 -1
zankaRRa 404 -1
zankaRRA 404 -1
zankAra 404 -1
zankaRa 404 -1
zankarA 404 -1
sAGkhARRa 404 -1
sAGkhARRA 404 -1
zankara 404 -1
zAnkarA 404 -1
funderburkjim commented 7 years ago

Regarding BakzaMkAra

Error message is:

responded with a status of 414 (Request-URI Too Large)

This has to do with the GET and POST methods of HTTP communication with the server.
Currently, the system is using GET. I've read that GET requests have a length limit, but I had never run into that limit before.

Changed to POST, so now it works. There are 3840 variants generated for BakzaMkAra.
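
To illustrate why the switch matters, here is a minimal Python sketch (the live code is JS/PHP). The endpoint URL and the `keys` parameter name are made up for illustration, and the variants are placeholders. It compares the length of the GET URL with 8190 bytes, which is Apache's default LimitRequestLine; other servers use other limits.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameter name; the real ones are not shown in this thread.
ENDPOINT = "https://example.org/simple-search"

# Placeholder data: 3840 ten-letter strings standing in for the real variants.
variants = ["BakzaMkAra"] * 3840
query = urlencode({"keys": ",".join(variants)})

# GET: all the data travels in the request line, which servers cap.
get_url = ENDPOINT + "?" + query
APACHE_DEFAULT_LIMIT = 8190  # Apache's default LimitRequestLine, in bytes
get_too_long = len(get_url) > APACHE_DEFAULT_LIMIT

# POST: the same data moves into the request body, so the URL stays short.
# Building the Request object does not send anything over the network.
post_request = Request(ENDPOINT, data=query.encode("ascii"))
```

With this many variants the GET URL is far past the limit, while the POST request keeps the URL at the bare endpoint.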

funderburkjim commented 7 years ago

In this work on v0.1, I made a change that zapped the retrieval of images.

To undo that error, v0.1 currently doesn't function properly. Working on solution.

funderburkjim commented 7 years ago

I think this problem is fixed now. v0.1 seems to be working as before.

gasyoun commented 7 years ago

3840 variants

Oh my....

Server checked 3840 alternates. 200 code means found in mw. 3rd field is word frequency score (or -1 or -9)

funderburkjim commented 7 years ago

I'm working now on moving the variant-generation algorithm from js to php. That will make it easier to improve the algorithm. Clearly there are some 'impossible' spellings being generated, like aRRA (hk) = aFA (slp1) --- vowel+vowel+vowel;
and these impossibles should be discarded by the variant-generation.
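
A minimal sketch of such a filter, in Python for illustration only (the actual code is JS/PHP). It assumes candidates are already in SLP1 and rejects only runs of three or more vowels, as in the aFA example; whether two vowels in a row should also be rejected is left open. The function name is hypothetical.

```python
import re

# SLP1 vowel letters (simple vowels, vocalic r/l, diphthongs).
SLP1_VOWELS = "aAiIuUfFxXeEoO"
_VOWEL_RUN = re.compile("[" + SLP1_VOWELS + "]{3,}")


def is_plausible(slp1_word):
    """Return False if the spelling has three or more vowels in a row."""
    return _VOWEL_RUN.search(slp1_word) is None
```

For example, `is_plausible("aFA")` is False while `is_plausible("mAtf")` is True, so the first could be dropped before any database lookup.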

gasyoun commented 7 years ago

Clearly there are some 'impossible' spellings being

That would require a sandhi tool testing, I guess.

@SergeA checked matri and 48 alternates are wrong. The right mAtR was found only by http://spokensanskrit.de/index.php?beginning=0+&tinput=+matri&trans=Translate, so I guess we need to add more equations.

funderburkjim commented 7 years ago

Here's version 0.2.

matri, vishnu, krishna all found.

@SergeA @gasyoun Find some more that v0.2 misses!

drdhaval2785 commented 7 years ago

Not actually a missing one, but an enhancement. I entered punya expecting to see puRya, which I did get. But it took close to three or four seconds. I wonder why phu should be tried as an alternate for pu. Let us keep this business restricted to 'P' as in SLP1, and s/z/S. For the rest of the consonants, people don't write p for ph. This will eliminate so many sure-shot alternates.

drdhaval2785 commented 7 years ago

[image: screenshot_20170603-102933]

gasyoun commented 7 years ago

people don't write p for ph.

Hmm, so no one could ever write kapa or kafa for kapha?

drdhaval2785 commented 7 years ago

Hmm, so no one could ever write kapa or kafa for kapha?

kafa is possible. kapa never. At least never from Indian subcontinent. I am not sure about Europe or America.

gasyoun commented 7 years ago

kafa is possible

So it's something to add.

I am not sure about Europe or America.

That needs testing. We need to gather data on what people will enter. After that we can decide whether to kill it or not. I'm willing to kill it as well, but let's see what people actually enter and not what they should. This whole thing is about what people do and not what they are thought to do.

drdhaval2785 commented 7 years ago

Issue is the system is sufficiently slow. No need to slow it for virtually impossible items.

gasyoun commented 7 years ago

Issue is the system is sufficiently slow.

It is slow. But not because of a few combinations. The whole approach needs to be changed. @juhnowski any ideas why it's so slow compared to http://spokensanskrit.de/index.php?beginning=0+&tinput=+kafa&trans=Translate (and yes we will have this feature, but they have not):

kava     कव 
kSava     क्षव 
kUpa     कूप 
kab     कब् 
kahva     कह्व 
kapha     कफ 
kapi     कपि 
kaupa     कौप 
kav     कव् 
kavi     कवि 

What I like at ss.de

kapi 
kappa 
kavi 
kva 
Kap 
kApI 
kApya 
kAvya 
kSepa 
kav/i 

Is the

Now matri finds mAtR (mAtf) 200 69 as expected. But will not find mAtar nor mAtari. So what should we do? And how to find words that contain what we search? So not only exact match, but string mode as well.

adrimAtar:PW,PWG
anAmAtarjanIdvaya:PD
anAmAtarjanIyuga:PD
anAmAtarjanyagra:PD
anumAtar:SCH
aparamAtar:BHS
amAtar:PW
ayanamAtar:PWG
arTamAtar:SCH
alamAtardana:MW,PW
avantimAtar:PW
aSezamAtar:SCH
aSvamAtar:PW
asramAtar:PW
AkASamAtar:BHS
indramAtar:PW,PWG
ihehamAtar:PW,PWG,SCH
upamAtar:PW,PWG,SCH
fdDilamAtar:BHS
kandarpamAtar:PW
kARelimAtar:PW
kARelImAtar:PWG
kIwamAtar:PW
kuntImAtar:PW
ganDamAtar:PW,PWG
gomAtar:PW,PWG
gomAtara:BUR
citpramAtar:SCH
jaganmAtar:CCS,PW,PWG
jantumAtar:PW
jAmAtar:CCS,PW,PWG
trimAtar:PW,PWG
duhitAmAtar:CCS,PW
devamAtar:CCS,PW,PWG
devamAtara:PUI
dEtyamAtar:PW,PWG
dvimAtar:PW,PWG
DAnyamAtar:PW,PWG
DmAtar:CCS,PW,PWG
nAgamAtar:CCS,PW,PWG
nirmAtar:CCS,PW,PWG
parapramAtar:SCH
pfSnimAtar:PW,PWG
pramAtar:CCS,PW,PWG,SCH
prARimAtar:PW,PWG
BadramAtar:PW,PWG
BAgamAtar:PW,PWG
BizaNmAtar:PW,PWG
BuvanamAtar:PW
BUtamAtar:PW,PWG
maRqUkamAtar:PW,PWG
martyendramAtar:PW
mahAkASamAtar:BHS
mahAmAtar:PW
mahimAtaraMga:MW,PW
mAtar:CCS,PW,PWG,SCH
mAtara:PUI
mAtaraH:IEG
mAtarapitarO:MW,PW,PWG,SKD
mAtarapitf:SHS,VCP,WIL,YAT
mAtarApitf:GRA
mAtari:MW
mAtaripuruza:MW,MW72,PW,PWG,SCH
mAtaripuruzaH:AP,AP90
mAtariBvan:GRA
mAtariBvarI:MW,PW
mAtariSva:MW,MW72,PUI,PW,PWG
mAtariSvaka:MW,PW,PWG
mAtariSvan:AP,AP90,BEN,BUR,CAE,CCS,GRA,INM,MCI,MD,MW,PE,PW,PWG,SHS,STC,VCP,VEI,WIL,YAT
mAtariSvarI:MW,PW
mAtariSvA:SKD
mAtariSvAna:PUI
mAtfmAtar:PW,PWG
mAyApramAtar:SCH
muktAmAtar:PW,PWG
mfgAramAtar:BHS,SCH
yAmAtar:PW,PWG
yogamAtar:PW,PWG
yogimAtar:PW,PWG
raNgamAtar:PW,PWG
rasamAtar:PW
rAjamAtar:CCS,PW,PWG
rAhulamAtar:BHS
lokamAtar:PW,PWG,SCH
lokAnAMmAtaraH:INM
lohityAyanamAtar:PW,PWG
varRamAtar:PW,PWG
vijAmAtar:PW,PWG
vinirmAtar:PW,PWG
vimAtar:PW,PWG
viSvamAtar:PW,PWG
vIramAtar:PW,PWG
vedamAtar:PW,PWG
vEdyamAtar:PW,PWG
vyAsamAtar:PW
SakramAtar:PW,PWG
SatasahasramAtar:BHS
SUnyapramAtar:SCH
sanmAtar:PW
saptamAtar:PW,PWG
samAtar:PW,PWG
saMmAtar:CCS,PW,PWG
sarvamAtar:PW,PWG
sinDumAtar:CCS,PW,PWG
sumAtar:PW,PWG
sOmyajAmAtar:PW,PWG
skandamAtar:PW,PWG
svarRamAtar:PW,PWG
svedamAtar:PW
hatamAtar:PW,PWG
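
For the "string mode" question, a minimal Python sketch of a substring lookup. The sample data is a handful of entries from the list above plus mAtf and kaPa from MW; the dict layout and function name are assumptions, not the real Cologne data structure.

```python
# Sample SLP1 headwords mapped to the dictionaries that contain them.
HEADWORDS = {
    "mAtar": ["CCS", "PW", "PWG", "SCH"],
    "mAtari": ["MW"],
    "jAmAtar": ["CCS", "PW", "PWG"],
    "mAtfmAtar": ["PW", "PWG"],
    "mAtf": ["MW"],
    "kaPa": ["MW"],
}


def find_containing(fragment, headwords=HEADWORDS):
    """Return, sorted, every headword that contains the fragment anywhere."""
    return sorted(hw for hw in headwords if fragment in hw)
```

Here `find_containing("mAtar")` returns jAmAtar, mAtar, mAtari and mAtfmAtar, but not mAtf. A linear scan like this is fine for a sketch; a real headword list would probably want a database query or an index.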
SergeA commented 7 years ago

matri, vishnu, krishna all found.

I'd suggest also to add the spelling "krushna". Some people say it this way.

gasyoun commented 7 years ago

spelling "krushna"

Indeed, none found.

Server returns 1152 alternates. 
200 code means found in mw. 
3rd field is word frequency score (or -1 or -9)

khRusna (Kfusna) 404 -9
khRuSmA (KfuzmA) 404 -9
khRuSma (Kfuzma) 404 -9

What about sanskrit?

Should it find saMskfta (+saMskftaM) and saMskfti?

Server returns 12000 alternates. 
200 code means found in mw. 
3rd field is word frequency score (or -1 or -9)

ShaJshkhRRth (zhaYshKFT) 404 -9
ShaGskrit (zhaNskrit) 404 -9
ShaGskriT (zhaNskriw) 404 -9
drdhaval2785 commented 7 years ago

[image: 20170603_133122]

drdhaval2785 commented 7 years ago

Not having access to a computer, so I scribbled on a page and took a photo with my mobile. This mostly takes care of ITRANS, HK and SLP1. If something is left out, we can discuss and add.

gasyoun commented 7 years ago

If something is left out, we can discuss and add.

https://github.com/sanskrit-lexicon/Cologne/issues/8#issuecomment-277377696 was how it began, https://github.com/sanskrit-lexicon/Cologne/issues/8#issuecomment-94280465 is still pending (Jim, you can!), https://github.com/juhnowski/sanskrit-simple-search/blob/master/fetching.html is how it looked before Jim started.

// Each row lists HK spellings treated as interchangeable when generating alternates.
// Note that "b" and "z" each appear in two rows.
var transitionTable = [
    ["a","A"],
    ["i","I"],
    ["u","U"],
    ["r","R","RR"],
    ["l","lR","lRR"],
    ["h","H"],
    ["M","n","N","J","G"],
    ["z","S","s"],
    ["b","v"],
    ["k","kh"],
    ["g","gh"],
    ["c","ch"],
    ["j","jh"],
    ["T","Th","t","th"],
    ["D","Dh","d","dh"],
    ["p","ph"],
    ["b","bh"],
    ["sh","z"]
];
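
A minimal Python sketch of how a table like this can be expanded into alternates. This is a reconstruction for illustration, not the fetching.html code. It assumes greedy longest-match tokenization, and that a token belonging to several rows gets the union of those rows. All function names are hypothetical.

```python
from itertools import product

TRANSITION_TABLE = [
    ["a", "A"], ["i", "I"], ["u", "U"],
    ["r", "R", "RR"], ["l", "lR", "lRR"], ["h", "H"],
    ["M", "n", "N", "J", "G"], ["z", "S", "s"], ["b", "v"],
    ["k", "kh"], ["g", "gh"], ["c", "ch"], ["j", "jh"],
    ["T", "Th", "t", "th"], ["D", "Dh", "d", "dh"],
    ["p", "ph"], ["b", "bh"], ["sh", "z"],
]

# Longest tokens first, so that "kh" is matched before "k".
TOKENS = sorted({t for row in TRANSITION_TABLE for t in row}, key=len, reverse=True)


def tokenize(word):
    """Split the input into table tokens; unknown characters stay as they are."""
    tokens, i = [], 0
    while i < len(word):
        for token in TOKENS:
            if word.startswith(token, i):
                tokens.append(token)
                i += len(token)
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens


def alternatives(token):
    """Union of all rows containing the token, or the token itself if none do."""
    alts = []
    for row in TRANSITION_TABLE:
        if token in row:
            alts.extend(item for item in row if item not in alts)
    return alts or [token]


def generate_variants(word):
    """Cartesian product of the alternatives for each token."""
    choices = [alternatives(token) for token in tokenize(word)]
    return ["".join(parts) for parts in product(*choices)]
```

Under these assumptions sankara expands to 720 strings and BakzaMkAra to 3840, which matches the counts reported earlier in this thread. That is suggestive, but it does not prove the original code works this way.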
funderburkjim commented 7 years ago

Likely causes of slowness

transcoder

In v0.2, the variants are generated using HK (see table above). Database searches require SLP1. So hundreds/thousands of HK spellings must be transcoded to SLP1. This is relatively slow.

unneeded database accesses

Once we have SLP1 spellings, the current technique checks the database (MW) for every spelling variant. Database access (i/o) is relatively slow compared to computation.

We should be able to exclude certain spellings without reference to database.
We could generate a table of known 2-grams, and not bother to check database for a spelling that has a non-existent 2-gram. Such a table would have several hundred-thousand 2-grams, and be accessed by hashing process (PHP associative array). This would be much faster way to exclude such cases than database lookup.
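
A minimal sketch of the 2-gram idea in Python (the thread's plan is a PHP associative array; a Python set plays the same role here). The three headwords are SLP1 samples, and the function names are hypothetical.

```python
def bigrams(word):
    """Set of all adjacent letter pairs in the word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def build_bigram_table(headwords):
    """Collect every 2-gram that occurs in at least one known headword."""
    table = set()
    for headword in headwords:
        table |= bigrams(headword)
    return table


def has_only_known_bigrams(word, table):
    """True if no 2-gram of the word is missing from the table."""
    return bigrams(word) <= table


# Tiny sample; the real table would be built from all MW headwords.
KNOWN = build_bigram_table(["SaMkara", "saMkara", "saMkAra"])
```

With this table, `has_only_known_bigrams("SaNkara", KNOWN)` is False because "aN" and "Nk" never occur, so that spelling can be skipped without touching the database.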

funderburkjim commented 7 years ago

test suite

We need to have a test suite.
This would be a list of input spellings and output results that any technique should generate. When we vary the algorithm, we should validate that the new algorithm still passes the test suite.

Reason: a change in algorithm aimed to enhance the results could have undesired side effect of failing to solve previously solved spellings. We won't know this unless we have a test suite.
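
A minimal sketch of what such a test suite could look like, in Python for illustration. The cases are taken from spellings mentioned in this thread, with expected SLP1 headwords. `run_suite` and `stub_search` are hypothetical names; the real search function would replace the stub.

```python
# (user input, SLP1 headword that must appear among the results)
TEST_CASES = [
    ("matri", "mAtf"),
    ("vishnu", "vizRu"),
    ("krishna", "kfzRa"),
    ("krushna", "kfzRa"),
    ("kafa", "kaPa"),
    ("dukha", "duHKa"),
    ("dhaval", "Davala"),
]


def run_suite(search, cases=TEST_CASES):
    """Return the cases for which the expected headword is not found."""
    failures = []
    for user_input, expected in cases:
        if expected not in search(user_input):
            failures.append((user_input, expected))
    return failures


def stub_search(user_input):
    """Stand-in for the real algorithm: knows only two inputs."""
    known = {"matri": ["mAtf"], "vishnu": ["vizRu"]}
    return known.get(user_input, [])
```

Running `run_suite(stub_search)` lists every case except matri and vishnu as a failure. An empty list after an algorithm change would mean nothing previously solved has regressed.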

gasyoun commented 7 years ago

thousands of HK spellings must be transcoded to SLP1

Now I see why it was a bad idea. Good that it's an easy fix.

table of known 2-grams

Was thinking about the same today. And not only that. Some can be in the beginning, some only in middle and not all at the end of a word. That we should keep in mind, I guess.

new algorithm still passes the test suite

Sounds amazing, too smart for me.

funderburkjim commented 7 years ago

v0.3

v0.3 is faster due to:

It can be made faster by weeding out unknown initial 2-grams.

krushna, matar, give expected results.

Made an experiment with 'f' (kafa) --- this is odd because 'f' is not an HK letter, but is an SLP1 letter. So its handling must be different.

Will consider Dhaval's scratch sheet next.

Suggestions ?

drdhaval2785 commented 7 years ago

v0.3 seems reasonably faster.

One more python suggestion.

'if member in list' is considerably slower than 'if member in set', provided the set is built from the list once, up front.

If you use an ngram list instead of a set, converting it to a set will improve performance many times over.

drdhaval2785 commented 7 years ago

Where is the code by the way, @funderburkjim?

funderburkjim commented 7 years ago

In this case, all the code is php. I am using an associative array for the ngram lookups, which should be relatively fast.

Code is on Cologne server.

drdhaval2785 commented 7 years ago

https://stackoverflow.com/questions/13483219/what-is-faster-in-array-or-isset may be of interest for speed up

funderburkjim commented 7 years ago

I'm actually using isset($ngram['xyz']) which seems to be what stackoverflow suggests.

in_array('xyz',$ngrams) looks comparable to Python's 'xyz' in ngrams when ngrams is a list: both are slow.

What I'm doing now is ngram checking while alternates are generated. e.g., if a potential alternate starts with X and X contains bad ngrams,then any alternate XY will also have bad ngrams - hence no need to consider possible Ys.
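
A minimal Python sketch of this pruning idea (the real code is PHP and is not shown here). It assumes the alternates are built left to right from a list of per-position choices, and that a set of known 2-grams is available. The names are hypothetical.

```python
def generate_pruned(choices, known_bigrams):
    """Build alternates position by position, dropping a prefix as soon
    as it contains a 2-gram that is not in known_bigrams."""
    results = []

    def extend(prefix, position):
        if position == len(choices):
            results.append(prefix)
            return
        for alt in choices[position]:
            candidate = prefix + alt
            # Only the 2-grams that end inside the new piece need checking.
            start = max(len(prefix) - 1, 0)
            if all(candidate[i:i + 2] in known_bigrams
                   for i in range(start, len(candidate) - 1)):
                extend(candidate, position + 1)

    extend("", 0)
    return results


# Sample 2-grams taken from the SLP1 headwords SaMkara and saMkara.
KNOWN_BIGRAMS = {"Sa", "sa", "aM", "Mk", "ka", "ar", "ra"}
CHOICES = [["s", "S"], ["a", "A"], ["M", "N"], ["k"], ["a"], ["r"], ["a"]]
```

Here `generate_pruned(CHOICES, KNOWN_BIGRAMS)` yields only saMkara and SaMkara. The branch starting with "sA" is abandoned at the second letter, so none of its continuations are ever built.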

funderburkjim commented 7 years ago

v0.3a

v0.3a changes:

gasyoun commented 7 years ago

I wake up and what do I see? A dream come true.

It can be made faster by weeding out unknown initial 2-grams.

It's so quick now!

Results are impressive, Jim.

Server returns 12 alternates. 
200 code means found in mw. 
3rd field is word frequency score (or -1 or -9)

bhaj (Baj) 200 65
bhAj (BAj) 200 54
vaj (vaj) 200 -1
NF (vaJ) 404 -9
NF (vAj) 404 -9

I've done something with 'f' --- could you review this, suggest improvements if needed.

Perfect

kapha (kaPa) 200 39
kapa (kapa) 200 0
kApA (kApA) 200 -1
NF (KApA) 404 -9
NF (KApa) 404 -9

What about sanskrit? Should it find saMskfta (+saMskftaM) and saMskfti?

v0.3a gets close but finds nothing. @drdhaval2785 @funderburkjim should it find it? Should sanskrit bring us saMskfta or that's too bad input to get good results?

NF (saMskrt) 404 -9
NF (saMskft) 404 -9
NF (saMskarT) 404 -9

As of doubling, Acaryya finds exactly what it should:

Server returns 13 alternates. 
200 code means found in mw. 
3rd field is word frequency score (or -1 or -9)

Acarya (Acarya) 200 0
NF (acaryA) 404 -9
NF (acariya) 404 -9
NF (acariyA) 404 -9
NF (acaruyA) 404 -9
NF (acaruya) 404 -9
NF (acarya) 404 -9
NF (Acfya) 404 -9
NF (AcaryA) 404 -9
NF (Acariya) 404 -9
NF (AcariyA) 404 -9
NF (Acaruya) 404 -9
NF (AcaruyA) 404 -9

Entry kuw is fine

kuT (kuw) 200 1
kUT (kUw) 200 -1
kuth (kuT) 200 -1
kut (kut) 200 -1
NF (KUw) 404 -9
NF (Kuw) 404 -9
NF (kUt) 404 -9
NF (kuW) 404 -9

I would like to see kuwa in the results as well (+1 letter at the end scenario)

Compare with http://spokensanskrit.de/index.php?beginning=0+&tinput=+kuT&trans=Translate

funderburkjim commented 7 years ago

+1 letter

for word ending in consonant, also try word + vowel. That would get sanskrit.

Good idea.

I'd like 'dukha' to find 'duHKa' (SLP1).

It would be good to do some Edit Distance comparisons. e.g., given word spelling W (slp1) find all words in a given list of words (e.g. the headwords of MW) within edit distance D of W. It is clear how to do such a computation. BUT not clear how to make such a computation efficient enough to be of practical use. This sounds like a problem that should have been solved in computer science.
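
For reference, a minimal Python sketch of the straightforward computation: the classic dynamic-programming Levenshtein distance plus a linear scan with a cheap length pre-filter. The function names are hypothetical and the headword list is a tiny sample. It does not solve the efficiency question for a full headword list.

```python
def levenshtein(a, b):
    """Edit distance with insertions, deletions and substitutions, each costing 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def within_distance(word, headwords, max_distance):
    """Return sorted (distance, headword) pairs no further than max_distance."""
    hits = []
    for headword in headwords:
        # Words whose lengths differ by more than the limit cannot qualify.
        if abs(len(headword) - len(word)) > max_distance:
            continue
        distance = levenshtein(word, headword)
        if distance <= max_distance:
            hits.append((distance, headword))
    return sorted(hits)
```

For example, dukha is at distance 2 from duHKa. On the efficiency side, BK-trees and Levenshtein automata are two known techniques for this kind of lookup; whether either fits the Cologne setup would need testing.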

I'm going to let this simmer a few days before making further adjustments.

Request others to find cases where the algorithm is missing something it should get.

drdhaval2785 commented 7 years ago

I tried with 'dhaval' and expected to see 'Davala'. Did not come. It is a phenomenon known as 'schwa syncope'. Terminal 'a' is dropped under influence of local languages.

It would be better if we could also look up input + a when the input ends with a consonant. The dhaval and kuw issues would then be resolved.

drdhaval2785 commented 7 years ago

word + vowel

I would say word + a

gasyoun commented 7 years ago

I would say word + a

As a possible starting point.

funderburkjim commented 7 years ago

@gasyoun
I'm thinking that the next thing to do is to have the spelling interface with the hwnorm1 spellings. This would permit one to get the right word in AP90 for ashva, for instance (where the actual headword spelling is aSvaH - with the visarga at the end).

What do you see as a good next step?

gasyoun commented 7 years ago

What do you see as a good next step?

I would be happy to see it even as it is. What you propose is a good addition (and you know they are endless), not critical at beta testing.

funderburkjim commented 7 years ago

v0.3b hwnorm1 version

Here's a next step : v0.3b fetching.

This version accesses a (newly created) database form of hwnorm1c.

Here's output for 'ashvah':

1: azva (aSva) 200 70
    aSva:BEN,BHS,BOP,BUR,CAE,CCS,GRA,IEG,INM,MD,MW,MW72,PE,PUI,PW,PWG,SCH,SHS,STC,VCP,VEI,WIL,YAT
    aSvaH:AP,AP90,SKD
2: asva (asva) 200 4
    asva:AP,AP90,MD,MW,MW72,PW,SCH,SHS,STC,VCP,WIL,YAT
    asvaH:SKD
3: Azva (ASva) 200 0
    ASva:AP,AP90,BUR,CAE,CCS,MD,MW,MW72,PW,PWG,SHS,VCP,WIL,YAT
    ASvaM:SKD
4: Asva (Asva) 200 -1
    Asva:CCS,MD,PW

Others are encouraged to experiment.

another possibility for prioritizing results

Currently, we're using the word frequency list (this is from Marcis -- not sure where documented -- it was from DCS ?). Another possibility would be to prioritize on basis of the number of Cologne dictionaries containing the normalized spelling. Check out 'siva' as example.
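
A minimal Python sketch comparing the two orderings, using the 'ashvah' output above as data. For each normalized spelling the dictionary lists of its headword forms are merged (for instance aSva plus aSvaH). That merging rule, the record layout and the names are assumptions.

```python
CANDIDATES = [
    {"key": "aSva", "frequency": 70,
     "dictionaries": ("BEN,BHS,BOP,BUR,CAE,CCS,GRA,IEG,INM,MD,MW,MW72,PE,PUI,"
                      "PW,PWG,SCH,SHS,STC,VCP,VEI,WIL,YAT,AP,AP90,SKD").split(",")},
    {"key": "asva", "frequency": 4,
     "dictionaries": "AP,AP90,MD,MW,MW72,PW,SCH,SHS,STC,VCP,WIL,YAT,SKD".split(",")},
    {"key": "ASva", "frequency": 0,
     "dictionaries": "AP,AP90,BUR,CAE,CCS,MD,MW,MW72,PW,PWG,SHS,VCP,WIL,YAT,SKD".split(",")},
    {"key": "Asva", "frequency": -1,
     "dictionaries": "CCS,MD,PW".split(",")},
]


def by_frequency(candidates):
    """Current approach: highest word-frequency score first."""
    return sorted(candidates, key=lambda c: c["frequency"], reverse=True)


def by_dictionary_count(candidates):
    """Alternative: spellings attested in more dictionaries first."""
    return sorted(candidates, key=lambda c: len(c["dictionaries"]), reverse=True)
```

On this sample the two orderings agree on the first and last entries but swap asva and ASva, which shows the choice can change what the user sees.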

more tuning?

There are some odd alternates -- look at 'hari', 'shankara', 'karmman', 'sangama'.

Not sure whether these odd alternates require tuning.

how to integrate into a display?

Also not sure how to integrate this into a 'real' display. Welcome suggestions on this point.

Allow Devanagari or IAST in input?

It seems likely that allowing either Devanagari or IAST in input would be a modest enhancement. The program could first do a transcoding from Devanagari or IAST into HK, then proceed as if the user had typed HK. Might be able to do similar with ITRANS. If this works, then we would have a solution to 'auto-detection' of input.
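
A minimal Python sketch of the detection step only; the transcoding itself is not shown. It assumes that any character in the Devanagari Unicode block (U+0900 to U+097F) means Devanagari, that any other non-ASCII character means IAST, and that plain ASCII is treated as HK. The function name is hypothetical.

```python
def detect_script(text):
    """Guess the input scheme from the characters used."""
    if any("\u0900" <= ch <= "\u097f" for ch in text):
        return "devanagari"
    if any(ord(ch) > 127 for ch in text):
        # Diacritics such as the dots in daṇḍa are outside ASCII.
        return "iast"
    return "hk"
```

Plain ASCII input cannot be told apart this way: the same letters could be HK, SLP1 or ITRANS, so that case still needs the alternate-generation step.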

Not sure whether this is important to do now.

gasyoun commented 7 years ago

it was from DCS

Exactly.

Another possibility would be to prioritize on basis of the number of Cologne dictionaries containing the normalized spelling. Check out 'siva' as example.

As an option, indeed.

Might be able to do similar with ITRANS.

It's dead. Let it swim down the river.

a solution to 'auto-detection' of input

Hurray!

Not sure whether this is important to do now.

It is! It's much more important than converting one more out of 33 dictionaries to IAST. It's a universal UI. We have it as a playground for a few months, yet nothing in real life.

Also not sure how to integrate this into a 'real' display. Welcome suggestions on this point.

What about a list of IDs (invisible table)? Anything will do to see it in action, not just a blank page.

funderburkjim commented 7 years ago

list-0.2s implementation of simple search

list-0.2s.html is a generalization of list-0.2.html. One of the input options is 'simple'. Give it a try !

gasyoun commented 7 years ago

Give it a try !

Finally, thanks Jim! I search for aja in HK mode. It returns (in a smart way)

अजा
अ-ज
अ-जा

And that's good. What if there are many words, maybe add an index above with anchor links on the same page, so I do not have to scroll to know what are the possible variants? When I chose simple mode nothing changed, it seems all modes have become smarter.

[image: index]

As the index is above the usual interface, did not notice it. Maybe add a header above? Possible solutions,

कृ
खरु
क्रु
कॄ
करि

Jim, you have finished what I have asked for, everything, when I search for varanasi I get वराणसी. I give you my thanks. But why when I searched go I got directly to go articles without any index and no possibility to know that there is even gai? But the go case is not of mega importance, asking to understand if I have missed anything.

In a case like:

मण्डूक
मधुक
मधूक
माधुक
मधुका
माण्डूक
माधूक
नान्दुक
मादुक
मण्डुक

[image: nofit]

everything does not fit my page anymore. Thinking out loud, maybe

मण्डूक ; मधुक ; मधूक ; माधुक ; मधुका ; माण्डूक ; माधूक ; नान्दुक ; मादुक ; मण्डुक

is a solution, @drdhaval2785 ?

When I entered manduk I got WORD NOT FOUND in mw dictionary and must say http://spokensanskrit.de/index.php?beginning=0+&tinput=manduk+&trans=Translate failed as well:

mandAka     मन्दाक 
mandaka     मन्दक 
maNDukI     मण्डुकी 
madgu     मद्गु 
madhukA     मधुका 
madhuka     मधुक 
madhus     मधुस् 
madikA     मदिका 
mandAkSa     मन्दाक्ष 
mandAsu     मन्दासु 

So we do not add endings? I mean (based on real life examples), a translator of an ayurvedic book (who has never learned Sanskrit and still needs to translate a book, written by an Indian in English) finds jal instead of jala in a printed book and will never find the real word, even in our dictionaries.

funderburkjim commented 7 years ago

Why 'go' shows no list of possibilities

Question above mentioned 'gai'. 'gai' is not one of the alternates generated by 'go'. The only ones are (in SLP1) go,Go. But Go is not found in the MW dictionary (which you specified in above display), while 'go' is found. Thus, there is ONLY ONE SOLUTION.

In the situation where there is only one solution, the display does not show a list of possibilities with just one member.

Note a given citation will present different behavior depending on the dictionary. For instance, try 'guru' with

The possibilities are clickable

When there is more than one possibility, the variants are clickable. The first variant is displayed initially. Clicking on another variant shows the display for that variant.

gasyoun commented 7 years ago

In the situation where there is only one solution, the display does not show a list of possibilities with just one member.

And it's good.

Note a given citation will present different behavior depending on the dictionary. For instance, try 'guru' with

Indeed.

funderburkjim commented 7 years ago

optional ending schwa

An additional rule was added to the alternate generation to deal with the 'manduk' and 'jal' examples mentioned above. Both of these are cases where the final 'a' of Sanskrit spelling has been omitted. As Dhaval has mentioned elsewhere, this schwa deletion is common in modern Indian languages.

So, when the user input ends in a consonant, and the variants are generated, for each variant we'll add an extra variant with an extra 'a'.

This will let the manduk and jal examples generate the desired answer.

It will also cause some words to generate additional possibilities (think 'gam', which will now show 'gama' also).
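A minimal sketch of this rule, under simplifying assumptions: the function name is hypothetical, and "ends in a consonant" is approximated as "the last letter of the user input is not one of a, e, i, o, u". The real alternate generator may test this differently.

```python
def add_optional_final_a(user_input, variants):
    """If the user input ends in a consonant, follow each variant
    with an extra variant that has a final 'a' appended."""
    if not user_input or user_input[-1].lower() in "aeiou":
        # Input ends in a vowel (or is empty): nothing to add.
        return list(variants)
    extended = []
    for variant in variants:
        extended.append(variant)
        extended.append(variant + "a")
    return extended


jal_variants = add_optional_final_a("jal", ["jal"])
gam_variants = add_optional_final_a("gam", ["gam"])
```

Here 'jal' gains the variant 'jala', and 'gam' gains 'gama', which is the side effect mentioned above.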

funderburkjim commented 7 years ago

showing alternates with a dropdown menu in simple

This takes care of the problem of a long list of alternates between the citation and the display. The number of alternates is shown. This menu appears even if there is only one option (slightly different from prior version). Selecting one of the alternates from the menu changes the display to that option.

autodetection of Devanagari / IAST in simple

You can copy/paste (or type) either Devanagari unicode or IAST into the citation. A conversion will be done automatically.

Note that the output can be changed to IAST, Devanagari, or any of the other output options (previously it was hard-coded to Devanagari).
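One way such autodetection could work is sketched below. This is an assumption about the approach, not a description of the actual code: it treats any character in the Unicode Devanagari block (U+0900 to U+097F) as a sign of Devanagari, and any of the usual IAST diacritic letters as a sign of IAST. The function name and the "plain" label are hypothetical.

```python
import unicodedata

# Letters with diacritics that are typical of IAST (normalized to NFC).
IAST_MARKS = set(unicodedata.normalize("NFC", "āīūṛṝḷḹṅñṭḍṇśṣṃḥ"))


def detect_script(text):
    """Guess whether a citation is Devanagari, IAST, or plain Latin."""
    text = unicodedata.normalize("NFC", text)
    if any("\u0900" <= ch <= "\u097f" for ch in text):
        return "devanagari"
    if any(ch in IAST_MARKS for ch in text.lower()):
        return "iast"
    return "plain"
```

A citation classified as "plain" would then go through the usual simple-search alternate generation, while the other two would first be converted.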

gasyoun commented 7 years ago

it will also have some words generate additional possibilities (think 'gam', which will now show 'gama' also).

If gam is in the DB, maybe there is no need for gama? In the jal case there was no result before, unlike gam, where there was one.

pook

And I got phuka, impressed.

Selecting one of the alternates from the menu changes the display to that option.

It's good and bad at the same time. It's harder to notice (but I'll write a FAQ on how to use it anyway, since it's no longer obvious), you have to click to see what's there, and you can't copy-paste the alternates in list form.

If it is used at all, maybe make the dictionary list a dropdown as well? I remember the abbreviations (not always), and so do you and Dhaval, but what about everyone else? I guess they are abracadabra to them.

I'm thinking out loud about fine-tuning. Suppose I entered danda and actually wanted daṇḍa to be found, not dada, which is more common and therefore comes first. What if not all conversions are equal? What if we give priority to those variants where the number of letters matches?

danda

And the English-Sanskrit dictionaries do not work here? I searched for love and found nothing.

ae

I'm thinking about what we are still missing to become more popular than our clone:

kala

funderburkjim commented 7 years ago

daRqa v. dada

The display currently shows 'dada' as preferential to daRqa in simple search for 'danda'.

The preferences come from the word_frequency list.

Thus far, two layers have been uncovered in this problem:

gasyoun commented 7 years ago

Were you aware of these duplicates?

No.

do you know how to interpret these duplicates?

No.

What I wanted to say is that, even if another word's frequency is higher, we should 1st show a word that matches the number of letters. Agree?

funderburkjim commented 7 years ago

adjusting word frequency for duplicates

Of the 72933 records in word_frequency, there are 4932 words which appear more than once.

These words, along with the various frequencies, are shown in word_frequency_dups.

One way to resolve duplicates is to take the MAX frequency. This results in word_frequency_adj, which now has 67050 distinct words.

The display uses these adjusted word frequencies to order the results. This gives some definite improvement, e.g. daRqa is now the first result for 'danda', 'Siva' is now the first result for 'siva' (formerly sivA was first), etc.
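The MAX adjustment and the resulting ordering can be sketched as below. The function names are hypothetical and the frequency numbers are invented purely for illustration; they are not taken from word_frequency.

```python
def adjust_frequencies(records):
    """Collapse duplicate words by keeping the maximum frequency."""
    adjusted = {}
    for word, freq in records:
        if word not in adjusted or freq > adjusted[word]:
            adjusted[word] = freq
    return adjusted


def order_by_frequency(variants, freq):
    """Most frequent first; words missing from the list count as 0."""
    return sorted(variants, key=lambda w: freq.get(w, 0), reverse=True)


# Invented numbers: 'daRqa' appears twice with different frequencies.
records = [("dada", 40), ("daRqa", 5), ("daRqa", 120)]
adjusted = adjust_frequencies(records)
ordered = order_by_frequency(["dada", "daRqa"], adjusted)
```

With these made-up numbers, the low duplicate value would have put dada first, while taking the MAX puts daRqa first. Which duplicate value the unadjusted display actually used is not known here.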

we should 1st show a word that matches the number of letters, Agree?

Not yet. Let's first find some examples where the now corrected word_frequency ordering looks wrong.

We may need to alter the word frequency file further to take into account normalized spellings (of hwnorm1), though I am not sure of this.

funderburkjim commented 7 years ago

Refine Dictionary menu

maybe make the dictionary list dropdown as well?

One reason for using this autocomplete form has to do with the non-public dictionaries. These are not in the list of suggestions, but they may be typed in by users who know the code. This would not be possible using a normal <select><option> drop-down menu.

However, I've spent some time learning some of the finer points of the jQuery UI autocomplete widget, and it now behaves more like a drop-down menu. In particular, when you focus on the element (by clicking in it), the full list displays, and you can then select another dictionary or not. The suggestion aspect also works, so if you type a letter or two, the list is narrowed down.

@gasyoun Hope you like the change.