suttacentral / legacy-suttacentral

Source code and related files (CSS, images, etc.) for SuttaCentral
http://suttacentral.net/

Search optimizing #152

Closed · sujato closed this 6 years ago

sujato commented 8 years ago

The search still needs some improvements; here are some quirks I've noticed as I go. I posted this list in February 2016, and in March 2018 I can confirm the same problems are still there!

  1. A search for uppādetvā gives a few exact matches, then inexact ones, then exact ones again. This might make sense if the inexact matches were in more important texts, but that is not the case. Prioritize exact matches.
  2. A search for uppadetva gives only dictionary entries. The version without diacriticals should give wider results, not narrower ones (see the folding sketch after this list).
  3. Searches for uppadetv or uppādetv give no results. Match parts of a word, especially when there are no full matches.
  4. https://suttacentral.net/search?query=digham+va+assasanto yields a lot of entries for "vā". The top sutta entry is the Metta Sutta! Ironically, though, searching for "Metta Sutta" gives a bunch of results for "sutta", but not the Metta Sutta itself. Prioritize the full phrase, and when falling back to individual words, filter out the small ones.
  5. jīvitasaṅkhāra gives only a dictionary result, while jīvitasaṅkhār* gives the correct results.
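
As a hedged illustration of what items 1–3 call for, here is a minimal diacritic-folding sketch in plain Python (the function name is hypothetical; this is not the actual SuttaCentral indexer). Indexing both the exact and the folded form of each token, and ranking exact hits above folded ones, would let uppadetva match uppādetvā while still prioritizing exact matches; a prefix query over the same folded tokens would handle uppādetv.

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Fold Pali diacritics: 'uppādetvā' -> 'uppadetva'.

    NFD decomposition separates base letters from combining marks
    (macrons, underdots, overdots), which are then dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

assert fold_diacritics("uppādetvā") == "uppadetva"
assert fold_diacritics("jīvitasaṅkhāra") == "jivitasankhara"
```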

Other enhancements:

  1. Enable limiting search by text or division.
  2. Enable literal string searches.
  3. Provide more weighting for EBTs (early Buddhist texts).
  4. Integrate an option for using Google site search.

Possible ideas to be explored:

  1. An interface for using the dictionaries more directly.
  2. Can we pull in results from the CPD (Critical Pali Dictionary)? (http://pali.hum.ku.dk/cpd/search.html)

yapcheahshen commented 8 years ago

There are invisible characters (U+00AD, the soft hyphen) in between long words; therefore "Mālā­gandha­vi­lepa­na­dhāra­ṇa­maṇḍa­na­vi­bhū­sa­naṭ­ṭhānā" in DN1 will be split into several tokens: Mālā, gandha, vi, lepa, na, dhāra, ṇa, maṇḍa, na, vi, bhū, sa, naṭ, ṭhānā. If the break-up doesn't follow sandhi rules, the search results might not be precise.

yapcheahshen commented 8 years ago

In DN1, sc id="97", evaṃa­bhisam­parāyā is tokenized to "evaṃa" + "bhisam" + "parāyā"; in this case, you can't search for "evaṃ", "abhisamparāyā", or "samparāyā".
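
A minimal illustration, in plain Python with a naive stand-in tokenizer (not the actual search pipeline), of why stripping U+00AD before indexing restores the searchable word:

```python
import re

SOFT_HYPHEN = "\u00ad"  # invisible; only affects line breaking

def tokenize(text: str) -> list[str]:
    """A naive word tokenizer standing in for the search engine's analyzer."""
    return re.findall(r"\w+", text)

raw = "evaṃa\u00adbhisam\u00adparāyā"
print(tokenize(raw))
# ['evaṃa', 'bhisam', 'parāyā']  -- the fragments reported above

clean = raw.replace(SOFT_HYPHEN, "")
print(tokenize(clean))
# ['evaṃabhisamparāyā']          -- one searchable token
```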

sujato commented 8 years ago

We insert soft hyphens so as to be able to break long words in Pali. Perhaps the means of doing this could be more sophisticated, but it works reasonably well.

I have noticed the bug you refer to, not on our main site, but on our translation app Pootle (http://pootle.suttacentral.net); see here: https://discourse.suttacentral.net/t/pootle-2-7-development-issues/2562/4?u=sujato

There, it is definitely a problem, but I have not noticed it as a problem on the main site. Are you referring to searching on SuttaCentral, or just searching within our texts in another context? On the site itself, I get the same results for tokenized and plain terms.

So there are two distinct issues:

  1. Where to insert the soft-hyphens
  2. When to do it.

As you point out, they're not placed according to sandhi rules; instead they just follow a simple syllable-breaking algorithm. There's no particular need to have lots of such breaks; we're not justifying text, just avoiding occasional excesses. So it would make more sense to break compounds at sandhi boundaries. If you know of a script for implementing this, it would be helpful.
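
For concreteness, a rough sketch of what such a simple syllable-breaking inserter might look like; this is hypothetical code, not the actual SuttaCentral implementation, and the vowel set, syllable pattern, and length threshold are all assumptions:

```python
import re

SOFT_HYPHEN = "\u00ad"
VOWELS = "aāiīuūeo"
# Rough syllable: optional consonant cluster + one vowel, optionally
# closed by the niggahīta (ṃ). A simplification -- not sandhi-aware.
SYLLABLE = re.compile(rf"[^{VOWELS}]*[{VOWELS}]ṃ?")

def soft_hyphenate(word: str, min_length: int = 12) -> str:
    """Insert soft hyphens between syllables, but only in long words."""
    if len(word) < min_length:
        return word
    syllables = SYLLABLE.findall(word)
    if "".join(syllables) != word:  # pattern failed to cover the word
        return word
    return SOFT_HYPHEN.join(syllables)

print(soft_hyphenate("mālāgandhavilepanadhāraṇamaṇḍanavibhūsanaṭṭhānā"))
```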

As for when to do it: currently the breaks are hard-coded in our HTML files. It might be better to do it on the fly, inserting required breaks when needed via JavaScript; we could perhaps use something like this: https://github.com/davidmerfield/Typeset. This gets complicated, however, when doing things like resizing pages, or using the text in different contexts like ebooks or LaTeX. Basically, wherever you use Pali text, long words will be a problem, and we can't expect applications to understand Pali word-breaking. On the whole, I think it may be best to keep them hard-coded. It's easier to take them out than to add them in.

yapcheahshen commented 8 years ago

On https://suttacentral.net/pi/dn1, evaṃa­bhisam­parāyā is the 11th word in the paragraph.

Here is a list of 150K terms extracted from a Pali-Burmese dictionary: https://github.com/yapcheahshen/pced/blob/master/burmeseterms.txt. They are semantic break-ups rather than morphological break-ups.

I tried a brute-force check of the above list against the Tipitaka; sadly, only 50% of long words can be found. We might need to develop a smarter algorithm.

Instead of a one-level break-up of long words, I am thinking of preserving their nested structure, for example:

mahābodhisattasammasana = mahābodhisatta + sammasana
mahābodhisatta = mahanta + bodhisatta
bodhisatta = bodhi + satta

mahābodhisattasammasana = ((mahā (bodhi satta)) sammasana)

A complete list of stem words would open up many new possibilities, e.g., applying Huffman encoding to the text, or assigning each word a unique ID, like a Strong's number.
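
As a hedged sketch of such a nested break-up, here is a recursive splitter over a plain term set (hypothetical code; it ignores sandhi changes at the joins, which is presumably part of why the brute-force check above resolves only about 50% of long words):

```python
from functools import lru_cache

def load_terms(path: str) -> frozenset:
    """One term per line, e.g. the burmeseterms.txt list linked above."""
    with open(path, encoding="utf-8") as f:
        return frozenset(line.strip() for line in f if line.strip())

@lru_cache(maxsize=None)
def split_compound(word: str, terms: frozenset):
    """Return a nested binary split of `word` into known terms, or None.

    Sandhi changes at the joins (e.g. mahā from mahanta) are not
    modelled, so surface forms that differ from term + term won't split.
    """
    if word in terms:
        return word
    for i in range(1, len(word)):
        left = split_compound(word[:i], terms)
        if left is None:
            continue
        right = split_compound(word[i:], terms)
        if right is not None:
            return (left, right)
    return None

terms = frozenset({"mahā", "bodhi", "satta", "sammasana"})
print(split_compound("mahābodhisattasammasana", terms))
# ('mahā', ('bodhi', ('satta', 'sammasana')))
```

The result is one binary bracketing; choosing among multiple possible bracketings, and matching derived stems like mahā from mahanta, would need scoring or sandhi-aware matching on top of this.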


sujato commented 8 years ago

That would be great if it's doable. We are doing something similar with our ongoing revision of the Condensed Pali Dictionary: entries are broken down into successively simpler forms in much the same way as you describe. However, we have not automated this; it is being done slowly and incompletely by hand.