There are invisible characters (U+00AD, soft hyphens) in between long words, so "Mālāgandhavilepanadhāraṇamaṇḍanavibhūsanaṭṭhānā" in DN1 is split into several tokens: Mālā, gandha, vi, lepa, na, dhāra, ṇa, maṇḍa, na, vi, bhū, sa, naṭ, ṭhānā. Since the breaks don't follow sandhi rules, search results may not be precise.
In DN1 (sc id="97"), evaṃabhisamparāyā is tokenized as "evaṃa" + "bhisam" + "parāyā"; in this case you can't search for "evaṃ", "abhisamparāyā", or "samparāyā".
We insert soft hyphens so as to be able to break long words in Pali. Perhaps the means of doing this could be more sophisticated, but it works reasonably well.
I have noticed the bug you refer to, not on our main site, but on our translation app Pootle (http://pootle.suttacentral.net); see here: https://discourse.suttacentral.net/t/pootle-2-7-development-issues/2562/4?u=sujato.
There, it is definitely a problem, but I have not noticed it as a problem on the main site. Are you referring to searching on SuttaCentral, or just searching with our texts in another context? Because I get the same results for the tokenized or plain terms on the site.
So there are two distinct issues:
- Where to insert the soft-hyphens
- When to do it.
As you point out, they're not coded according to sandhi rules. Instead they just follow a simple syllable-breaking algorithm. There's no particular need to have lots of such breaks; we're not justifying text, just avoiding occasional excesses. So it would make more sense to break compounds on sandhi breaks. If you know of a script for implementing this, it would be helpful.
As for when to do it, currently these are hard-coded in our HTML files. It might be better to do it on the fly, and insert required breaks when needed via javascript. We could perhaps use something like this: https://github.com/davidmerfield/Typeset. This gets complicated, however, when doing things like resizing pages, applying the text in different contexts like ebooks or LaTeX, etc. Basically, wherever you use Pali text, long words will be a problem, and we can't expect applications to understand Pali word-breaking. On the whole, I think it may be best to keep them hard-coded. It's easier to take them out than to add them in.
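For the first issue, here is a hedged sketch of what a simple syllable-breaking pass might look like. This is not SuttaCentral's actual code; the vowel inventory, the coda rule, and the length threshold are all assumptions made for illustration.

```python
import re

# Sketch of a syllable-breaking pass: insert U+00AD after each syllable of a
# long romanized Pali word so browsers can wrap it. A syllable here is onset
# consonants + vowel (+ optional niggahīta ṃ) + an optional coda consonant
# when a cluster follows; "h" is excluded from coda position so aspirate
# digraphs (dh, ṭh, bh, ...) stay together. Assumes words end in a vowel or ṃ.
VOWELS = "aāiīuūeo"
SYLLABLE = re.compile(rf"[^{VOWELS}]*[{VOWELS}]ṃ?(?:[^{VOWELS}](?=[^{VOWELS}h]))?")

def soft_hyphenate(word, min_len=12):
    """Join syllables with soft hyphens; leave short words alone."""
    if len(word) < min_len:
        return word
    return "\u00ad".join(SYLLABLE.findall(word)) or word

broken = soft_hyphenate("Mālāgandhavilepanadhāraṇamaṇḍanavibhūsanaṭṭhānā")
print(broken.replace("\u00ad", "-"))  # make the invisible break points visible
# Mā-lā-gan-dha-vi-le-pa-na-dhā-ra-ṇa-maṇ-ḍa-na-vi-bhū-sa-naṭ-ṭhā-nā
```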
https://suttacentral.net/pi/dn1: evaṃabhisamparāyā is the 11th word in the paragraph.
Here is a list of 150K terms extracted from a Pali-Burmese dictionary: https://github.com/yapcheahshen/pced/blob/master/burmeseterms.txt. They are semantic breakups rather than morphological breakups.
I tried a brute-force check with the above list on the Tipitaka; sadly, only 50% of long words could be matched, so we might need to develop a smarter algorithm.
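Presumably the brute-force check is something like the following hedged sketch: recursively cover each compound with dictionary entries, longest match first. The tiny LEXICON here is a stand-in for the linked term list, and the code ignores sandhi transformations at the joins, which is plausibly why plain concatenation only matches about half of the long words.

```python
from functools import lru_cache

# Stand-in lexicon; in practice, load the linked burmeseterms.txt file.
LEXICON = {"mahā", "bodhi", "satta", "bodhisatta", "sammasana"}
MAX_TERM = max(len(t) for t in LEXICON)

def segment(word):
    """Cover `word` with lexicon entries (longest first); None if impossible."""
    @lru_cache(maxsize=None)
    def helper(i):
        if i == len(word):
            return []
        for j in range(min(len(word), i + MAX_TERM), i, -1):
            if word[i:j] in LEXICON:
                rest = helper(j)
                if rest is not None:
                    return [word[i:j]] + rest
        return None
    return helper(0)

print(segment("mahābodhisattasammasana"))  # ['mahā', 'bodhisatta', 'sammasana']
print(segment("evaṃabhisamparāyā"))        # None: stems missing from this toy lexicon
```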
Instead of a one-level breakup of long words, I am thinking of preserving the nested structure of long words:
mahābodhisattasammasana = mahābodhisatta + sammasana
mahābodhisatta = mahanta + bodhisatta
bodhisatta = bodhi + satta
mahābodhisattasammasana = ((mahā (bodhi satta)) sammasana)
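One way to hold that nesting, as a minimal sketch (the representation is my assumption, not an existing format): keep the breakup as a tree, here nested tuples, following the bracketing above.

```python
# Keep the nested breakup as a tree rather than a flat list.
bodhisatta = ("bodhi", "satta")
mahabodhisatta = ("mahā", bodhisatta)   # mahanta takes the form mahā in compounds
word = (mahabodhisatta, "sammasana")

def flatten(node):
    """Recompose the surface form from the compound tree."""
    if isinstance(node, str):
        return node
    left, right = node
    return flatten(left) + flatten(right)

print(flatten(word))  # mahābodhisattasammasana
```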
A complete list of stem words would open up many new possibilities, e.g. applying Huffman encoding to the text, or assigning each word a unique id, like a Strong's number.
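To make that concrete, a hedged sketch (the stem frequencies are invented for illustration): once every stem has a stable id, frequent stems can be given shorter codes, Huffman-style. The function below computes only the code lengths, which is enough to show the idea.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return Huffman code lengths per symbol; each heap merge adds one bit."""
    heap = [(n, [s]) for s, n in freqs.items()]
    heapq.heapify(heap)
    lengths = Counter()
    while len(heap) > 1:
        n1, syms1 = heapq.heappop(heap)
        n2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (n1 + n2, syms1 + syms2))
    return lengths

# Invented counts: frequent stems get shorter codes than rare ones.
freqs = {"bodhi": 900, "satta": 700, "mahā": 400, "sammasana": 30}
print(huffman_code_lengths(freqs))
# Counter({'sammasana': 3, 'mahā': 3, 'satta': 2, 'bodhi': 1})
```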
That would be great if it's doable. We are doing something similar with our ongoing revision of the Condensed Pali Dictionary: entries are broken down into successively simpler forms in much the same way as you describe. However, we have not automated this; it is being done slowly and incompletely by hand.
The search still needs some improvements; here are some quirks that I noticed as I went. I posted this list in Feb 2016, and in March 2018 I can confirm the same problems are still there!
Other enhancements.
Possible ideas to be explored: