suttacentral / legacy-suttacentral

Source code and related files (CSS, images, etc.) for SuttaCentral
http://suttacentral.net/

Dictionaries #98

Open sujato opened 9 years ago

sujato commented 9 years ago

The popup dictionaries for SC are one of our best features, and they deserve improvement and development.

We have some additional dictionaries to include: we already have Pali>Chinese, and Pali>Portuguese will be ready in a few months.

In addition, we should consider integrating the wider range of dictionaries that are out there. The best way to go about this, I think, would be to include the dictionaries from https://www.gandhari.org/n_dictionary.php

They have several dictionaries, and as they are compiling their own Gandhari dictionary they are, I think, the most reliable source. It would be nice if we could develop an auto-extractor to pull in their texts. However, the same caveat applies as with their texts: they are planning to release a revamped site, so it is best to wait for now.

Note that they have a version of the PTS dictionary. This has been proofread and is in some respects cleaner than ours. However, I am not sure it is reliable: see the entry for kāla and compare it with ours. Theirs is missing a lot of text.

We should increase the rate of successful lookups. Currently it is about 80-85%. With some careful coding I think we could get it up to 95%. Perhaps we could tweak it with user submissions or hand editing.

yapcheahshen commented 9 years ago

1) I got "Uncaught TypeError: self.setLang is not a function" in the devTools console when trying to load the Pali to English lookup for http://suttacentral.net/pi/dn1

2) Do you break long Pali terms into smaller units? E.g., does SuttaCentral have a JavaScript module to break kusalākusalasāvajjānavajjasevitabbāsevitabbahīnapaṇītakaṇhasukkasappaṭibhāgānaṃ into kusala akusala sāvajja anavajja sevitabba asevitabba hīna paṇīta kaṇha sukka sappaṭi bhāgānaṃ automatically?

sujato commented 9 years ago

> 1) I got "Uncaught TypeError: self.setLang is not a function" in devTools console when trying to load Pali to English Lookup for http://suttacentral.net/pi/dn1

Thanks, we'll look into this.

> 2) do you break long Pali terms into smaller unit?

Yes, we do. The parser is pretty clever: it can sometimes recognize sandhi, but it still fails too often.
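For illustration only, here is a minimal Python sketch of what "recognizing sandhi" can mean at a single compound boundary. It is not the SuttaCentral parser (which lives in the JavaScript linked later in this thread); `split_once` and the `headwords` set are hypothetical names, and the only sandhi rule handled is a + a merging into ā.

```python
import unicodedata

LONG_A = "\u0101"  # precomposed "ā"


def split_once(word, headwords):
    """Return every (left, right) pair of known headwords that joins into `word`.

    `headwords` is an assumed set of dictionary entries; a real dictionary
    would supply it. Only one sandhi rule is undone: a + a -> ā.
    """
    # Normalize so that "a" + combining macron becomes the single character "ā".
    word = unicodedata.normalize("NFC", word)
    results = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        # Plain split: no sound change at the boundary.
        if left in headwords and right in headwords:
            results.append((left, right))
        # Sandhi split: treat "ā" at position i as a final "a" plus an initial "a".
        if word[i] == LONG_A:
            left_s = word[:i] + "a"
            right_s = "a" + word[i + 1:]
            if left_s in headwords and right_s in headwords:
                results.append((left_s, right_s))
    return results


if __name__ == "__main__":
    print(split_once("kusal\u0101kusala", {"kusala", "akusala"}))
```

With the first two members of the long compound above, this recovers kusala + akusala from kusalākusala. Every extra rule of this kind also widens the room for false matches, which is one reason such a parser can still fail often.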

yapcheahshen commented 9 years ago

Please provide a link to the parser and, if possible, a list of words that the parser fails on.

sujato commented 9 years ago

As far as I know, we haven't tested the failed words. The bigger problem is the false positives, of which there are many. The parser in Yuttadhammo's DPR is much better, but uses much, much more code.

The code should be here, but Blake will need to help you with specifics: https://github.com/suttacentral/suttacentral/tree/master/static/js

yapcheahshen commented 9 years ago

Here is a list of 150K words and their decompositions, extracted from the Burmese Pali Dictionary found in PCED (a widely used Pali-Chinese dictionary program): https://github.com/yapcheahshen/pced/blob/master/burmeseterms.txt Maybe it is useful for finding the false positives?
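As a rough sketch of how a list like this could be used to find false positives: compare whatever the parser returns against the reference decomposition for each word. The function name `evaluate`, the dict-based `reference` structure, and the convention that a parser returns `None` when it gives up are all assumptions made for this example; the actual layout of burmeseterms.txt would need its own loader.

```python
def evaluate(parser, reference):
    """Score `parser` against reference decompositions.

    parser:    callable taking a word and returning a list of parts,
               or None when it cannot break the word up (assumed convention).
    reference: dict mapping each word to its expected list of parts
               (an assumed in-memory form of the decomposition list).
    """
    counts = {"correct": 0, "false_positive": 0, "failed": 0}
    false_positives = []
    for word, expected in reference.items():
        got = parser(word)
        if got is None:
            # The parser admitted defeat: a failed lookup, not a wrong one.
            counts["failed"] += 1
        elif list(got) == list(expected):
            counts["correct"] += 1
        else:
            # The parser answered, but wrongly: this is the false positive case.
            counts["false_positive"] += 1
            false_positives.append((word, list(got), list(expected)))
    return counts, false_positives


if __name__ == "__main__":
    toy_reference = {"ab": ["a", "b"], "cd": ["c", "d"], "ef": ["e", "f"]}
    toy_answers = {"ab": ["a", "b"], "cd": ["cd"]}
    counts, wrong = evaluate(toy_answers.get, toy_reference)
    print(counts)
    print(wrong)
```

Keeping "failed" separate from "false_positive" matters here, since the two call for different fixes: failures need more coverage, while false positives need stricter matching.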

sujato commented 9 years ago

Very much so, thanks, I'll ask @blake to look at this.

blake-sc commented 9 years ago

Pali lookup works again. That decomposition list looks pretty useful. Machine parsing will always be limited, and that kind of data is often best used as a lookup table for how to break up a compound. Updating the compound breaker-upper has been on the to-do list for a long time. If you use the Chinese lookup, you'll see it deals with compounds very gracefully by giving multiple possible breakdowns (or, in the case of Chinese, multiple different compounds of the glyphs). With Pali we can't do anything quite so graceful, but we can take a similar approach of showing all possible matches rather than a best guess.
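A minimal Python sketch of that approach, under stated assumptions: consult a table of known decompositions first, and only fall back to enumerating every way the word can be cut into dictionary headwords. The names `break_compound`, `all_breakdowns`, `known` and `headwords` are hypothetical, and sandhi is ignored entirely to keep the example short.

```python
def all_breakdowns(word, headwords, memo=None):
    """Return every way to write `word` as a sequence of entries in `headwords`."""
    if memo is None:
        memo = {}
    if word == "":
        return [[]]  # one way to break up nothing: the empty breakdown
    if word in memo:
        return memo[word]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in headwords:
            # Extend each breakdown of the remainder with this head.
            for rest in all_breakdowns(word[i:], headwords, memo):
                results.append([head] + rest)
    memo[word] = results
    return results


def break_compound(word, known, headwords):
    """Prefer a curated decomposition; otherwise show all possible matches."""
    if word in known:
        return [known[word]]
    return all_breakdowns(word, headwords)


if __name__ == "__main__":
    headwords = {"a", "ab", "bc", "c"}
    print(break_compound("abc", {}, headwords))
    print(break_compound("abc", {"abc": ["ab", "c"]}, headwords))
```

The number of breakdowns can grow quickly for long words with many short headwords, so a real interface would probably want to rank or cap the list before showing it.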