CEDICT reading problem

studioego / cjklib

Automatically exported from code.google.com/p/cjklib

What steps will reproduce the problem?
1. import cjklib.dictionary
2. d = cjklib.dictionary.CEDICT(databaseUrl='sqlite:////path/to/your/cedict.db')
3. d.getAll()

The call above should return all entries in the CEDICT database. However, an 
AttributeError exception is raised while formatting this record:
卡拉ＯＫ|卡拉ＯＫ|ka3 la1 O K|/karaoke (loanword)/

The problem is that the reading is not standard Pinyin. 
SingleColumnAdapter.format therefore returns None, and 
NonReadingEntityWhitespace.format raises the exception when it tries to call 
the split method on None.
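
The failure chain described above can be sketched without cjklib. Everything in this block is a hypothetical stand-in (the function names, the exception class, and the "tone digit" check are assumptions for illustration, not the library's code); it only shows how a formatter that returns None makes a later split call fail:

```python
# Stand-ins for illustration only; these are NOT the cjklib classes.
class ConversionError(Exception):
    """Hypothetical error raised when a reading cannot be converted."""


def convert_reading(reading):
    """Hypothetical converter: rejects syllables without a trailing tone digit."""
    for syllable in reading.split(' '):
        if not syllable[-1].isdigit():
            raise ConversionError("no tone for %r" % syllable)
    return reading


def column_format(reading):
    """Mimics a formatter that swallows the error and returns None."""
    try:
        return convert_reading(reading)
    except ConversionError:
        return None


def whitespace_format(reading):
    """Mimics a later formatter that assumes it always receives a string."""
    return ' '.join(reading.split())


def reproduce(reading):
    """Run both formatters in sequence and report an AttributeError as text."""
    try:
        return whitespace_format(column_format(reading))
    except AttributeError as error:
        return "AttributeError: %s" % error
```

Under these assumptions, reproduce('ni3 hao3') passes through both steps, while reproduce('ka3 la1 O K') reports an AttributeError because the second step calls split on None.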

The problem exists in the SVN trunk version (Rev: 446). I am using Ubuntu Linux 
11.04.1 LTS.

I suggest either fixing such records in the installcjkdict script or fixing the 
formatter in the dictionary module so that it can handle such records. My hotfix:

(line 126):
    def format(self, string):
        toReading = self.toReading or self.fromReading
        try:
            return self._readingFactory.convert(string, self.fromReading,
                toReading, sourceOptions=self.sourceOptions,
                targetOptions=self.targetOptions)
        except (exception.DecompositionError, exception.CompositionError,
            exception.ConversionError):
            # Hotfix: fall back to the unconverted string; the original
            # code returned None here.
            return string
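
The effect of this kind of fallback can be tried in isolation. The block below is a minimal sketch with hypothetical stand-ins (strict_convert and ConversionError here are assumptions, not the cjklib converter or its exception module); it only illustrates that returning the input string keeps later string operations working:

```python
class ConversionError(Exception):
    """Stand-in for the conversion exceptions caught by the hotfix."""


def strict_convert(reading):
    """Hypothetical converter: every syllable must end in a tone digit."""
    for syllable in reading.split():
        if not syllable[-1].isdigit():
            raise ConversionError(syllable)
    return reading


def format_with_fallback(reading):
    """Return the converted reading, or the input unchanged on failure."""
    try:
        return strict_convert(reading)
    except ConversionError:
        # Fallback: hand back the raw string so callers still get a str.
        return reading
```

With this sketch, format_with_fallback('ka3 la1 O K') returns the reading unchanged, so a later split call has a string to work on. The trade-off is that unconverted readings pass through silently.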

Original issue reported on code.google.com by caj...@gmail.com on 3 Oct 2012 at 9:37

It seems that the PinyinOperator treats 'O' as a Pinyin entity and complains that no tonal information is available. This causes the conversion to fail, which results in a None value. Your fix would be an improvement, but really we should be fixing the conversion.

What I tried to do was to tell the reading conversion to ignore the "invalid" characters. That should be solvable by adding 'missingToneMark': 'ignore' to the converter settings. However, this breaks another part of the software, because two different code paths use the same reading converter instance. More precisely, the "search by reading" component (TonelessWildcardReading) needs a reading conversion that supports missing tones, which is exactly the behavior the setting above would change by ignoring syllables without tone marks.

The solution here would be to separate the two paths, but that needs a bit more time. I will keep that on my radar. Feel free to have a go at this yourself.
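
The idea of separating the two paths can be sketched as giving each path its own converter instance with its own options. This is only an illustration under stated assumptions: ToyConverter, its entities method, and the 'toneless' value are hypothetical and not part of cjklib; only the option name and the 'ignore' value are taken from the comment above.

```python
class ToyConverter:
    """Hypothetical converter that carries its own options (not cjklib's API)."""

    def __init__(self, missing_tone_mark):
        self.missing_tone_mark = missing_tone_mark

    def entities(self, reading):
        """Split a reading into (text, is_syllable) pairs."""
        result = []
        for token in reading.split():
            if token[-1].isdigit():
                result.append((token, True))    # syllable with a tone digit
            elif self.missing_tone_mark == 'ignore':
                result.append((token, False))   # treated as non-reading text
            else:
                result.append((token, True))    # tone-less syllable is kept
        return result


# One instance per code path, so changing one setting cannot affect the other.
display_converter = ToyConverter(missing_tone_mark='ignore')
search_converter = ToyConverter(missing_tone_mark='toneless')  # hypothetical value
```

In this sketch the display path can ignore 'O' and 'K' in 'ka3 la1 O K', while the search path still accepts a tone-less query such as 'ni hao3' as two syllables.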

CEDICT reading problem #18