studioego / cjklib

Automatically exported from code.google.com/p/cjklib

CEDICT reading problem #18

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. import cjklib.dictionary
2. d = cjklib.dictionary.CEDICT(databaseUrl='sqlite:////path/to/your/cedict.db')
3. d.getAll()

The call above should return all entries in the CEDICT database. Instead, an 
AttributeError is raised while formatting this record:
卡拉OK|卡拉OK|ka3 la1 O K|/karaoke (loanword)/

The problem is that the reading is not standard Pinyin. 
SingleColumnAdapter.format therefore returns None, and 
NonReadingEntityWhitespace.format raises the exception when it tries to call 
split on None.
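
The failure chain described above can be sketched without cjklib. Everything in the block below is a hypothetical stand-in (the function names, the ConversionError class, and the "tone digit" check are assumptions for illustration, not cjklib's actual code); it only shows how a formatter that swallows a conversion error and returns None makes the next formatter fail on split.

```python
class ConversionError(Exception):
    """Stand-in for the conversion errors raised by a reading converter."""


def convert_reading(reading):
    """Hypothetical converter: rejects syllables without a trailing tone digit."""
    for syllable in reading.split(' '):
        if not syllable[-1].isdigit():
            raise ConversionError(syllable)
    return reading


def format_reading(reading):
    """Mimics the reported behaviour: swallow the error and return None."""
    try:
        return convert_reading(reading)
    except ConversionError:
        return None


def format_whitespace(reading):
    """Mimics a later formatter that assumes it always receives a string."""
    return ' '.join(reading.split(' '))


def format_chain(reading):
    """Apply both formatters in sequence, as a formatting pipeline would."""
    return format_whitespace(format_reading(reading))
```

With these stand-ins, format_chain('ni3 hao3') passes through unchanged, while format_chain('ka3 la1 O K') raises AttributeError because None has no split method.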

The problem exists in the SVN trunk version (rev. 446). I am using Ubuntu Linux 
11.04.1 LTS.

I suggest either fixing such records in the installcjkdict script or fixing the 
formatter in the dictionary module so that it can handle such records. My hotfix:

(line 126):
    def format(self, string):
        toReading = self.toReading or self.fromReading
        try:
            return self._readingFactory.convert(string, self.fromReading,
                toReading, sourceOptions=self.sourceOptions,
                targetOptions=self.targetOptions)
        except (exception.DecompositionError, exception.CompositionError,
            exception.ConversionError):
            # Hotfix: fall back to the unconverted string when conversion
            # fails, so that later formatters never receive None.
            return string
            # Previously: return None

Original issue reported on code.google.com by caj...@gmail.com on 3 Oct 2012 at 9:37

GoogleCodeExporter commented 9 years ago
It seems that the PinyinOperator treats 'O' as a Pinyin entity and complains 
that no tonal information is available. This causes the conversion to fail, 
which results in a None value.

Your fix would be an improvement, but we should really be fixing the conversion itself.

I tried to tell the reading conversion to ignore the "invalid" characters. That 
should be possible by adding 'missingToneMark': 'ignore' to the converter 
settings. However, this breaks another part of the software, because two 
different code paths share the same reading converter instance. More precisely, 
the "search by reading" component (TonelessWildcardReading) needs a reading 
conversion that supports missing tones, which is exactly the behaviour the 
setting above would change by ignoring syllables without tone marks. The 
solution would be to separate the two paths, but that needs a bit more time.
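
A rough sketch of what separating the two paths could look like, assuming each path gets its own converter configured independently. ToyToneConverter is a hypothetical stand-in and not cjklib's API; the 'ignore' value comes from the setting mentioned above, while the 'noinfo' name for "accept the syllable with unknown tone" is an assumption made for this illustration.

```python
class ToyToneConverter:
    """Hypothetical stand-in for a reading converter; not cjklib's API."""

    def __init__(self, missing_tone_mark):
        # 'ignore': leave untoned tokens untouched (display path).
        # 'noinfo': keep untoned tokens as syllables with unknown tone
        #           (search path). This mode name is an assumption.
        if missing_tone_mark not in ('ignore', 'noinfo'):
            raise ValueError('unsupported mode: %r' % (missing_tone_mark,))
        self.missing_tone_mark = missing_tone_mark

    def convert(self, reading):
        """Split a reading into (syllable, tone) pairs or raw strings."""
        entities = []
        for token in reading.split():
            if token[-1].isdigit():
                # Toned syllable, e.g. 'ka3' becomes ('ka', 3).
                entities.append((token[:-1], int(token[-1])))
            elif self.missing_tone_mark == 'noinfo':
                # Untoned syllable kept as a reading entity with no tone.
                entities.append((token, None))
            else:
                # Untoned token passed through as a non-reading entity.
                entities.append(token)
        return entities


# One instance per code path, so changing one cannot break the other.
display_converter = ToyToneConverter('ignore')
search_converter = ToyToneConverter('noinfo')
```

In this toy setup, display_converter.convert('ka3 la1 O K') leaves 'O' and 'K' untouched, while search_converter.convert('ni hao3') still treats the untoned 'ni' as a syllable, so neither path depends on the other's settings.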

Will keep that on my radar. Feel free to have a go at this yourself.

Original comment by christop...@gmail.com on 3 Oct 2012 at 5:57