Better handling of traditional Chinese in LanguageToolSegmenter

GoogleCodeExporter commented 9 years ago

Based on the new normalization framework, we could consider implementing a 
normalizing segmenter.

Original discussion: 
https://groups.google.com/d/topic/dkpro-core-user/vaK628jp280/discussion

Original issue reported on code.google.com by richard.eckart on 9 Dec 2014 at 11:07

GoogleCodeExporter commented 9 years ago

I looked at the code used by LanguageTool in more detail. Apparently, the 
conversion of traditional Chinese to simplified Chinese happens on a 
per-character basis. It is also possible to reverse the process; that works at 
least for the example provided by Samudra on the mailing list. I'm adding 
special handling for Chinese: if an output token from the LanguageTool segmenter 
is not found in the document text, we try to revert the conversion to simplified 
Chinese and look the token up again.
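
For illustration, here is a minimal sketch of that fallback. It is not the code committed in r3227. The class name, method names, and the tiny simplified-to-traditional table are all hypothetical; a real implementation would need a complete mapping, and would have to deal with simplified characters that correspond to more than one traditional character.

```java
import java.util.HashMap;
import java.util.Map;

public class TraditionalChineseFallback {
    // Hypothetical, deliberately tiny simplified-to-traditional table.
    // Assumption: one traditional character per simplified character,
    // which does not hold in general.
    private static final Map<Character, Character> SIMPLIFIED_TO_TRADITIONAL = new HashMap<>();
    static {
        SIMPLIFIED_TO_TRADITIONAL.put('学', '學');
        SIMPLIFIED_TO_TRADITIONAL.put('习', '習');
        SIMPLIFIED_TO_TRADITIONAL.put('欢', '歡');
        SIMPLIFIED_TO_TRADITIONAL.put('汉', '漢');
        SIMPLIFIED_TO_TRADITIONAL.put('语', '語');
    }

    // Revert a token character by character; unmapped characters are kept as-is.
    // The token length is preserved because the mapping is one-to-one.
    static String revertToTraditional(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            sb.append(SIMPLIFIED_TO_TRADITIONAL.getOrDefault(c, c));
        }
        return sb.toString();
    }

    // Return the offset of the token in the text at or after 'from', or -1.
    // The token is tried as emitted by the segmenter first; only if that
    // fails is the reverted (traditional) form looked up.
    static int locate(String text, String token, int from) {
        int begin = text.indexOf(token, from);
        if (begin < 0) {
            begin = text.indexOf(revertToTraditional(token), from);
        }
        return begin;
    }
}
```

With the document text "我喜歡學習漢語", a segmenter token "学习" is not found verbatim, but its reverted form "學習" is found at offset 3, so the token annotation can still be anchored in the original text.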

Original comment by richard.eckart on 9 Dec 2014 at 10:21

Changed title: Better handling of traditional Chinese in LanguageToolSegmenter
Changed state: Accepted
Added labels: Milestone-1.8.0, DKPro-ASL, Module-languagetool

GoogleCodeExporter commented 9 years ago

This issue was updated by revision r3227.

- Try to handle tokens that were converted from traditional Chinese to simplified 
Chinese by LanguageTool

Original comment by richard.eckart on 9 Dec 2014 at 10:24

GoogleCodeExporter commented 9 years ago

This issue was closed by revision r3227.

Original comment by richard.eckart on 9 Dec 2014 at 10:24

Changed state: Fixed

google-code-export / dkpro-core-asl

Better handling of traditional Chinese in LanguageToolSegmenter #559