google-code-export / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl

Better handling of traditional Chinese in LanguageToolSegmenter #559

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Based on the new normalization framework, we could consider implementing a 
normalizing segmenter.

Original discussion: 
https://groups.google.com/d/topic/dkpro-core-user/vaK628jp280/discussion

Original issue reported on code.google.com by richard.eckart on 9 Dec 2014 at 11:07

GoogleCodeExporter commented 9 years ago
I looked at the code used by LanguageTool in more detail. Apparently, the 
conversion of traditional Chinese to simplified Chinese happens on a 
per-character basis. It is also possible to reverse the process; that works at 
least for the example provided by Samudra on the mailing list. I'm adding 
special handling for Chinese: if an output token from the LanguageTool segmenter 
is not found in the document text, the segmenter tries to revert the 
transformation to simplified Chinese and looks for the token again.
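
The fallback described above could be sketched roughly as follows. This is not the code from r3227; the class name `TraditionalFallback`, the methods `toTraditional` and `locate`, and the three-entry mapping table (汉/漢, 语/語, 学/學) are hypothetical and only serve to illustrate the idea. It also assumes a one-to-one character mapping, which does not hold for Chinese in general. The characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.HashMap;
import java.util.Map;

class TraditionalFallback {
    // Hypothetical, tiny simplified -> traditional table, for illustration only.
    private static final Map<Character, Character> SIMPLIFIED_TO_TRADITIONAL = new HashMap<>();
    static {
        SIMPLIFIED_TO_TRADITIONAL.put('\u6C49', '\u6F22'); // han
        SIMPLIFIED_TO_TRADITIONAL.put('\u8BED', '\u8A9E'); // yu
        SIMPLIFIED_TO_TRADITIONAL.put('\u5B66', '\u5B78'); // xue
    }

    // Revert a token character by character; unmapped characters are kept as-is.
    static String toTraditional(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            Character mapped = SIMPLIFIED_TO_TRADITIONAL.get(c);
            sb.append(mapped != null ? mapped.charValue() : c);
        }
        return sb.toString();
    }

    // Find the token in the text starting at 'from'. If the token as emitted by
    // the segmenter is not found, retry with the reverted (traditional) form.
    // Returns the start offset, or -1 if neither form occurs.
    static int locate(String text, String token, int from) {
        int pos = text.indexOf(token, from);
        if (pos < 0) {
            pos = text.indexOf(toTraditional(token), from);
        }
        return pos;
    }

    public static void main(String[] args) {
        // Traditional document text: wo xue han-yu
        String text = "\u6211\u5B78\u6F22\u8A9E";
        // Simplified tokens, as a converting segmenter might emit them.
        String[] tokens = { "\u6211", "\u5B66", "\u6C49\u8BED" };
        int offset = 0;
        for (String token : tokens) {
            int begin = locate(text, token, offset);
            System.out.println(begin + "-" + (begin + token.length()));
            offset = begin + token.length();
        }
    }
}
```

Because the conversion is per-character, a reverted token has the same length as the emitted one, so the offsets computed this way line up with the original document text.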

Original comment by richard.eckart on 9 Dec 2014 at 10:21

GoogleCodeExporter commented 9 years ago
This issue was updated by revision r3227.

- Try to handle tokens that were converted from traditional Chinese to simplified 
Chinese by LanguageTool

Original comment by richard.eckart on 9 Dec 2014 at 10:24

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r3227.

Original comment by richard.eckart on 9 Dec 2014 at 10:24