twitter / twitter-korean-text

Korean tokenizer
Apache License 2.0

Detokenizer throws exception with certain inputs #90

Closed (laeubli closed this issue 8 years ago)

laeubli commented 8 years ago

I've come across a minor bug in the newly added detokenization routine, where some inputs result in a java.lang.UnsupportedOperationException. Example:

com.twitter.penguin.korean.TwitterKoreanProcessor.detokenize(List("이", "제품을", "사용하겠습니다"))
// throws java.lang.UnsupportedOperationException: empty.init

It seems like this could be easily fixed by initialising the output list differently. For now, I'm working around the problem by always prepending an empty string to the input:

com.twitter.penguin.korean.TwitterKoreanProcessor.detokenize(List("", "이", "제품을", "사용하겠습니다"))
// works
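
To illustrate the failure mode (not the library's actual code), here is a minimal sketch of how an `init` call on an empty accumulator raises this exception, and why a leading empty string avoids it. The `attachesToPrevious` set and both `join*` functions are hypothetical names made up for this example. I'm only assuming that the detokenizer tries to attach the first token to a preceding element that doesn't exist yet.

```scala
// Hypothetical sketch; names and logic are illustrative assumptions,
// not taken from twitter-korean-text.
object EmptyInitSketch {
  // Assumption: tokens in this set get glued onto the previous output element.
  val attachesToPrevious: Set[String] = Set("이", "을")

  // Fragile variant: calls init/last on the accumulator even when it is empty.
  def joinFragile(tokens: List[String]): List[String] =
    tokens.foldLeft(List.empty[String]) { (output, token) =>
      if (attachesToPrevious(token)) output.init :+ (output.last + token)
      else output :+ token
    }

  // Safer variant: only merges when a previous element actually exists.
  def joinSafe(tokens: List[String]): List[String] =
    tokens.foldLeft(List.empty[String]) { (output, token) =>
      output.lastOption match {
        case Some(prev) if attachesToPrevious(token) => output.init :+ (prev + token)
        case _                                       => output :+ token
      }
    }
}
```

With this sketch, `joinFragile(List("이", "제품"))` throws an `UnsupportedOperationException` because `init` is called on an empty list, while `joinFragile(List("", "이", "제품"))` succeeds because the empty string gives the accumulator a first element. `joinSafe` handles both inputs without the workaround.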

This is neither critical nor urgent, but it would be nice if this could be fixed in a future release of this great library.

hohyon-ryu commented 8 years ago

Thanks for filing the issue. I've created a patch in #92.