twitter / twitter-korean-text

Korean tokenizer
Apache License 2.0

Tokenizer throws exception with certain input #97

Closed. rigeljs closed this issue 8 years ago.

rigeljs commented 8 years ago

The tokenizer throws an UnsupportedOperationException with the following input:

해쵸쵸쵸쵸쵸쵸쵸쵸춏

It also seems to throw the exception when there are more than 8 '쵸' characters in the middle, but it doesn't fail with fewer than 8. Here's a more complete stack trace:

java.lang.UnsupportedOperationException: empty.minBy
    at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:252)
    at scala.collection.AbstractTraversable.minBy(Traversable.scala:104)
    at com.twitter.penguin.korean.tokenizer.KoreanTokenizer$.com$twitter$penguin$korean$tokenizer$KoreanTokenizer$$parseKoreanChunk(KoreanTokenizer.scala:197)
    at com.twitter.penguin.korean.tokenizer.KoreanTokenizer$$anonfun$tokenize$1.apply(KoreanTokenizer.scala:99)
    at com.twitter.penguin.korean.tokenizer.KoreanTokenizer$$anonfun$tokenize$1.apply(KoreanTokenizer.scala:96)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
    at scala.collection.immutable.List.flatMap(List.scala:344)
    at com.twitter.penguin.korean.tokenizer.KoreanTokenizer$.tokenize(KoreanTokenizer.scala:96)
    at com.twitter.penguin.korean.TwitterKoreanProcessor$.tokenize(TwitterKoreanProcessor.scala:49)
    at com.twitter.penguin.korean.TwitterKoreanProcessor.tokenize(TwitterKoreanProcessor.scala)
    at com.twitter.penguin.korean.TwitterKoreanProcessorJava.tokenize(TwitterKoreanProcessorJava.java:56)

Thanks!
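
For context, the top frames of the trace show `minBy` being called on an empty collection inside `parseKoreanChunk`. The following is a minimal standalone sketch of that failure mode and one possible guard. It does not use the library's code: `Candidate`, `bestCandidate`, and the `score` field are hypothetical names chosen for illustration, and the assumption that the empty collection holds candidate parses is a guess, not something the trace confirms.

```scala
object EmptyMinByDemo {
  // Hypothetical stand-in for a candidate parse; NOT the library's real type.
  final case class Candidate(tokens: List[String], score: Int)

  // Guarded selection: returns None when there is nothing to choose from,
  // instead of letting minBy throw on an empty collection.
  def bestCandidate(candidates: Seq[Candidate]): Option[Candidate] =
    if (candidates.isEmpty) None else Some(candidates.minBy(_.score))

  def main(args: Array[String]): Unit = {
    val noCandidates = Seq.empty[Candidate]

    // Calling minBy directly on an empty collection throws
    // java.lang.UnsupportedOperationException.
    try {
      noCandidates.minBy(_.score)
    } catch {
      case e: UnsupportedOperationException =>
        println(s"caught: ${e.getMessage}")
    }

    // The guarded version lets the caller decide on a fallback.
    println(bestCandidate(noCandidates))
    println(bestCandidate(Seq(
      Candidate(List("a", "b"), 2),
      Candidate(List("ab"), 1)
    )))
  }
}
```

Whether returning an `Option` or falling back to a single unparsed token is the right behavior for the tokenizer is a design decision for the maintainers; the sketch only illustrates why the unguarded call fails.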

hohyon-ryu commented 8 years ago

Thank you for the report! I will look into it as soon as I can find time.