twitter / twitter-korean-text

Korean tokenizer
Apache License 2.0

How to interpret the offset() method result when enabling stemmer and normalizer? #62

Closed zhigang-qi closed 9 years ago

zhigang-qi commented 9 years ago

In the KoreanTokenJava class, the result of the offset() method seems weird. Input: "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ"

Output: [한국어(Noun: 0, 3), 를(Josa: 3, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2), 한국어(Noun: 17, 3), 를(Josa: 20, 1), 처리(Noun: 22, 2), 하다(Verb: 24, 2), 예시(Noun: 27, 2), 이다(Adjective: 29, 3), ㅋㅋ(KoreanParticle: 32, 2)]

hohyon-ryu commented 9 years ago

Yes, when stemming is applied, 입니다 (12, 3) is transformed to the stem "이다". So this is confusing but expected behavior. Also, the index values refer to the text after normalization is applied. We probably need a better transformation/indexing system, or we could just drop the indices when stemming is applied.
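To make the point about indices concrete, here is a minimal Python sketch (not the library's code). The `normalized` string is an assumption inferred from the offsets reported above: it supposes the normalizer rewrites "입니닼ㅋㅋㅋㅋㅋ" as "입니다ㅋㅋ". The helper `surface` is hypothetical and only slices a string.

```python
raw = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ"
# Assumed normalizer output, inferred from the offsets in this issue.
normalized = "한국어를 처리하는 예시입니다ㅋㅋ한국어를 처리하는 예시입니다ㅋㅋ"

# (stem, pos, offset, length) for a few tokens copied from the reported output.
tokens = [
    ("하다", "Verb", 7, 2),
    ("이다", "Adjective", 12, 3),
    ("ㅋㅋ", "KoreanParticle", 15, 2),
    ("한국어", "Noun", 17, 3),
]


def surface(text, offset, length):
    """Return the substring that an (offset, length) pair points at."""
    return text[offset:offset + length]


for stem, pos, offset, length in tokens:
    # The stem can differ from the text the offsets cover, and the offsets
    # only line up with the normalized string, not with the raw input.
    print(stem, pos, surface(normalized, offset, length), surface(raw, offset, length))
```

Against the assumed normalized text, (12, 3) covers "입니다" even though the token is printed as "이다", and (17, 3) covers "한국어". Against the raw input, (17, 3) lands inside the run of "ㅋ" characters instead.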

zhigang-qi commented 9 years ago

Thanks for the explanation

"A captain would never run away from his duty, if he knew the ship was sinking."

zhigang-qi commented 9 years ago

The ideal way could be: Tokenizer -> tokens[ {value:ㅋㅋㅋㅋㅋ, stem:이다, pos:KoreanParticle, offset:15, length:5} {} ]
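The proposed token shape could be sketched in Python roughly as follows. This is only an illustration of the idea, not the library's data structure: the `Token` class, its `matches` helper, and the sample stems are all hypothetical, and the offsets here are assumed to point into the original, un-normalized input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    value: str   # surface form exactly as it appears in the input
    stem: str    # stemmed/normalized form, kept separately (illustrative)
    pos: str     # part-of-speech tag
    offset: int  # start index in the original input
    length: int  # length of the surface form in the original input

    def matches(self, text: str) -> bool:
        """Check that (offset, length) really covers `value` in `text`."""
        return text[self.offset:self.offset + self.length] == self.value


raw = "예시입니닼ㅋㅋㅋㅋㅋ"
tokens = [
    Token("예시", "예시", "Noun", 0, 2),
    Token("입니닼", "이다", "Adjective", 2, 3),
    Token("ㅋㅋㅋㅋㅋ", "ㅋㅋ", "KoreanParticle", 5, 5),
]

for token in tokens:
    print(token, token.matches(raw))
```

Keeping `value` and `stem` side by side means the offsets can always be checked against the input, whatever the stemmer does to the token text.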

Any comments?

hohyon-ryu commented 9 years ago

That's brilliant. :) I will try that in the next version.

zhigang-qi commented 9 years ago

Cool!

midnightradio commented 9 years ago

The impractical offsets are caused by transforming a token ("입니닼") into its root ("이다") while performing token splitting and POS tagging. Strictly speaking, that transformation is neither POS tagging nor tokenizing but lemmatizing (or stemming). Putting all of these different components behind one interface that outputs the same data structure makes the library easier to use, but it also causes confusion.

In my opinion, alongside such a nice and easy interface, it would be better to also expose those three components separately as runnables so that we can chain them together, especially when there is more than one option for each component.
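A rough Python sketch of what chaining separate components might look like. Every name here (`normalize`, `tokenize`, `make_stemmer`, `chain`, `Token`) is hypothetical, and the three stages are toy stand-ins rather than the library's algorithms: the normalizer only collapses repeated characters, the tokenizer only splits on whitespace, and the stemmer is a lookup table.

```python
import re
from dataclasses import dataclass, replace
from functools import reduce
from typing import Dict, List


@dataclass(frozen=True)
class Token:
    text: str
    offset: int
    length: int
    stem: str = ""


def normalize(text: str) -> str:
    # Toy normalizer: collapse any character repeated 3+ times down to 2.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


def tokenize(text: str) -> List[Token]:
    # Toy tokenizer: whitespace split; offsets index into the text it receives.
    return [Token(m.group(), m.start(), len(m.group()))
            for m in re.finditer(r"\S+", text)]


def make_stemmer(table: Dict[str, str]):
    # Toy stemmer: table lookup, leaving text/offset/length untouched.
    def stem(tokens: List[Token]) -> List[Token]:
        return [replace(t, stem=table.get(t.text, t.text)) for t in tokens]
    return stem


def chain(*stages):
    # Compose stages left to right: the output of one feeds the next.
    return lambda value: reduce(lambda acc, stage: stage(acc), stages, value)


pipeline = chain(normalize, tokenize, make_stemmer({"하는": "하다"}))
result = pipeline("처리 하는 ㅋㅋㅋㅋㅋ")
print(result)
```

Because each stage is an ordinary callable, swapping in a different stemmer or skipping normalization is just a matter of building a different chain.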

zhigang-qi commented 9 years ago

Yes. Modular design is much better.

hohyon-ryu commented 9 years ago

This has been addressed by the 4.0 release.

zhigang-qi commented 9 years ago

Cool, will try it soon.
