Closed zhigang-qi closed 9 years ago
Yes, when stemming is applied, 입니다 (12,3) will be transformed to the stem "이다". So, this is confusing but expected behavior. Also, the index values are computed after normalization is applied. Yeah, we would probably need a better transformation/indexing system, or just drop indices when stemming is applied.
Thanks for the explanation
"A captain would never run away from his duty, if he knew the ship was sinking."
The ideal way could be: Tokenizer -> tokens[ {value:ㅋㅋㅋㅋㅋ, stem:이다, pos:KoreanParticle, offset:15, length:5} {} ]
Any comments?
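For illustration, the proposed token structure could look roughly like the sketch below. This is not the library's API: the `Token` class, its field names, and the hand-written token list are assumptions made for the example. The point is that `offset` and `length` always refer to the original input, so slicing the input returns the surface form even when a stem is attached.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical token record; not the actual twitter-korean-text type."""
    value: str            # surface form exactly as it appears in the input
    stem: Optional[str]   # stemmed form, or None when no stemming applies
    pos: str              # part-of-speech label
    offset: int           # start index into the ORIGINAL input
    length: int           # length of the surface form


def surface(text: str, token: Token) -> str:
    """Recover the surface form of a token from the original text."""
    return text[token.offset:token.offset + token.length]


text = "예시입니닼ㅋㅋㅋㅋㅋ"

# Hand-written tokens for illustration only; a real tokenizer would produce these.
tokens = [
    Token("예시", None, "Noun", 0, 2),
    Token("입니닼", "이다", "Adjective", 2, 3),
    Token("ㅋㅋㅋㅋㅋ", None, "KoreanParticle", 5, 5),
]

for t in tokens:
    # The stem is carried alongside the surface form instead of replacing it.
    print(surface(text, t), t.stem, t.pos, t.offset, t.length)
```

Keeping the stem as a separate field means the offsets never have to describe text that does not exist in the input.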
That's brilliant. :) I will try that in the next version.
Cool!
The impractical offset is caused by transforming a token ("입니닼") into its root ("이다") during token splitting and POS tagging. Strictly speaking, this transformation is neither POS tagging nor tokenizing but lemmatizing (or stemming). Putting all those different components behind one interface that outputs the same data structure makes it easier to use but also produces confusion.
In my opinion, alongside such a nice and easy interface, it would be better to have the three components available separately as runnables so that we can chain them together, especially when there is more than one option for each component.
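A minimal sketch of what such chaining might look like, assuming each component is a plain function from a token list to a token list. The names `tokenize`, `make_tagger`, `make_stemmer`, and `chain` are hypothetical, and the whitespace tokenizer and dictionary lookups are toy stand-ins rather than anything the library provides.

```python
from functools import reduce


def tokenize(text):
    """Toy tokenizer: split on whitespace, keeping offsets into the original text."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        tokens.append({"value": word, "offset": start, "length": len(word)})
        pos = start + len(word)
    return tokens


def make_tagger(lexicon):
    """Build a tagging stage from a word -> POS lookup table (hypothetical)."""
    def tag(tokens):
        return [{**t, "pos": lexicon.get(t["value"], "Unknown")} for t in tokens]
    return tag


def make_stemmer(stems):
    """Build a stemming stage that adds a 'stem' field and leaves offsets untouched."""
    def stem(tokens):
        return [{**t, "stem": stems.get(t["value"], t["value"])} for t in tokens]
    return stem


def chain(*stages):
    """Compose stages left to right into a single callable."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


pipeline = chain(
    tokenize,
    make_tagger({"예시": "Noun", "입니다": "Adjective"}),
    make_stemmer({"입니다": "이다"}),
)
result = pipeline("예시 입니다")
```

Because every stage shares the same input and output shape, swapping in a different tagger or leaving out the stemmer only changes the arguments to `chain`.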
Yes. Modular design is much better.
This has been addressed by the 4.0 release.
Cool, will try it soon.
In the KoreanTokenJava class, the result of the offset() method seems weird.
Input: "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ"
Output: [한국어(Noun: 0, 3), 를(Josa: 3, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2), 한국어(Noun: 17, 3), 를(Josa: 20, 1), 처리(Noun: 22, 2), 하다(Verb: 24, 2), 예시(Noun: 27, 2), 이다(Adjective: 29, 3), ㅋㅋ(KoreanParticle: 32, 2)]
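The mismatch can be checked by slicing the input at a few of the reported (offset, length) pairs. In the sketch below, the `normalized` string is an assumption: it is a guess at what the input looks like after normalization ("입니닼ㅋㅋㅋㅋㅋ" becoming "입니다ㅋㅋ"), reconstructed from the reported output rather than taken from the library.

```python
original = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ"

# Assumed normalized form, reconstructed from the reported offsets.
normalized = "한국어를 처리하는 예시입니다ㅋㅋ한국어를 처리하는 예시입니다ㅋㅋ"

# A few (token, offset, length) triples copied from the reported output.
reported = [("한국어", 17, 3), ("예시", 27, 2), ("ㅋㅋ", 32, 2)]


def slice_at(text, offset, length):
    """Return the substring that an (offset, length) pair points at."""
    return text[offset:offset + length]


against_original = [slice_at(original, o, n) for _, o, n in reported]
against_normalized = [slice_at(normalized, o, n) for _, o, n in reported]

for (token, o, n), orig_hit, norm_hit in zip(reported, against_original, against_normalized):
    print(f"{token} ({o}, {n}): original -> {orig_hit!r}, normalized -> {norm_hit!r}")
```

Under that assumption, the reported offsets line up with the normalized text but point at unrelated characters in the original input once the first normalized span has been passed.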