Closed nakosung closed 6 years ago
We performed morphological segmentation, which is different from splitting Hangul syllables into letters.
An example:
이건 ---> 이거 ㄴ
It's also different from Hungarian segmentation, because it is not a binary classification of each character (is it the start of a morpheme or not?): in the example above, the final ㄴ is split out of the syllable 건.
The natural representation of Korean is its 'disassembled form', where each syllable block is split into its letters (jamo).
e.g., 이건 --> ㅇ ㅣ ㄱ ㅓ ㄴ
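For illustration, here is a minimal sketch of that disassembly in Python. It is not taken from the linked repo; it assumes only the standard Unicode layout of precomposed Hangul syllables (U+AC00 onward, 19 leads x 21 vowels x 28 tails), and the name `disassemble` is hypothetical.

```python
# Sketch only: split precomposed Hangul syllables into compatibility jamo
# using Unicode code point arithmetic. `disassemble` is an illustrative name.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"     # 21 medial vowels
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28, incl. "no tail"

SYLLABLE_BASE = 0xAC00
SYLLABLE_COUNT = len(LEADS) * len(VOWELS) * len(TAILS)  # 11172


def disassemble(text):
    """Return a list of jamo; non-Hangul characters are passed through."""
    out = []
    for ch in text:
        index = ord(ch) - SYLLABLE_BASE
        if 0 <= index < SYLLABLE_COUNT:
            lead, rest = divmod(index, len(VOWELS) * len(TAILS))
            vowel, tail = divmod(rest, len(TAILS))
            out.append(LEADS[lead])
            out.append(VOWELS[vowel])
            if TAILS[tail]:  # skip the empty "no final consonant" slot
                out.append(TAILS[tail])
        else:
            out.append(ch)
    return out


print(" ".join(disassemble("이건")))  # ㅇ ㅣ ㄱ ㅓ ㄴ
```

Note that this sketch keeps compound final consonants such as ㄳ as a single letter; whether to split them further is a separate design choice.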
In this way, you can segment Korean words just as you do Hungarian ones.
This is the repo for preprocessing the Korean dataset:
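To make that concrete, here is a small sketch of how the segmentation 이건 ---> 이거 ㄴ could be expressed as one binary label per letter once the word is disassembled. The label convention (1 = this letter starts a morpheme) and the helper name `split_by_labels` are assumptions for illustration, not something taken from either codebase.

```python
# Sketch only: per-letter binary labels over disassembled Hangul.
# 1 marks the first letter of a morpheme, 0 marks a continuation.
def split_by_labels(letters, labels):
    """Group letters into morphemes according to start-of-morpheme labels."""
    if len(letters) != len(labels):
        raise ValueError("need exactly one label per letter")
    morphemes = []
    for letter, is_start in zip(letters, labels):
        if is_start or not morphemes:
            morphemes.append([])  # open a new morpheme
        morphemes[-1].append(letter)
    return morphemes


letters = ["ㅇ", "ㅣ", "ㄱ", "ㅓ", "ㄴ"]  # 이건, disassembled
labels = [1, 0, 0, 0, 1]                   # 이거 | ㄴ

print(split_by_labels(letters, labels))
# [['ㅇ', 'ㅣ', 'ㄱ', 'ㅓ'], ['ㄴ']]
```

The same boundary cannot be expressed with one label per syllable block, since the ㄴ sits inside 건; that is the point of working on the disassembled form.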
https://github.com/nakosung/hangul-asm