juditacs / morph-segmentation

Experimenting with supervised morphological segmentation
MIT License

Actually segmentation of Korean sentence can be done with splitter #33

Closed nakosung closed 6 years ago

nakosung commented 7 years ago

The natural representation of Korean is its 'disassembled form', where each Hangul syllable block is split into its individual letters (jamo).

e.g. 이건 --> ㅇ ㅣ ㄱ ㅓ ㄴ

This way, you can segment Korean words just as you do Hungarian ones.

This repo preprocesses a Korean dataset into that form:

https://github.com/nakosung/hangul-asm
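
For readers unfamiliar with the decomposition, here is a minimal sketch of how a precomposed Hangul syllable can be split into letters using the standard Unicode syllable arithmetic. It is not taken from hangul-asm; the function name `disassemble` and the choice to emit compatibility jamo are assumptions made for illustration.

```python
# Sketch only: standard Unicode Hangul syllable arithmetic, not hangul-asm's code.
# A precomposed syllable is 0xAC00 + (lead * 21 + vowel) * 28 + tail.

LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 medial vowels
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + "none"

SYLLABLE_BASE = 0xAC00
SYLLABLE_LAST = 0xD7A3


def disassemble(text):
    """Return a list of letters; non-Hangul characters pass through unchanged."""
    letters = []
    for ch in text:
        code = ord(ch)
        if SYLLABLE_BASE <= code <= SYLLABLE_LAST:
            offset = code - SYLLABLE_BASE
            lead, rest = divmod(offset, 21 * 28)
            vowel, tail = divmod(rest, 28)
            letters.append(LEADS[lead])
            letters.append(VOWELS[vowel])
            if TAILS[tail]:  # index 0 means the syllable has no final consonant
                letters.append(TAILS[tail])
        else:
            letters.append(ch)
    return letters


print(" ".join(disassemble("이건")))
```

With the tables above, `disassemble("이건")` yields the five letters from the example.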

juditacs commented 6 years ago

We performed morphological segmentation, which is different from splitting Hangul syllables into letters.

An example:

이건  ---> 이거 ㄴ

It's also different from Hungarian segmentation because it's not a binary classification of each character (is it a morpheme start or not?).
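
To make the contrast concrete, here is a rough sketch of the per-character binary labeling described above. It is not the code from this repo: `split_by_labels` and `all_segmentations` are hypothetical helpers, and the Hungarian word is just an illustrative example. It shows that no start/not-start labeling over the two syllable characters of 이건 can produce 이거 ㄴ, since the morpheme boundary falls inside the syllable 건.

```python
from itertools import product


def split_by_labels(word, labels):
    """Cut `word` before every character labeled 1 (morpheme start)."""
    if len(word) != len(labels):
        raise ValueError("need exactly one label per character")
    segments = []
    for ch, is_start in zip(word, labels):
        if is_start or not segments:
            segments.append(ch)   # open a new morpheme
        else:
            segments[-1] += ch    # continue the current morpheme
    return segments


def all_segmentations(word):
    """Every segmentation reachable by binary per-character labels."""
    if not word:
        return [[]]
    # The first character always starts a morpheme; the rest are free.
    return [
        split_by_labels(word, (1,) + rest)
        for rest in product((0, 1), repeat=len(word) - 1)
    ]


# Hungarian: kert + ek + ben, one binary label per character.
print(split_by_labels("kertekben", [1, 0, 0, 0, 1, 0, 1, 0, 0]))

# Korean at the syllable level: only two segmentations exist for 이건.
print(all_segmentations("이건"))
print(["이거", "ㄴ"] in all_segmentations("이건"))
```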