My two cents - Githubissues

Hi, thanks for the comments.

Copilot suggested this repository while adding additional tokens (jamo) to my tokenizer.

Interesting, this is my first time having that happen. I am flattered.

I'm afraid to say that this is basically character-level encoding, or the same as one-hot encoding with every single Korean character in the vocabulary, because the embedding is already doing the same thing.

I think you are slightly misunderstanding what our work does. We are doing character (= syllable/음절) level modeling, but we are doing it in a way that reduces parameter counts by only using subcharacter/자모 features. You can read about it here: https://aclanthology.org/2023.eacl-main.172/.

On the encoding side there are roughly 3 options:

1. One-hot syllable: requires 11k embedding vectors
2. One-hot jamo: requires ~70 embedding vectors, but 3x sequence length
3. Three-hot syllable: requires 70 embedding vectors but syllable-level sequence length
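For concreteness, these counts can be worked out from the Unicode Hangul syllable layout: 19 initial, 21 medial, and 28 final jamo slots, where the final slot includes "no final consonant". The sketch below is only illustrative. The helper name `encoding_costs` is made up here, and the 3x figure assumes every syllable is expanded to exactly three jamo positions.

```python
# Jamo inventory sizes from the Unicode Hangul syllable block.
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # final includes "no final consonant"


def encoding_costs(num_syllables):
    """Return {option: (embedding_vectors, sequence_length)}.

    Hypothetical helper for illustration only.
    """
    syllable_vocab = N_INITIAL * N_MEDIAL * N_FINAL  # 11172, the "11k" above
    jamo_vocab = N_INITIAL + N_MEDIAL + N_FINAL      # 68, the "~70" above
    return {
        "one-hot syllable": (syllable_vocab, num_syllables),
        # Assumes each syllable is split into three jamo positions.
        "one-hot jamo": (jamo_vocab, 3 * num_syllables),
        "three-hot syllable": (jamo_vocab, num_syllables),
    }


# Example: a three-syllable word.
for name, (vectors, length) in encoding_costs(3).items():
    print(f"{name}: {vectors} embedding vectors, sequence length {length}")
```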

Ours is three-hot syllable, so we do produce a single syllable-level encoding for each syllable in the text, but it's made from a combination of the component jamo parts.
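As a rough illustration of what a three-hot syllable vector looks like, the sketch below decomposes a precomposed Hangul syllable with standard Unicode arithmetic and sets one position per jamo slot. The function names `decompose` and `three_hot` are hypothetical and not taken from the repository. How the three active positions are then turned into a single embedding (for example by summing or concatenating) is left out, since that is not spelled out here.

```python
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # final includes "no final consonant"


def decompose(syllable):
    """Split one precomposed Hangul syllable into (initial, medial, final) indices."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < N_INITIAL * N_MEDIAL * N_FINAL:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


def three_hot(syllable):
    """Return a 68-dimensional 0/1 vector with exactly three active positions."""
    initial, medial, final = decompose(syllable)
    vec = [0] * (N_INITIAL + N_MEDIAL + N_FINAL)
    vec[initial] = 1                        # initial consonant slot
    vec[N_INITIAL + medial] = 1             # vowel slot
    vec[N_INITIAL + N_MEDIAL + final] = 1   # final consonant slot (index 0 = none)
    return vec


han = "\uD55C"  # the syllable 한
print(decompose(han))       # indices of the initial, medial, and final jamo
print(sum(three_hot(han)))  # number of active positions
```

One vector is produced per syllable, so the sequence length stays at the syllable level while the vocabulary stays at the jamo level.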

However, our work mainly focused on the output side, where there is a fourth option: independent three-hot syllable (https://koreascience.kr/article/CFKO201832073079068.pdf). We show that this one doesn't properly model syllables, and propose conditional three-hot syllable decoding, which also only requires ~70 embedding vectors and outputs full syllables in one timestep.
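One way to write the difference down, as a sketch rather than the paper's own notation: let $h$ be the model's hidden state at a timestep and $(i, m, f)$ the initial, medial, and final jamo of the syllable being produced. The symbols are assumptions made for this illustration, and the second line assumes that "conditional" means each jamo is predicted given the jamo already chosen for the same syllable.

```latex
% Independent three-hot decoding: each jamo is predicted separately from h.
P_{\text{indep}}(i, m, f \mid h) = P(i \mid h)\, P(m \mid h)\, P(f \mid h)

% Conditional three-hot decoding: a chain-rule factorization within the syllable.
P_{\text{cond}}(i, m, f \mid h) = P(i \mid h)\, P(m \mid h, i)\, P(f \mid h, i, m)
```

Under this reading, the independent form cannot express dependencies between the jamo of one syllable, while the conditional form can represent any distribution over full syllables and still emits one syllable per timestep.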

So, to summarize, we are doing character-level encoding but with a reduced parameter count (11k -> 70 embedding vectors but no sequence length increase).

mcognetta / ThreeHotKoreanModeling
