polm / fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

Vectorizing Japanese After Lemmatization #73

Closed · ruukasu3 closed this issue 1 year ago

ruukasu3 commented 1 year ago

Once lemmatized, how are Japanese lemmas vectorized?

polm commented 1 year ago

There isn't a single standard way to do this.

The most common approach is to build a fixed vocabulary and assign every word an integer index. You can also use fixed-size hashes if you're reasonably sure they won't collide, which is what spaCy does; you can read about how its Vocab works, for example. A minimal sketch of the index approach follows.
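To make that concrete, here's a minimal sketch of index-based vectorization. This is plain Python, not any particular library's API; `build_vocab` and `vectorize` are hypothetical helpers:

```python
def build_vocab(token_lists):
    # Assign every distinct token an integer index; 0 is reserved for unknowns.
    vocab = {"<unk>": 0}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(tokens, vocab):
    # Map tokens to their indices, falling back to <unk> for unseen words.
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]

corpus = [["猫", "が", "好き"], ["犬", "も", "好き"]]
vocab = build_vocab(corpus)
print(vectorize(["猫", "も", "嫌い"], vocab))  # [1, 5, 0]; "嫌い" is unseen
```

The hashing-trick variant replaces the lookup table with something like `hash(token) % table_size`, trading a small collision risk for not having to store the vocabulary at all; spaCy itself uses 64-bit MurmurHash values as word IDs.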

Usually the tricky part is not the vectorization itself, but building the vocabulary. The simplest option is BPE, as implemented in SentencePiece (see the sketch after this paragraph), but that approach has been criticized, and the right way to handle it is an area of active research. It's also easier to run into issues in Japanese than in English because of the much larger character inventory. You can see a variety of strategies in the awesome-bert-japanese repo, or read some details of how GPT handles Japanese in this recent article by @passaglia.
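As a rough illustration of the SentencePiece route, here's a minimal sketch (assuming the `sentencepiece` package is installed; `corpus.txt` and the model names are hypothetical):

```python
import sentencepiece as spm

# Train a small BPE model directly on raw, unsegmented Japanese text.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # plain text, one sentence per line
    model_prefix="ja_bpe",   # writes ja_bpe.model and ja_bpe.vocab
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="ja_bpe.model")
print(sp.encode("日本語の形態素解析", out_type=str))  # subword pieces
print(sp.encode("日本語の形態素解析", out_type=int))  # integer ids
```

Note that this route skips MeCab entirely: SentencePiece learns its subword units directly from raw text rather than from morphological analysis.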

Also, your question assumes you are lemmatizing text before vectorizing it. You can certainly do that, but replacing words with their lemmas is not common in modern large models, which generally have enough parameters to learn from unlemmatized text. Lemmatization mattered more in older models with a limited number of features.
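That said, if you do want lemmas first, here's a minimal fugashi sketch, assuming a UniDic-based dictionary such as unidic-lite, whose feature set includes a lemma field:

```python
from fugashi import Tagger

tagger = Tagger()
text = "麩菓子は、麩を主材料とした日本の菓子。"
# word.feature.lemma can be None for some tokens depending on the
# dictionary, so fall back to the surface form in that case.
lemmas = [word.feature.lemma or word.surface for word in tagger(text)]
print(lemmas)
```

The resulting lemma strings can then be fed into whichever vectorization scheme you pick above.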