odakaui / vocabulist

A vocabulary database for learning Japanese
MIT License

Rewrite Tokenizer from Scratch in Rust #47

Open odakaui opened 4 years ago

odakaui commented 4 years ago

I would like to rewrite the tokenizer in Rust so that I do not have to rely on external dependencies.

Goals for the new tokenizer:

- It will be a separate crate that can be used on its own.
- It will be written in Rust.
- It will use UniDic or another tokenizer dictionary.
- It will return the surface string, the normalized form, and the part of speech (POS) of each word.
- It will be fast and efficient.
- It will be licensed under either the MIT or the Apache 2.0 license.
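To make the intended output concrete, here is a rough sketch of what the crate's token type and entry point might look like. Everything in it is an assumption for illustration: the names `Token`, `Tokenizer`, `new`, and `tokenize` are hypothetical, the dictionary is a small in-memory table rather than UniDic, and the segmentation is a naive greedy longest match. A real dictionary-based tokenizer would more likely build a lattice and pick the lowest-cost path, so treat this only as a picture of the surface / normalized / POS output.

```rust
use std::collections::HashMap;

/// One token: surface string, normalized form, and part of speech.
/// (Hypothetical type; field names are placeholders.)
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub surface: String,
    pub normalized: String,
    pub pos: String,
}

/// Hypothetical tokenizer backed by an in-memory table:
/// surface form -> (normalized form, POS).
pub struct Tokenizer {
    entries: HashMap<String, (String, String)>,
    /// Length in chars of the longest surface form in the table.
    max_len: usize,
}

impl Tokenizer {
    /// Build a tokenizer from (surface, normalized, pos) triples.
    pub fn new(entries: &[(&str, &str, &str)]) -> Self {
        let mut map = HashMap::new();
        let mut max_len = 0;
        for (surface, normalized, pos) in entries {
            max_len = max_len.max(surface.chars().count());
            map.insert(
                surface.to_string(),
                (normalized.to_string(), pos.to_string()),
            );
        }
        Tokenizer { entries: map, max_len }
    }

    /// Greedy longest-match segmentation (an assumption, not the planned
    /// algorithm). Characters with no dictionary match become
    /// single-character tokens tagged "unknown".
    pub fn tokenize(&self, text: &str) -> Vec<Token> {
        let chars: Vec<char> = text.chars().collect();
        let mut tokens = Vec::new();
        let mut i = 0;
        while i < chars.len() {
            let longest = self.max_len.min(chars.len() - i);
            let mut matched = false;
            // Try the longest candidate first, then shrink.
            for len in (1..=longest).rev() {
                let candidate: String = chars[i..i + len].iter().collect();
                if let Some((normalized, pos)) = self.entries.get(&candidate) {
                    tokens.push(Token {
                        surface: candidate,
                        normalized: normalized.clone(),
                        pos: pos.clone(),
                    });
                    i += len;
                    matched = true;
                    break;
                }
            }
            if !matched {
                let s = chars[i].to_string();
                tokens.push(Token {
                    surface: s.clone(),
                    normalized: s,
                    pos: "unknown".to_string(),
                });
                i += 1;
            }
        }
        tokens
    }
}
```

With a toy table containing `("ねこ", "猫", "noun")`, `("が", "が", "particle")`, and `("好き", "好き", "adjective")`, calling `tokenize("ねこが好きだ")` would yield four tokens, the last being `だ` tagged `unknown`. The sketch uses only the standard library, in keeping with the goal of avoiding external dependencies.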

I'll need to do some research.

I already looked at sudachi clone, but it doesn't appear to support loading the dictionary from an arbitrary path. I also looked at yoin, but I'd like to try writing the tokenizer myself for the learning experience.

odakaui commented 4 years ago

It would also potentially make it possible to run the tokenizer on iOS or in a macOS application.