I would like to rewrite the tokenizer in Rust so that I do not have to rely on external dependencies.
It will be a separate crate that could be used by itself.
It will be written in Rust.
It will use UniDic or another tokenizer dictionary.
It will return the surface string, normalized form, and part of speech (POS) of each word.
It will be fast and efficient.
It will be licensed under either the MIT or Apache 2.0 License.
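
As a rough sketch of the output described above (surface string, normalized form, and POS per word), the crate's API could look something like this. Everything here is an assumption for illustration: the names `Token`, `Entry`, `Tokenizer`, `from_entries`, and `tokenize` are hypothetical, the dictionary is a tiny in-memory table rather than real UniDic data, and greedy longest-match stands in for the lattice search a real dictionary-based tokenizer would normally use.

```rust
use std::collections::HashMap;

/// One token: surface string, normalized form, and part of speech.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub surface: String,
    pub normalized: String,
    pub pos: String,
}

/// Hypothetical dictionary entry. A real UniDic entry carries many more fields.
#[derive(Debug, Clone)]
struct Entry {
    normalized: String,
    pos: String,
}

/// Hypothetical tokenizer backed by an in-memory table keyed by surface form.
pub struct Tokenizer {
    entries: HashMap<String, Entry>,
    /// Length in chars of the longest surface form in the table.
    max_len: usize,
}

impl Tokenizer {
    /// Build from (surface, normalized, pos) triples.
    /// Assumption: a real crate would load these from a dictionary file instead.
    pub fn from_entries(items: &[(&str, &str, &str)]) -> Self {
        let mut entries = HashMap::new();
        let mut max_len = 0;
        for &(surface, normalized, pos) in items {
            max_len = max_len.max(surface.chars().count());
            entries.insert(
                surface.to_string(),
                Entry { normalized: normalized.to_string(), pos: pos.to_string() },
            );
        }
        Tokenizer { entries, max_len }
    }

    /// Greedy longest-match segmentation (a stand-in, not a lattice search).
    /// Characters with no dictionary match become one-char "unknown" tokens.
    pub fn tokenize(&self, text: &str) -> Vec<Token> {
        let chars: Vec<char> = text.chars().collect();
        let mut tokens = Vec::new();
        let mut i = 0;
        while i < chars.len() {
            let longest = self.max_len.min(chars.len() - i);
            let mut matched = false;
            // Try the longest candidate first, then shrink.
            for len in (1..=longest).rev() {
                let candidate: String = chars[i..i + len].iter().collect();
                if let Some(entry) = self.entries.get(&candidate) {
                    tokens.push(Token {
                        surface: candidate,
                        normalized: entry.normalized.clone(),
                        pos: entry.pos.clone(),
                    });
                    i += len;
                    matched = true;
                    break;
                }
            }
            if !matched {
                let s = chars[i].to_string();
                tokens.push(Token {
                    surface: s.clone(),
                    normalized: s,
                    pos: "unknown".to_string(),
                });
                i += 1;
            }
        }
        tokens
    }
}
```

With toy entries such as `("ねこ", "猫", "名詞")`, `("が", "が", "助詞")`, and `("好き", "好き", "形状詞")`, calling `tokenize("ねこが好き")` would yield three tokens in order. Keeping the dictionary source behind a constructor also leaves room for a path-based loader later, so the dictionary can live anywhere.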
I'll need to do some research.
I already looked at a Sudachi clone, but it doesn't appear to let you put the dictionary at an arbitrary path.
I also looked at yoin, but I'd like to try my hand at writing it myself for the learning experience.