Closed sevmardi closed 4 years ago
There isn’t really a training process in the machine learning sense. It’s simply a statistical model based on unigram and bigram frequency counts. If you replace the unigram and bigram mappings, you can apply it to any language, given a corpus.
There’s an example of how to do so in the docs: http://www.grantjenks.com/docs/wordsegment/using-a-different-corpus.html
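To illustrate why swapping the counts changes the output, here is a minimal, self-contained sketch of a unigram-only segmenter. It is not the wordsegment implementation: the `make_segmenter` helper, the unseen-word penalty, and the toy counts are all assumptions made up for this example, and the real library also uses bigram counts.

```python
import math
from functools import lru_cache


def make_segmenter(unigrams, max_word_len=24):
    """Build a toy segmenter from a {word: count} mapping (hypothetical helper)."""
    total = sum(unigrams.values())

    def score(word):
        # Log-probability of a known word, from its relative frequency.
        if word in unigrams:
            return math.log10(unigrams[word] / total)
        # Assumed penalty for unseen words: the longer, the less likely.
        return math.log10(10.0 / (total * 10 ** len(word)))

    def segment(text):
        @lru_cache(maxsize=None)
        def best(start):
            # Best (score, words) for the suffix text[start:].
            if start == len(text):
                return 0.0, ()
            candidates = []
            for end in range(start + 1, min(len(text), start + max_word_len) + 1):
                word = text[start:end]
                rest_score, rest_words = best(end)
                candidates.append((score(word) + rest_score, (word,) + rest_words))
            return max(candidates)

        return list(best(0)[1])

    return segment


# Toy counts, invented for illustration only.
old_counts = {"bit": 500, "coin": 300}
new_counts = {"bit": 500, "coin": 300, "bitcoin": 400}

print(make_segmenter(old_counts)("bitcoin"))  # ['bit', 'coin']
print(make_segmenter(new_counts)("bitcoin"))  # ['bitcoin']
```

With the older counts, "bitcoin" is unseen and heavily penalized, so the split wins. Once a newer corpus contributes a count for "bitcoin", the single word scores higher than the product of the two parts.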
Hi, I've seen #2 asking to train on new data, but that was back in 2015. Is it possible to train the algo on more, new, modern datasets?
I don't have any particular corpus/dataset in mind.
But for now I do encounter things, e.g. "bitcoin" classified as `bit, coin` instead of `bitcoin` (obviously both are true) and "instagram" as `insta, gram`.