grantjenks / python-wordsegment

English word segmentation, written in pure-Python, and based on a trillion-word corpus.
http://www.grantjenks.com/docs/wordsegment/

Training on new, modern data. #30

Closed sevmardi closed 4 years ago

sevmardi commented 4 years ago

Hi, I've seen #2, which asked about training on new data, but that was back in 2015. Is it possible to train the algorithm on newer, more modern datasets?

I don't have any particular corpus/dataset in mind.

For now, though, I do encounter cases like "bitcoin" being segmented as "bit, coin" instead of "bitcoin" (arguably both are valid), and "instagram" as "insta, gram".

grantjenks commented 4 years ago

There isn’t really a training process in the machine learning sense. It’s simply a statistical model based on unigram and bigram frequency counts. If you replace the unigram and bigram mappings, you can apply it to any language given a corpus.
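The idea can be sketched in a few lines of plain Python. This is a toy, not the library's actual implementation: the counts and the unseen-word penalty below are made up, but the structure (score every split by unigram probability, keep the most probable parse) is the same.

```python
from functools import lru_cache

# Hypothetical unigram counts standing in for the library's tables,
# which are derived from the Google Web Trillion Word Corpus.
COUNTS = {
    'bit': 500, 'coin': 400, 'bitcoin': 900,
    'insta': 50, 'gram': 300, 'instagram': 800,
    'is': 700, 'here': 600,
}
TOTAL = sum(COUNTS.values())

def score(word):
    # Unigram probability; unseen words get a penalty that grows
    # steeply with length so long unknown chunks are discouraged.
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    return 10.0 / (TOTAL * 10 ** len(word))

@lru_cache(maxsize=None)
def segment(text):
    # Try every split point and keep the highest-probability parse.
    if not text:
        return (), 1.0
    best_words, best_prob = (), 0.0
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        rest_words, rest_prob = segment(tail)
        prob = score(head) * rest_prob
        if prob > best_prob:
            best_words, best_prob = (head,) + rest_words, prob
    return best_words, best_prob
```

With these toy counts, segment('bitcoin') keeps the single token, because P(bitcoin) beats P(bit) × P(coin); with "bitcoin" absent from the counts, the two-word split would win. That is exactly why the reported splits happen: the default corpus predates those words.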

There’s an example of how to do so in the docs: http://www.grantjenks.com/docs/wordsegment/using-a-different-corpus.html