grantjenks / python-wordsegment

English word segmentation, written in pure-Python, and based on a trillion-word corpus.
http://www.grantjenks.com/docs/wordsegment/

Support for Other Languages #32

Open ykhatami opened 3 years ago

ykhatami commented 3 years ago

The LDC has the Web 1T 5-gram 10 European Languages published at https://catalog.ldc.upenn.edu/LDC2009T25

Is there any plan to support these languages? If not, can I jump in and contribute? Would it be enough to parse the above data and get the unigram/bigram counts?
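For reference, the Web 1T releases distribute n-grams as plain-text count files, one n-gram per line in a tab-separated `ngram<TAB>count` layout. Assuming that format (and using a hypothetical `parse_ngram_counts` helper with made-up sample counts), a minimal parsing sketch might look like:

```python
from collections import Counter
from io import StringIO

def parse_ngram_counts(lines):
    """Parse Web 1T-style lines of the form 'w1 w2 ... wn<TAB>count'
    into a Counter mapping each (lowercased) n-gram to its count."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ngram, _, count = line.rpartition("\t")
        counts[ngram.lower()] += int(count)
    return counts

# Hypothetical sample data in the assumed tab-separated format.
unigram_sample = StringIO("the\t23135851162\nof\t13151942776\n")
bigram_sample = StringIO("of the\t2766332391\nin the\t1610420745\n")

unigrams = parse_ngram_counts(unigram_sample)
bigrams = parse_ngram_counts(bigram_sample)

print(unigrams["the"])    # 23135851162
print(bigrams["of the"])  # 2766332391
```

Bigrams would presumably need to be keyed as space-joined word pairs (as in the dicts wordsegment loads), and the per-language totals would have to be recomputed from the corpus rather than reusing the English total.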

grantjenks commented 3 years ago

No, I don’t have plans to ship those corpora at this time. The linked datasets do not appear to be freely redistributable. Under “View Fees”, the cost is $150 for non-members.

willwade commented 9 months ago

Not sure if this is of any use, but this may be handy for the task: https://github.com/Poio-NLP/poio-corpus (they used it to build a prediction engine, pressagio).