Open ykhatami opened 3 years ago
No, I don’t have plans to ship those corpora at this time. The linked datasets do not appear to be redistributable for free. Under “View Fees”, the cost is $150 for non-members.
Not sure if this is of any use, but this may be handy for this task: https://github.com/Poio-NLP/poio-corpus (they used it to build a prediction engine, pressagio).
The LDC has published the Web 1T 5-gram, 10 European Languages dataset at https://catalog.ldc.upenn.edu/LDC2009T25
Is there any plan to support these languages? If not, can I jump in and contribute? Would it be enough to parse the above data and get the unigram/bigram counts?
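For illustration, here is a minimal sketch of what that parsing step could look like. It assumes each input line has the form `token1 token2 ...<TAB>count`, which is an assumption about the file layout and not something confirmed in this thread. The function name `count_ngrams` and the sample lines are hypothetical, and the output format this project actually expects may differ.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(lines: Iterable[str], n: int) -> Counter:
    """Sum counts for n-grams of order n.

    Assumes each line looks like 'w1 w2 ... wn<TAB>count' (an assumed layout).
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the n-gram text itself stays intact.
        ngram, sep, count = line.rpartition("\t")
        if not sep or not count.isdigit():
            continue  # skip lines that do not match the assumed layout
        tokens = ngram.split(" ")
        if len(tokens) == n:
            counts[tuple(tokens)] += int(count)
    return counts


# Hypothetical sample lines, not taken from the real dataset.
sample = ["der\t120", "die\t95", "der Hund\t7", "die Katze\t5", "der Hund\t3"]
unigrams = count_ngrams(sample, 1)
bigrams = count_ngrams(sample, 2)
```

With real data, `lines` could be an open file handle, so the files can be streamed one line at a time instead of loaded into memory.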