google / corpuscrawler

Crawler for linguistic corpora

Add Pali, Mon, and Karen #76

Closed by sffc 4 years ago

sffc commented 4 years ago

We've been referred to the following sources for corpora in additional Myanmar-script languages.

  1. Pali (Tri Pitaka) [pi-Mymr]
    1. https://tipitaka.org/mymr/
  2. Mon [mnw]
    1. http://mon.monnews.org/
    2. https://mnw.wikipedia.org/wiki/မုက်လိက်တမ
  3. Shan [shn] -- already included
    1. https://shannews.org/
    2. https://shn.wikipedia.org/wiki/ၼႃႈႁူဝ်ႁႅၵ်ႈ
  4. Karen [kar]
    1. http://karen.kicnews.org/
    2. https://wol.jw.org/ksw/wol/h/r350/lp-kr (Bible, Publications...)

CC @sven-oly

sffc commented 4 years ago

Mon and Shan are already included. I have Pali in my branch. I'll do Karen next, and then send a PR.