thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

Add Samanantar datasets. #142

Closed BrightXiaoHan closed 1 year ago

BrightXiaoHan commented 1 year ago

Samanantar is the largest publicly available collection of parallel corpora for Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. The corpus has 49.6M sentence pairs between English and the Indian languages.

https://ai4bharat.iitm.ac.in/samanantar

thammegowda commented 1 year ago

@BrightXiaoHan Thanks for creating this issue. If this is urgent, could you please update the link at https://github.com/thammegowda/mtdata/blob/c57dab559e05e80ccbcb26fc44bb7fc94d676ef2/mtdata/index/ai4bharat.py#L17 to v0.3 (or the newest version) from https://ai4bharat.iitm.ac.in/samanantar and test whether it works? Thanks!

BrightXiaoHan commented 1 year ago

Thanks for your reply. I will try to test it.