himanshudce / Indian-Language-Dataset


Indian-Language-Dataset

A cleaned and preprocessed parallel corpus for five low-resource Indian languages.

| ID | Language  | Train  | Test | Dev  |
|----|-----------|--------|------|------|
| 1  | Tamil     | 183451 | 2000 | 1000 |
| 2  | Malayalam | 548000 | 3660 | 3000 |
| 3  | Telugu    | 75000  | 3897 | 3000 |
| 4  | Bengali   | 658000 | 3255 | 3500 |
| 5  | Urdu      | 36000  | 2454 | 2000 |
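
As a quick sanity check on the split sizes, the following minimal sketch totals the counts per language and reports the share that goes to training. The numbers are copied from the table above. The names `SPLITS` and `summarize` are illustrative only and are not part of the dataset. The README does not state the unit of the counts, so the code treats them as plain counts.

```python
# Split sizes copied from the table above, as (train, test, dev).
# SPLITS and summarize are illustrative names, not part of the dataset.
SPLITS = {
    "Tamil": (183451, 2000, 1000),
    "Malayalam": (548000, 3660, 3000),
    "Telugu": (75000, 3897, 3000),
    "Bengali": (658000, 3255, 3500),
    "Urdu": (36000, 2454, 2000),
}


def summarize(splits):
    """Return {language: (total, train_fraction)} for each language."""
    summary = {}
    for language, (train, test, dev) in splits.items():
        total = train + test + dev
        summary[language] = (total, train / total)
    return summary


if __name__ == "__main__":
    for language, (total, train_frac) in summarize(SPLITS).items():
        print(f"{language:<10} total={total:>7}  train={train_frac:.1%}")
```

The training sets differ widely in size across languages (36000 for Urdu versus 658000 for Bengali), which may matter when comparing results between language pairs.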

Dataset link: https://drive.google.com/open?id=1b3h13rBwTOZRygT6ZIdk4eZ9MKmXSZJa