Analyze the basic statistics of datasets - 3 - Githubissues

vymana / indic_nlp

Analyze the basic statistics of datasets - 3 #11

Open matrixdecoded opened 3 years ago

matrixdecoded commented 3 years ago

Statistics:

Brief overview of each dataset: its parts, subsets, and files
Dataset size: size of the training, test, and dev sets
Label classes and the count for each class
Sample data (20 records from each data file)
Save all the data files and create a dataset on datastop.org
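
A rough sketch of how the size, label counts, and 20-record sample could be collected for one data file. It assumes tab-separated files with the label in the last column, which may not match the actual layout of these datasets; `summarize`, `read_tsv`, and the demo records are hypothetical names and placeholder data, not taken from any of the repositories listed in this issue.

```python
import csv
import io
import random
from collections import Counter


def read_tsv(handle):
    """Read tab-separated records from an open text handle (assumed file format)."""
    return csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def summarize(rows, label_index=-1, sample_size=20, seed=0):
    """Return the record count, per-label counts, and a sample for one data file."""
    # Drop blank lines, which csv.reader yields as empty lists.
    rows = [row for row in rows if row]
    label_counts = Counter(row[label_index] for row in rows)
    # A fixed seed keeps the sample reproducible between runs.
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return {"size": len(rows), "label_counts": dict(label_counts), "sample": sample}


# In-memory stand-in for a real split file (placeholder text and labels).
demo = io.StringIO(
    "first sentence\tLABEL_A\n"
    "second sentence\tLABEL_B\n"
    "third sentence\tLABEL_A\n"
)
stats = summarize(read_tsv(demo))
print(stats["size"], stats["label_counts"])
```

For a real file, the same call would look like `summarize(read_tsv(open(path, encoding="utf-8", newline="")))`, run once per training, test, and dev file, with `label_index` adjusted to wherever the label column actually sits.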

Datasets to review:

Language Identification - https://github.com/kmi-linguistics/vardial2018
(NA) Hindi Treebank - http://www.tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=1977&lang=en
Word Similarity - https://github.com/syedsarfarazakhtar/Word-Similarity-Datasets-for-Indian-Languages
(NA) Paraphrase detection - https://nlp.amrita.edu/dpil_cen/index.html

abhishekabhay910 commented 3 years ago

I have attached three zip files, each containing a dataset and its description: Language Identification.zip, Paraphrase Detection.zip, and Word Similarity.zip.