vymana / indic_nlp


Analyze the basic statistics of datasets - 3 #11

Open matrixdecoded opened 3 years ago

matrixdecoded commented 3 years ago

Statistics:

  1. Brief overview of each dataset: its parts, subsets, and files
  2. Dataset sizes: number of records in the training, test, and dev sets
  3. The label classes and the count of records in each
  4. Sample data (20 records from each data file)
  5. Save all the data files and create a dataset on datastop.org
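
Items 2 to 4 above could be computed with something like the sketch below. It is only an illustration, not the project's script: it assumes each data file is tab-separated text with the label in the last column, which may not hold for every dataset listed in this issue. The function name `summarize` and the demo labels are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize(lines, label_col=-1, delimiter="\t", n_samples=20):
    """Return record count, per-label counts, and the first n_samples rows.

    Assumption: one record per line, fields separated by `delimiter`,
    label stored in column `label_col` (last column by default).
    """
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip blank lines
    label_counts = Counter(row[label_col] for row in rows)
    return {
        "size": len(rows),
        "label_counts": dict(label_counts),
        "sample": rows[:n_samples],
    }


# Tiny in-memory stand-in for a real data file (made-up records and labels).
demo = "sentence one\tHIN\nsentence two\tBHO\nsentence three\tHIN\n"
stats = summarize(io.StringIO(demo))
print(stats["size"])
print(stats["label_counts"])
```

For a real file, an open file handle can be passed in place of the `io.StringIO` object, for example `summarize(open(path, encoding="utf-8", newline=""))`, where `path` points at one of the train, test, or dev files.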

Datasets to review:

  1. Language Identification - https://github.com/kmi-linguistics/vardial2018
  2. (NA) Hindi Treebank - http://www.tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=1977&lang=en
  3. Word Similarity - https://github.com/syedsarfarazakhtar/Word-Similarity-Datasets-for-Indian-Languages
  4. (NA) Paraphrase detection - https://nlp.amrita.edu/dpil_cen/index.html

abhishekabhay910 commented 3 years ago

I have attached three zip archives, each containing the dataset files and their descriptions: Language Identification.zip, Paraphrase Detection.zip, and Word Similarity.zip.