vymana / indic_nlp

Analyze the basic statistics of datasets - 1 #10

Open matrixdecoded opened 3 years ago

Statistics:

  1. Brief overview of each dataset: its parts, subsets, and files
  2. Dataset size: size of the training, test, and dev sets
  3. Label classes and the count of each
  4. Sample data (20 records from each data file)
  5. Save all the data files and create a dataset on datastop.org
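
A minimal sketch of how items 2 to 4 could be computed, using only the Python standard library. It assumes each split is a tab-separated file with a header row and a `label` column, stored as `train.tsv`, `dev.tsv`, and `test.tsv`. That layout, the column name, and the helper names (`read_tsv`, `summarize`, `summarize_dataset`) are all hypothetical; the datasets listed in this issue ship in their own formats and will likely need their own readers.

```python
import csv
import random
from collections import Counter
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def summarize(rows, label_column="label", sample_size=20, seed=0):
    """Return the record count, per-label counts, and a reproducible sample.

    `label_column` is an assumed column name; rows without it are skipped
    when counting labels.
    """
    labels = Counter(row[label_column] for row in rows if label_column in row)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return {"size": len(rows), "label_counts": dict(labels), "sample": sample}


def summarize_dataset(directory, splits=("train", "dev", "test")):
    """Summarize each split file found under `directory` (assumed layout)."""
    report = {}
    for split in splits:
        path = Path(directory) / f"{split}.tsv"
        if path.exists():  # missing splits are simply left out of the report
            report[split] = summarize(read_tsv(path))
    return report
```

Calling `summarize_dataset("data/some_dataset")` (a placeholder path) would return one entry per split that exists, which can then be pasted into the issue as the per-dataset statistics.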

Datasets to review:

  1. HindEnCorp 0.5: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-625F-0
  2. Word Similarity (all languages): https://github.com/syedsarfarazakhtar/Word-Similarity-Datasets-for-Indian-Languages
  3. NER Hindi: FIRE 2013 AUKBC NER Corpus
  4. Text Classification (Hindi, Sanskrit): https://github.com/goru001/inltk