vymana / indic_nlp

Analyze the basic statistics of datasets - 1 #10

Open matrixdecoded opened 3 years ago

matrixdecoded commented 3 years ago

Statistics:

Brief overview of each dataset: its parts, subsets, and files
Dataset size: size of the training, test, and dev sets
Label classes and the count of each
Sample data (20 records from each data file)
Save all the data files and create a dataset on datastop.org
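
The size, label-count, and sampling items above could be collected with something like the sketch below. It is only a starting point and makes several assumptions that are not part of this issue: each data file is tab-separated with one record per line, the label sits in a known column, and the helper names (`read_tsv`, `summarize`, `summarize_splits`) are made up for illustration. The real files for each dataset may need their own readers.

```python
import csv
import random
from collections import Counter


def read_tsv(path):
    """Read a tab-separated file into a list of rows (assumed layout, not confirmed)."""
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]  # skip empty lines


def summarize(rows, label_column=None, sample_size=20, seed=0):
    """Return the record count, per-label counts, and a random sample of rows."""
    label_counts = Counter()
    if label_column is not None:
        # Rows that are too short to contain a label are ignored in the counts.
        label_counts = Counter(
            row[label_column] for row in rows if len(row) > label_column
        )
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return {"size": len(rows), "label_counts": dict(label_counts), "sample": sample}


def summarize_splits(splits, label_column=None, sample_size=20):
    """Apply summarize() to each split, e.g. {"train": rows, "dev": rows, "test": rows}."""
    return {
        name: summarize(rows, label_column, sample_size)
        for name, rows in splits.items()
    }


if __name__ == "__main__":
    # Toy in-memory records standing in for real data files.
    toy = {
        "train": [["अच्छा", "pos"], ["बुरा", "neg"], ["ठीक", "pos"]],
        "test": [["खराब", "neg"]],
    }
    for name, stats in summarize_splits(toy, label_column=1).items():
        print(name, stats["size"], stats["label_counts"])
```

For datasets without labels (for example a parallel corpus), `label_column` can be left as `None` so that only the size and the sample are reported.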

Datasets to review:

HindEnCorp 0.5: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-625F-0
Word Similarity (all languages): https://github.com/syedsarfarazakhtar/Word-Similarity-Datasets-for-Indian-Languages
NER Hindi: FIRE 2013 AUKBC NER Corpus
Text Classification (Hindi, Sanskrit): https://github.com/goru001/inltk