vymana/indic_nlp
Analyze the basic statistics of datasets - 1
#10 · Open · matrixdecoded opened 3 years ago
matrixdecoded commented 3 years ago
Statistics:
- Brief overview of each dataset: its parts, subsets, and files
- Dataset size: size of the training, test, and dev sets
- Label classes and the count of each
- Sample data (20 records from each data file)
- Save all the data files and create a dataset on datastop.org
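
A rough sketch of how the statistics above could be collected, assuming each split has already been loaded into a list of dicts and that labels live under a `label` key. The function names, the `label` key, and the toy data are all hypothetical; none of the listed datasets is guaranteed to match this layout, so the loading step is left out on purpose.

```python
import random
from collections import Counter


def summarize_split(records, label_key="label", sample_size=20, seed=0):
    """Return size, label counts, and a reproducible sample for one split.

    `records` is assumed to be an iterable of dicts (hypothetical layout).
    Records without `label_key` are counted in the size but not in the labels.
    """
    records = list(records)
    label_counts = Counter(r[label_key] for r in records if label_key in r)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    k = min(sample_size, len(records))  # small files may have fewer than 20 rows
    return {
        "size": len(records),
        "label_counts": dict(label_counts),
        "sample": rng.sample(records, k),
    }


def summarize_dataset(splits, **kwargs):
    """Apply summarize_split to every split, e.g. train/test/dev."""
    return {name: summarize_split(recs, **kwargs) for name, recs in splits.items()}


if __name__ == "__main__":
    # Toy data for illustration only; not taken from any dataset in this issue.
    toy = {
        "train": [
            {"text": "a", "label": "pos"},
            {"text": "b", "label": "neg"},
            {"text": "c", "label": "pos"},
        ],
        "test": [{"text": "d", "label": "neg"}],
    }
    for name, stats in summarize_dataset(toy).items():
        print(name, stats["size"], stats["label_counts"])
```

For datasets without labels (parallel corpora, word-similarity pairs), `label_counts` simply comes back empty and only the size and sample are meaningful.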
Datasets to review:
- HindEnCorp 0.5: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-625F-0
- Word Similarity (all languages): https://github.com/syedsarfarazakhtar/Word-Similarity-Datasets-for-Indian-Languages
- NER Hindi: FIRE 2013 AUKBC NER Corpus
- Text Classification (Hindi, Sanskrit): https://github.com/goru001/inltk