vymana / indic_nlp


Analyze the basic statistics of datasets - 2 #9

Open matrixdecoded opened 3 years ago

matrixdecoded commented 3 years ago

Statistics:

  1. Brief overview of datasets - different parts/subsets/files
  2. Datasets size - size of training, test and dev sets
  3. Different classes of labels and their counts
  4. Sample data (20 records from each data file).
  5. Save all the data files and create dataset on datastop.org
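Items 2 to 4 above can be sketched roughly as follows for a single data file. This is only a minimal sketch under stated assumptions: it assumes each file is a UTF-8 CSV with a header row and a label column named `label`, which may not match the actual layout of the datasets listed below. The function name `basic_stats` and its parameters are hypothetical and would need adjusting per dataset.

```python
import csv
import random
from collections import Counter


def basic_stats(path, label_column="label", sample_size=20, seed=0):
    """Return the record count, per-label counts, and a random sample for one file.

    Assumption: the file is a UTF-8 CSV with a header row that contains
    `label_column`. Real dataset files may be TSV or plain text instead.
    """
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Count how many records carry each label value.
    label_counts = Counter(row[label_column] for row in rows)

    # Draw up to `sample_size` records; a fixed seed keeps the sample reproducible.
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))

    return {
        "size": len(rows),
        "label_counts": dict(label_counts),
        "sample": sample,
    }
```

Running this once per training, test, and dev file would give the size and label distribution of each split, and the `sample` entry could be pasted into the issue as the 20 example records.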

Datasets to review:

  1. Review Sentiment Datasets - Hindi http://www.iitp.ac.in/~ai-nlp-ml/resources.html
  2. IIT Bombay English-Hindi Corpus
  3. HASOC 2019 Dataset: https://hasocfire.github.io/hasoc/2020/dataset.html
  4. (NA) http://amitavadas.com/sentiwordnet.php (send a request to access the data)

himanshu125 commented 3 years ago

I have emailed Amitava Das to request the SentiWordNet datasets (Indian languages: Hindi, Bengali, Telugu, and Tamil), and I have emailed hasocfire@gmail.com to request the key for the HASOC datasets. The Review Sentiment Datasets - Hindi (IIT Patna) are not present at the linked page.

himanshu125 commented 3 years ago

The key for opening the HASOC 2019 datasets folder is hasoc@2019. I have downloaded the IIT Bombay English-Hindi Corpus datasets. For http://amitavadas.com/sentiwordnet.php (send a request to access the data), I have not received any reply from Amitava Das.