mohsinkhn / EasyLearn

This repository is for an Auto Machine Learning project

Datasets for Testing #1

Closed mohsinkhn closed 6 years ago

mohsinkhn commented 6 years ago

As discussed, we want at least 3 regression datasets for testing our modules. Please put your suggestions in this thread.

My suggestion is the House Prices prediction dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

mohsinkhn commented 6 years ago

Here is a list of datasets that Misha sent me: https://www.linkedin.com/feed/update/urn:li:activity:6363292959781326848

apoorv-agarwal commented 6 years ago

Credit card fraud detection data set: https://www.kaggle.com/mlg-ulb/creditcardfraud

mohsinkhn commented 6 years ago

Credit card default dataset from Misha: https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset/data

apoorv-agarwal commented 6 years ago

Mental health survey data (mostly categorical): https://www.kaggle.com/osmi/mental-health-in-tech-survey

mishagulati commented 6 years ago

Primary Tumor data: http://archive.ics.uci.edu/ml/machine-learning-databases/primary-tumor/

sahil-m commented 6 years ago

https://github.com/fivethirtyeight/data

mohsinkhn commented 6 years ago

https://www.kaggle.com/usdot/flight-delays/data

sahil-m commented 6 years ago

http://mlr.cs.umass.edu/ml/datasets/Automobile

sahil-m commented 6 years ago

https://data.nal.usda.gov/dataset/composition-foods-raw-processed-prepared-usda-national-nutrient-database-standard-reference-release-27

apoorv-agarwal commented 6 years ago

https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding/data

mohsinkhn commented 6 years ago

After discussion, we have decided on these 2 datasets:

  1. https://www.kaggle.com/mlg-ulb/creditcardfraud (using this one as it is small in size)
  2. https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

We also decided to have a third dataset for time-series/sequential applications.

mohsinkhn commented 6 years ago

Closing the issue as of now, finalizing on the above two datasets. Will open a separate issue for time-series data later.