treforevans / uci_datasets

Regression datasets from the UCI repository with standardized test-train splits.
MIT License

UCI datasets

Regression datasets from the UCI machine learning repository prepared for benchmarking studies with test-train splits.

Installation

Install using pip (the download size is about 312 MB):

python -m pip install git+https://github.com/treforevans/uci_datasets.git

Usage

The following code gets the first test-train split (i.e., split=0) of the challenger dataset:

from uci_datasets import Dataset
data = Dataset("challenger")
x_train, y_train, x_test, y_test = data.get_split(split=0)

There are 10 test-train splits for each dataset (as in 10-fold cross-validation). In each split, 90% of the observations are training points and 10% are test points. The split parameter of the Dataset.get_split method accepts integers from 0 to 9 (inclusive).
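To make the split indexing concrete, the sketch below loops over all ten splits and records the training and test set sizes. It assumes only what the usage example shows: a callable that accepts a split keyword and returns x_train, y_train, x_test, y_test in that order. The names summarize_splits and toy_get_split are hypothetical and not part of the package; toy_get_split is a stand-in so the snippet runs without downloading any data.

```python
from typing import Callable, List, Sequence, Tuple

Split = Tuple[Sequence, Sequence, Sequence, Sequence]


def summarize_splits(
    get_split: Callable[..., Split], n_splits: int = 10
) -> List[Tuple[int, int, float]]:
    """Return (n_train, n_test, test_fraction) for split = 0, ..., n_splits - 1."""
    summary = []
    for split in range(n_splits):
        x_train, y_train, x_test, y_test = get_split(split=split)
        n_train, n_test = len(x_train), len(x_test)
        summary.append((n_train, n_test, n_test / (n_train + n_test)))
    return summary


def toy_get_split(split: int = 0) -> Split:
    """Hypothetical stand-in for Dataset.get_split on 50 toy points."""
    n = 50
    x = [[float(i)] for i in range(n)]
    y = [[2.0 * i] for i in range(n)]
    # Every 10th point, offset by the split index, is held out for testing.
    test_idx = set(range(split, n, 10))
    x_train = [x[i] for i in range(n) if i not in test_idx]
    y_train = [y[i] for i in range(n) if i not in test_idx]
    x_test = [x[i] for i in sorted(test_idx)]
    y_test = [y[i] for i in sorted(test_idx)]
    return x_train, y_train, x_test, y_test


for n_train, n_test, fraction in summarize_splits(toy_get_split):
    print(n_train, n_test, round(fraction, 2))
```

With the package installed, passing Dataset("challenger").get_split in place of the stand-in should work the same way, since the loop only calls len on the returned arrays. For datasets whose size is not a multiple of 10 (challenger has 23 observations), the 90%/10% proportions can only hold approximately.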

Datasets

The table below lists the size (number of observations) and the number of input dimensions of each dataset. All datasets have a single output dimension.

Dataset name    Number of observations  Input dimension
3droad          434874                  3
autompg         392                     7
bike            17379                   17
challenger      23                      4
concreteslump   103                     7
energy          768                     8
forest          517                     12
houseelectric   2049280                 11
keggdirected    48827                   20
kin40k          40000                   8
parkinsons      5875                    20
pol             15000                   26
pumadyn32nm     8192                    32
slice           53500                   385
solar           1066                    10
stock           536                     11
yacht           308                     6
airfoil         1503                    5
autos           159                     25
breastcancer    194                     33
buzz            583250                  77
concrete        1030                    8
elevators       16599                   18
fertility       100                     9
gas             2565                    128
housing         506                     13
keggundirected  63608                   27
machine         209                     7
pendulum        630                     9
protein         45730                   9
servo           167                     4
skillcraft      3338                    19
sml             4137                    26
song            515345                  90
tamielectric    45781                   3
wine            1599                    11

Dataset information can be obtained from the all_datasets dictionary. For example, to obtain a list of all datasets with fewer than 1000 observations, execute the following:

from uci_datasets import all_datasets
[name for name, (n_observations, n_dimensions) in all_datasets.items() if n_observations < 1000]
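The same pattern extends to filtering on both size and input dimension. The sketch below uses a small stand-in dictionary, toy_all_datasets, with the layout the example above unpacks (name mapped to a pair of observation count and input dimension) and values copied from a few rows of the table. The helper select_datasets is hypothetical and not part of the package.

```python
from typing import Dict, List, Tuple

# Stand-in with the layout assumed above: name -> (n_observations, n_dimensions).
# The values are copied from five rows of the dataset table.
toy_all_datasets: Dict[str, Tuple[int, int]] = {
    "challenger": (23, 4),
    "yacht": (308, 6),
    "housing": (506, 13),
    "energy": (768, 8),
    "wine": (1599, 11),
}


def select_datasets(
    datasets: Dict[str, Tuple[int, int]],
    max_observations: int,
    max_dimensions: int,
) -> List[str]:
    """Names within both limits, ordered from fewest to most observations."""
    chosen = [
        (n_observations, name)
        for name, (n_observations, n_dimensions) in datasets.items()
        if n_observations <= max_observations and n_dimensions <= max_dimensions
    ]
    return [name for _, name in sorted(chosen)]


print(select_datasets(toy_all_datasets, max_observations=1000, max_dimensions=8))
# prints ['challenger', 'yacht', 'energy']
```

If all_datasets has the layout assumed here, it could be passed in place of the stand-in to pick, for example, the datasets small enough for exact Gaussian process regression.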

Papers using these datasets

The following papers use the same datasets and test-train splits present in this repository.