Regression datasets from the UCI machine learning repository prepared for benchmarking studies with test-train splits.
Install using pip (the download size is about 312 MB):
python -m pip install git+https://github.com/treforevans/uci_datasets.git
The following code gets the first test-train split (i.e., `split=0`) of the `challenger` dataset:
from uci_datasets import Dataset
data = Dataset("challenger")
x_train, y_train, x_test, y_test = data.get_split(split=0)
Each dataset has 10 test-train splits (as in 10-fold cross-validation). In each split, 90% of the observations are training points and 10% are test points.
The `split` parameter of the `Dataset.get_split` method accepts integers from 0 to 9 (inclusive).
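To loop over all ten splits, something like the following sketch may help. The helper name `split_sizes` is hypothetical and not part of `uci_datasets`. It assumes only that `get_split(split=i)` returns the four-tuple shown above and that the returned arrays support `len()`.

```python
def split_sizes(get_split, n_splits=10):
    """Return (n_train, n_test) for each split index 0..n_splits-1."""
    sizes = []
    for split in range(n_splits):
        # Same call signature as in the example above.
        x_train, y_train, x_test, y_test = get_split(split=split)
        sizes.append((len(x_train), len(x_test)))
    return sizes
```

With the package installed, it could be called as `split_sizes(Dataset("challenger").get_split)`; roughly 10% of the observations should land in the test set of each split.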
The table below lists the size (number of observations) and the number of input dimensions of each dataset. All datasets have a single output dimension.
Dataset name | Number of observations | Input dimension |
---|---|---|
3droad | 434874 | 3 |
autompg | 392 | 7 |
bike | 17379 | 17 |
challenger | 23 | 4 |
concreteslump | 103 | 7 |
energy | 768 | 8 |
forest | 517 | 12 |
houseelectric | 2049280 | 11 |
keggdirected | 48827 | 20 |
kin40k | 40000 | 8 |
parkinsons | 5875 | 20 |
pol | 15000 | 26 |
pumadyn32nm | 8192 | 32 |
slice | 53500 | 385 |
solar | 1066 | 10 |
stock | 536 | 11 |
yacht | 308 | 6 |
airfoil | 1503 | 5 |
autos | 159 | 25 |
breastcancer | 194 | 33 |
buzz | 583250 | 77 |
concrete | 1030 | 8 |
elevators | 16599 | 18 |
fertility | 100 | 9 |
gas | 2565 | 128 |
housing | 506 | 13 |
keggundirected | 63608 | 27 |
machine | 209 | 7 |
pendulum | 630 | 9 |
protein | 45730 | 9 |
servo | 167 | 4 |
skillcraft | 3338 | 19 |
sml | 4137 | 26 |
song | 515345 | 90 |
tamielectric | 45781 | 3 |
wine | 1599 | 11 |
Dataset information can be obtained from the `all_datasets` dictionary.
For example, to obtain a list of all datasets with fewer than 1000 observations, execute the following:
from uci_datasets import all_datasets
[name for name, (n_observations, n_dimensions) in all_datasets.items() if n_observations < 1000]
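The same idea extends to filtering on both size and input dimension. The sketch below uses a hypothetical helper, `select_datasets`, which is not part of the package. It assumes the dictionary maps each name to an `(n_observations, n_dimensions)` tuple, as the comprehension above suggests, and it treats both limits as inclusive.

```python
def select_datasets(datasets, max_observations=None, max_dimensions=None):
    """Return dataset names within the given (inclusive) limits, smallest first."""
    selected = []
    for name, (n_observations, n_dimensions) in datasets.items():
        if max_observations is not None and n_observations > max_observations:
            continue
        if max_dimensions is not None and n_dimensions > max_dimensions:
            continue
        selected.append((n_observations, name))
    # Sort by number of observations, then by name.
    return [name for _, name in sorted(selected)]


# Three entries copied from the table above, standing in for all_datasets.
sample = {"song": (515345, 90), "yacht": (308, 6), "challenger": (23, 4)}
small = select_datasets(sample, max_observations=1000)
```

With the package installed, passing `all_datasets` instead of `sample` should work the same way, provided the tuple layout assumed here holds.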
The following papers use the same datasets and test-train splits present in this repository.