Open mrocklin opened 6 years ago
@ogrisel has mentioned the Criteo dataset which is about 1TB of click logs.
(although note that we don't necessarily only want to focus on large datasets)
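For a click-log dataset of this size, the usual approach is out-of-core learning: hash the categorical features into a fixed-width sparse matrix and update a linear model one chunk at a time. A minimal sketch, using synthetic rows standing in for the real Criteo logs (reading the actual 1TB dump in chunks is left out):

```python
# Hedged sketch of out-of-core learning on Criteo-style click logs.
# The rows below are synthetic; feature names ("ad=...", "site=...") are
# made up for illustration, not the real Criteo schema.
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
hasher = FeatureHasher(n_features=2 ** 18, input_type="string")
clf = SGDClassifier(random_state=0)

def synthetic_chunk(n=1000):
    # Each sample: a few categorical tokens plus a binary click label.
    rows = [[f"ad={rng.randint(50)}", f"site={rng.randint(20)}"]
            for _ in range(n)]
    y = rng.randint(0, 2, size=n)
    return rows, y

for _ in range(5):  # one iteration per chunk of the log
    rows, y = synthetic_chunk()
    X = hasher.transform(rows)  # sparse, fixed width regardless of vocabulary
    clf.partial_fit(X, y, classes=[0, 1])
```

The hashing trick keeps memory constant no matter how many distinct feature values appear across the full log, which is what makes a 1TB dataset tractable on one machine.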
Some of these tools might be useful:
I've played with the data.world integrations, and was pleasantly surprised. I haven't touched Kaggle or Quilt. I've used torch pretty extensively (though not torch.data) and enjoyed it. Some example datasets from these tools:
torch.utils.data
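The map-style interface torch.utils.data builds on is just a pair of dunder methods, so it can be sketched without importing torch at all (the class and helper below are illustrative, not part of any library):

```python
# Sketch of the map-style dataset protocol torch.utils.data.Dataset expects:
# any object with __len__ and __getitem__. A real DataLoader would wrap
# an object like this directly; "batches" is a minimal stand-in for it.
class ToySquares:
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        return idx, idx * idx  # (input, target) pair

def batches(dataset, batch_size):
    # No shuffling or worker processes, just fixed-size batching.
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

ds = ToySquares(5)
all_batches = list(batches(ds, 2))
```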
What class is the object in a color image? These are (at least) tens of thousands of small images.

I think you mean "useful to scikit-learn", in which case I'll add the other Kaggle datasets and the other datasets on data.world or Quilt (I spent time looking for image datasets).
Diagnosing heart disease from DICOM MRI images https://www.kaggle.com/c/second-annual-data-science-bowl/data
In this dataset, you are given hundreds of cardiac MRI images in DICOM format. These are 2D cine images that contain approximately 30 images across the cardiac cycle. Each slice is acquired on a separate breath hold. This is important since the registration from slice to slice is expected to be imperfect.
The competition task is to create an automated method capable of determining the left ventricle volume at two points in time: after systole, when the heart is contracted and the ventricles are at their minimum volume, and after diastole, when the heart is at its largest volume.
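Once each short-axis slice has a segmented left-ventricle area, the volume is commonly approximated by summing the areas times the slice spacing (Simpson's rule over slices). A hedged sketch with made-up areas:

```python
# Hedged sketch: estimate LV volume from segmented per-slice areas.
# The areas and spacing below are synthetic illustrations, not real
# measurements from the Kaggle dataset.
import numpy as np

def lv_volume_ml(areas_mm2, slice_spacing_mm):
    # Sum of cross-sectional areas times spacing; mm^3 -> mL (1 mL = 1000 mm^3).
    return float(np.sum(areas_mm2) * slice_spacing_mm / 1000.0)

systole_areas = np.array([0.0, 300.0, 600.0, 600.0, 300.0, 0.0])  # mm^2
diastole_areas = systole_areas * 2.5
spacing = 8.0  # mm between slices, an assumed typical value

esv = lv_volume_ml(systole_areas, spacing)   # end-systolic volume
edv = lv_volume_ml(diastole_areas, spacing)  # end-diastolic volume
ejection_fraction = (edv - esv) / edv
```

The two volumes the competition asks for are exactly the systolic and diastolic estimates above; the ejection fraction is the clinically interesting derived quantity.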
Diagnosing lung cancer from DICOM CT images https://www.kaggle.com/c/data-science-bowl-2017/data
In this dataset, you are given over a thousand low-dose CT images from high-risk patients in DICOM format. Each image contains a series with multiple axial slices of the chest cavity. Each image has a variable number of 2D slices, depending on the machine taking the scan and on the patient.
The DICOM files have a header that contains the necessary information about the patient id, as well as scan parameters such as the slice thickness.
The competition task is to create an automated method capable of determining whether or not the patient will be diagnosed with lung cancer within one year of the date the scan was taken. The ground truth labels were confirmed by pathology diagnosis.
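Because each scan has a variable number of slices, a common preprocessing step is resampling along the slice axis to a fixed depth before feeding a model. A numpy-only sketch, with a random array standing in for the DICOM pixel data (which pydicom would normally supply):

```python
# Sketch: linearly resample a CT volume with a variable slice count to a
# fixed depth along axis 0. The "scan" array is synthetic stand-in data.
import numpy as np

def resample_depth(volume, target_depth):
    depth = volume.shape[0]
    # Positions of the target slices in the original slice-index space.
    src = np.linspace(0, depth - 1, target_depth)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, depth - 1)
    frac = (src - lo)[:, None, None]
    return (1 - frac) * volume[lo] + frac * volume[hi]

scan = np.random.RandomState(0).rand(37, 64, 64)  # 37 axial slices
fixed = resample_depth(scan, 32)                  # now exactly 32 slices
```

In practice the slice thickness from the DICOM header would be used to resample to a fixed physical spacing rather than a fixed slice count, but the interpolation is the same.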
awesome-public-datasets has a bunch of datasets, some of which are useful for ML: https://github.com/awesomedata/awesome-public-datasets
Has anybody tried to run it locally? For example: https://github.com/rambler-digital-solutions/criteo-1tb-benchmark
In order to make the most of our time at the scaling scikit-learn sprint, it might be helpful to prepare some challenge problems and datasets that we want to focus on before we arrive. Ideally these datasets and problems have the following qualities:
An equivalent issue for image processing datasets is posed at #2