scisprints / 2018_05_sklearn_skimage_dask

BSD 3-Clause "New" or "Revised" License

Example problems and datasets for machine learning #1

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

In order to make the most of our time at the scaling scikit-learn sprint it might be helpful to prepare some challenge problems and datasets that we want to focus on before we arrive. Ideally these datasets and problems have the following qualities:

  1. They represent classes of real problems faced by many researchers today
  2. They are challenging for scikit-learn today but might be made more comfortable by improving scalability
  3. They are based on datasets that are publicly and easily accessible
  4. They are as simple as possible, given the constraints above

An equivalent issue for image processing datasets is posted at #2.

mrocklin commented 6 years ago

@ogrisel has mentioned the Criteo dataset, which is about 1 TB of click logs.

(although note that we don't necessarily only want to focus on large datasets)
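For anyone who wants to poke at the Criteo logs before the sprint, here is a minimal sketch of streaming them in fixed-size batches with only the standard library. It assumes each record is one tab-separated line holding a click label, then 13 integer features, then 26 hashed categorical features, with empty strings for missing values; please check that against the dataset's own description. The names `parse_line`, `iter_batches`, and `click_rate` are hypothetical helpers, and the gzip path is whatever file you downloaded.

```python
import gzip
from itertools import islice

N_INT = 13  # assumed number of integer feature columns
N_CAT = 26  # assumed number of hashed categorical columns


def parse_line(line):
    """Split one tab-separated record into (label, ints, cats)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 1 + N_INT + N_CAT:
        raise ValueError(f"expected {1 + N_INT + N_CAT} fields, got {len(fields)}")
    label = int(fields[0])
    # Empty strings mark missing values; represent them as None.
    ints = [int(v) if v else None for v in fields[1:1 + N_INT]]
    cats = [v if v else None for v in fields[1 + N_INT:]]
    return label, ints, cats


def iter_batches(lines, batch_size):
    """Yield lists of parsed records, holding only one batch in memory."""
    it = iter(lines)
    while True:
        batch = [parse_line(line) for line in islice(it, batch_size)]
        if not batch:
            return
        yield batch


def click_rate(path, batch_size=100_000):
    """Fraction of records with label 1 in one gzipped log file."""
    clicks = total = 0
    with gzip.open(path, "rt") as f:
        for batch in iter_batches(f, batch_size):
            clicks += sum(label for label, _, _ in batch)
            total += len(batch)
    return clicks / total if total else 0.0
```

The same batch iterator could feed an estimator that supports `partial_fit`, which is one way to see where single-machine scikit-learn starts to struggle.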

stsievert commented 6 years ago

Some of these tools might be useful:

I've played with the data.world integrations, and was pleasantly surprised. I haven't touched Kaggle or Quilt. I've used torch pretty extensively (though not torch.data) and enjoyed it. Some example datasets from these tools:

I think you mean "useful to scikit-learn", in which case I'll add the other Kaggle datasets and the other datasets on data.world or Quilt (I spent time looking for image datasets).

westurner commented 6 years ago

Diagnosing heart disease from DICOM MRI images https://www.kaggle.com/c/second-annual-data-science-bowl/data

In this dataset, you are given hundreds of cardiac MRI images in DICOM format. These are 2D cine images that contain approximately 30 images across the cardiac cycle. Each slice is acquired on a separate breath hold. This is important since the registration from slice to slice is expected to be imperfect.

The competition task is to create an automated method capable of determining the left ventricle volume at two points in time: after systole, when the heart is contracted and the ventricles are at their minimum volume, and after diastole, when the heart is at its largest volume.

westurner commented 6 years ago

Diagnosing lung cancer from DICOM CT images https://www.kaggle.com/c/data-science-bowl-2017/data

In this dataset, you are given over a thousand low-dose CT images from high-risk patients in DICOM format. Each image contains a series with multiple axial slices of the chest cavity. Each image has a variable number of 2D slices, which can vary based on the machine taking the scan and patient.

The DICOM files have a header that contains the necessary information about the patient id, as well as scan parameters such as the slice thickness.

The competition task is to create an automated method capable of determining whether or not the patient will be diagnosed with lung cancer within one year of the date the scan was taken. The ground truth labels were confirmed by pathology diagnosis.
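As a rough sketch of reading the header fields mentioned above (patient id and slice thickness), something like the following might work. It assumes the third-party `pydicom` package is installed and that the files carry the standard `PatientID` and `SliceThickness` attributes; `read_header` and `summarize_header` are hypothetical helper names, not part of the competition material.

```python
def read_header(path):
    """Read only the DICOM header, skipping the pixel data."""
    import pydicom  # assumed to be installed; imported lazily

    return pydicom.dcmread(path, stop_before_pixels=True)


def summarize_header(ds):
    """Pull the patient id and slice thickness out of a parsed header.

    `ds` can be any object exposing DICOM keyword attributes, such as
    the dataset returned by read_header. Missing fields become None.
    """
    patient_id = getattr(ds, "PatientID", None)
    thickness = getattr(ds, "SliceThickness", None)
    return {
        "patient_id": None if patient_id is None else str(patient_id),
        "slice_thickness_mm": None if thickness is None else float(thickness),
    }
```

Calling `summarize_header(read_header(path))` for every file in a scan directory gives a small table that can be grouped by patient before any pixel data is loaded.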

westurner commented 6 years ago

awesome-public-datasets has a bunch of datasets; some of which are useful for ML: https://github.com/awesomedata/awesome-public-datasets

Sandy4321 commented 4 years ago

Has anybody tried to run this locally? For example, like https://github.com/rambler-digital-solutions/criteo-1tb-benchmark