Open mrocklin opened 6 years ago
@ogrisel has mentioned the Criteo dataset which is about 1TB of click logs.
(although note that we don't necessarily only want to focus on large datasets)
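For a click-log dataset of this size, the usual approach is out-of-core learning: hash the categorical features into a fixed-width sparse matrix and update a linear model one chunk at a time. A minimal sketch, using synthetic rows standing in for the real Criteo logs (reading the actual 1TB dump in chunks is left out):

```python
# Hedged sketch of out-of-core learning on Criteo-style click logs.
# The rows below are synthetic; feature names ("ad=...", "site=...") are
# made up for illustration, not the real Criteo schema.
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
hasher = FeatureHasher(n_features=2 ** 18, input_type="string")
clf = SGDClassifier(random_state=0)

def synthetic_chunk(n=1000):
    # Each sample: a few categorical tokens plus a binary click label.
    rows = [[f"ad={rng.randint(50)}", f"site={rng.randint(20)}"]
            for _ in range(n)]
    y = rng.randint(0, 2, size=n)
    return rows, y

for _ in range(5):  # one iteration per chunk of the log
    rows, y = synthetic_chunk()
    X = hasher.transform(rows)  # sparse, fixed width regardless of vocabulary
    clf.partial_fit(X, y, classes=[0, 1])
```

The hashing trick keeps memory constant no matter how many distinct feature values appear across the full log, which is what makes a 1TB dataset tractable on one machine.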
Some of these tools might be useful:
I've played with the data.world integrations, and was pleasantly surprised. I haven't touched Kaggle or Quilt. I've used torch pretty extensively (though not torch.data) and enjoyed it. Some example datasets from these tools:
torch.utils.data
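The map-style interface torch.utils.data builds on is just a pair of dunder methods, so it can be sketched without importing torch at all (the class and helper below are illustrative, not part of any library):

```python
# Sketch of the map-style dataset protocol torch.utils.data.Dataset expects:
# any object with __len__ and __getitem__. A real DataLoader would wrap
# an object like this directly; "batches" is a minimal stand-in for it.
class ToySquares:
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        return idx, idx * idx  # (input, target) pair

def batches(dataset, batch_size):
    # No shuffling or worker processes, just fixed-size batching.
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

ds = ToySquares(5)
all_batches = list(batches(ds, 2))
```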
What class is the object in a color image? These are (at least) tens of thousands of small images.

I think you mean "useful to scikit-learn", in which case I'll add the other Kaggle datasets and the other datasets on data.world or Quilt (I spent time looking for image datasets).
Diagnosing heart disease from DICOM MRI images https://www.kaggle.com/c/second-annual-data-science-bowl/data
In this dataset, you are given hundreds of cardiac MRI images in DICOM format. These are 2D cine images that contain approximately 30 images across the cardiac cycle. Each slice is acquired on a separate breath hold. This is important since the registration from slice to slice is expected to be imperfect.
The competition task is to create an automated method capable of determining the left ventricle volume at two points in time: after systole, when the heart is contracted and the ventricles are at their minimum volume, and after diastole, when the heart is at its largest volume.
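Once each short-axis slice has a segmented left-ventricle area, the volume is commonly approximated by summing the areas times the slice spacing (Simpson's rule over slices). A hedged sketch with made-up areas:

```python
# Hedged sketch: estimate LV volume from segmented per-slice areas.
# The areas and spacing below are synthetic illustrations, not real
# measurements from the Kaggle dataset.
import numpy as np

def lv_volume_ml(areas_mm2, slice_spacing_mm):
    # Sum of cross-sectional areas times spacing; mm^3 -> mL (1 mL = 1000 mm^3).
    return float(np.sum(areas_mm2) * slice_spacing_mm / 1000.0)

systole_areas = np.array([0.0, 300.0, 600.0, 600.0, 300.0, 0.0])  # mm^2
diastole_areas = systole_areas * 2.5
spacing = 8.0  # mm between slices, an assumed typical value

esv = lv_volume_ml(systole_areas, spacing)   # end-systolic volume
edv = lv_volume_ml(diastole_areas, spacing)  # end-diastolic volume
ejection_fraction = (edv - esv) / edv
```

The two volumes the competition asks for are exactly the systolic and diastolic estimates above; the ejection fraction is the clinically interesting derived quantity.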
Diagnosing lung cancer from DICOM CT images https://www.kaggle.com/c/data-science-bowl-2017/data
In this dataset, you are given over a thousand low-dose CT images from high-risk patients in DICOM format. Each image contains a series with multiple axial slices of the chest cavity. Each image has a variable number of 2D slices, depending on the machine taking the scan and on the patient.
The DICOM files have a header that contains the necessary information about the patient id, as well as scan parameters such as the slice thickness.
The competition task is to create an automated method capable of determining whether or not the patient will be diagnosed with lung cancer within one year of the date the scan was taken. The ground truth labels were confirmed by pathology diagnosis.
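Because each scan has a variable number of slices, a common preprocessing step is resampling along the slice axis to a fixed depth before feeding a model. A numpy-only sketch, with a random array standing in for the DICOM pixel data (which pydicom would normally supply):

```python
# Sketch: linearly resample a CT volume with a variable slice count to a
# fixed depth along axis 0. The "scan" array is synthetic stand-in data.
import numpy as np

def resample_depth(volume, target_depth):
    depth = volume.shape[0]
    # Positions of the target slices in the original slice-index space.
    src = np.linspace(0, depth - 1, target_depth)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, depth - 1)
    frac = (src - lo)[:, None, None]
    return (1 - frac) * volume[lo] + frac * volume[hi]

scan = np.random.RandomState(0).rand(37, 64, 64)  # 37 axial slices
fixed = resample_depth(scan, 32)                  # now exactly 32 slices
```

In practice the slice thickness from the DICOM header would be used to resample to a fixed physical spacing rather than a fixed slice count, but the interpolation is the same.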
awesome-public-datasets has a bunch of datasets, some of which are useful for ML: https://github.com/awesomedata/awesome-public-datasets
Has anybody tried to run it locally? For example: https://github.com/rambler-digital-solutions/criteo-1tb-benchmark
In order to make the most of our time at the scaling scikit-learn sprint, it might be helpful to prepare some challenge problems and datasets that we want to focus on before we arrive. Ideally these datasets and problems have the following qualities:
An equivalent issue for image processing datasets is posed at #2