openml / openml-python

OpenML's Python API for a World of Data and More 💫
http://openml.github.io/openml-python/

Don't download (large) datasets by default #1034

Closed joaquinvanschoren closed 2 weeks ago

joaquinvanschoren commented 3 years ago

Description

In datasets.get_dataset(data_id) the default is currently to always download the dataset: https://openml.github.io/openml-python/master/generated/openml.datasets.get_dataset.html#openml.datasets.get_dataset

This is problematic for large datasets: it takes a long time and may cause out-of-memory errors. Sometimes we need to look at the full metadata (of many datasets) without downloading the data. We can already do that with the option download_data=False, but it feels like this should be the default. Some users may also be unaware of this option, or of the fact that get_dataset will actually download the data and consume resources.

A simple solution would be to make download_data=False the default.
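As a minimal sketch of the workaround mentioned above (the download_data parameter is from the issue; attributes such as dataset.name and the get_data() call are assumed from the OpenMLDataset object):

import openml

# Fetch only the dataset description, skipping the data download (current workaround).
dataset = openml.datasets.get_dataset(41081, download_data=False)
print(dataset.name)

# The data itself can still be materialized later, on demand.
X, y, categorical_indicator, attribute_names = dataset.get_data()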

Steps/Code to Reproduce

import openml
openml.datasets.get_dataset(41081)

Expected Results

The dataset metadata is returned within seconds.

Actual Results

A long wait until the dataset has been downloaded and parsed.

Versions

macOS-10.16-x86_64-i386-64bit
Python 3.8.5 (default, Sep 4 2020, 02:22:02) [Clang 10.0.0]
NumPy 1.19.5
SciPy 1.5.2
Scikit-Learn 0.23.2
OpenML 0.11.1dev

PGijsbers commented 3 years ago

I'd also prefer this. I'd even go so far as to prefer lazy loading for all data that requires disk or network operations.
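To illustrate the idea, a hypothetical sketch of such lazy loading (LazyDatasetHandle and its attributes are made up for illustration, not part of the openml-python API):

from functools import cached_property

class LazyDatasetHandle:
    """Hypothetical sketch: defer every disk/network operation until first access."""

    def __init__(self, dataset_id: int):
        # Only the cheap identifier is stored eagerly; nothing is downloaded here.
        self.dataset_id = dataset_id

    @cached_property
    def data(self):
        # The expensive download and parse happen here, once, on first access.
        import openml
        dataset = openml.datasets.get_dataset(self.dataset_id, download_data=True)
        return dataset.get_data()

handle = LazyDatasetHandle(41081)        # instant: no disk or network activity yet
X, y, categorical, names = handle.data   # download and parse happen only now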

LennartPurucker commented 1 year ago

We made not downloading the data the default from 0.15.0 onwards in PR #1260.
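A short sketch of what the reproduction from the issue looks like under this new default (the exact lazy-loading behavior is assumed from the description above, not verified against 0.15.0):

import openml

# With the new default, this returns quickly with metadata only;
# the dataset files are no longer downloaded up front.
dataset = openml.datasets.get_dataset(41081)
print(dataset.name)

# Explicitly materializing the data still triggers the download.
X, y, categorical_indicator, attribute_names = dataset.get_data()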

PGijsbers commented 4 weeks ago

Started working on this in the add/1034 branch.