joaquinvanschoren closed this 2 weeks ago
I'd also prefer this. I'd go as far as to say that I'd prefer lazy loading for all data that requires disk or network operations.
In PR #1260 we made not downloading the data the default from 0.15.0 onwards.
Started working on this in the add/1034 branch.
Description
By default, datasets.get_dataset(data_id) currently always downloads the dataset: https://openml.github.io/openml-python/master/generated/openml.datasets.get_dataset.html#openml.datasets.get_dataset
This is problematic for large datasets: it takes a long time and may cause out-of-memory errors. Sometimes we need to look at the full metadata (of many datasets) without downloading the data. We can do that now with the option download_data=False, but it feels like this should be the default. Some users may also be unaware of this option, or of the fact that get_dataset will actually download the data and consume resources.
A simple solution would be to make download_data=False the default.
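To make the proposal concrete, here is a minimal sketch of what a metadata-first default could look like. It does not use openml-python at all: `LazyDataset`, `_fake_download`, `DOWNLOAD_LOG`, and this local `get_dataset` are hypothetical stand-ins invented for illustration, and the "download" is simulated in memory. Only the `download_data` parameter name is taken from the real API.

```python
from typing import Any, Callable, Dict, List, Optional

# Records every simulated download so the behaviour can be inspected.
DOWNLOAD_LOG: List[int] = []


def _fake_download(data_id: int) -> List[List[float]]:
    """Hypothetical stand-in for the expensive disk/network step."""
    DOWNLOAD_LOG.append(data_id)
    return [[float(data_id), 0.0], [float(data_id), 1.0]]


class LazyDataset:
    """Hypothetical dataset wrapper; not the openml-python dataset class."""

    def __init__(self, metadata: Dict[str, Any], loader: Callable[[], Any]) -> None:
        self.metadata = metadata  # cheap, available immediately
        self._loader = loader     # expensive, deferred until needed
        self._data: Optional[Any] = None
        self._loaded = False

    @property
    def is_loaded(self) -> bool:
        return self._loaded

    def get_data(self) -> Any:
        # Run the loader on first access only, then reuse the cached result.
        if not self._loaded:
            self._data = self._loader()
            self._loaded = True
        return self._data


def get_dataset(data_id: int, download_data: bool = False) -> LazyDataset:
    """Return metadata right away; fetch the data only if asked to."""
    metadata = {"id": data_id, "name": f"dataset-{data_id}"}
    dataset = LazyDataset(metadata, lambda: _fake_download(data_id))
    if download_data:
        dataset.get_data()  # eager path, kept for callers that opt in
    return dataset
```

With this shape, `get_dataset(42).metadata` is available without touching the data, and the first `get_data()` call is what triggers the (simulated) download. Callers who want the old behaviour can still pass `download_data=True`.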
Steps/Code to Reproduce
Expected Results
The dataset metadata is returned within seconds.
Actual Results
A long wait until the dataset has been downloaded and parsed.
Versions
macOS-10.16-x86_64-i386-64bit
Python 3.8.5 (default, Sep 4 2020, 02:22:02) [Clang 10.0.0 ]
NumPy 1.19.5
SciPy 1.5.2
Scikit-Learn 0.23.2
OpenML 0.11.1dev