uci-ml-repo / ucimlrepo

Python package for dataset imports from UCI ML Repository
MIT License
199 stars, 80 forks

Suggestion: Add caching to package or saving/loading code to examples. #14

Open manfred-lindmark opened 2 months ago

manfred-lindmark commented 2 months ago

First of all thanks for the great tool, getting datasets should always be this simple.

I have a suggestion that would make it a bit easier to get started with these datasets. Downloading from UCI was pretty slow for me (several minutes for 6 MB, maybe because of my university's VPN), so it would be good to save the dataset locally the first time it is downloaded.

Caching it with Python's tempfile module might be a good idea. Otherwise, the usage example for this package could show how to save and reload the data, for instance:

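from ucimlrepo import fetch_ucirepo
import pickle
import os

dataset_id = 2
fname = f"id_{dataset_id}.pkl"  # local cache file, one per dataset id

if os.path.isfile(fname):
    # Cache hit: load the previously saved dataset from disk
    with open(fname, "rb") as f:
        data = pickle.load(f)
else:
    # Cache miss: download once, then save for later runs
    data = fetch_ucirepo(id=dataset_id)
    with open(fname, "wb") as f:
        pickle.dump(data, f)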

ripaul commented 1 month ago

I was just looking into exactly this, and it turns out you cannot pickle the dotdict objects that ucimlrepo uses. At least for me, loading fails with a strange error:

python3 test.py 
Traceback (most recent call last):
  File "/home/rpaul/proj/bnn-benchmark/src/test.py", line 10, in <module>
    data = pickle.load(f)
TypeError: 'NoneType' object is not callable

Searching for the error message leads to this Stack Overflow answer: https://stackoverflow.com/a/2050357. Adding the required methods to the dotdict resolves the issue for me. I opened a pull request for the change. It doesn't cache the downloaded data yet, but at least it lets you implement caching manually.
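
For anyone who wants to try the idea without the package, here is a minimal sketch. It is not the code from the pull request. It assumes the "required methods" are the pickle hooks `__getstate__` and `__setstate__`, and that the dotdict is a dict subclass whose `__getattr__` is `dict.get` (which returns None for a missing attribute instead of raising, a likely source of the "'NoneType' object is not callable" error). `DotDict`, `load_or_fetch` and `fake_fetch` are hypothetical names made up for this example; `fake_fetch` stands in for the real download.

```python
import os
import pickle
import tempfile


class DotDict(dict):
    """Hypothetical stand-in for a dict with attribute-style access."""

    __getattr__ = dict.get          # missing attributes resolve to None
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

    # Explicit pickle hooks, so pickle finds real methods and never
    # falls through to __getattr__ (which would hand back None).
    def __getstate__(self):
        return dict(self)

    def __setstate__(self, state):
        self.update(state)


def load_or_fetch(path, fetch):
    """Return the pickled object at `path`; call `fetch()` on a cache miss."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    data = fetch()
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return data


def demo():
    """Fetch twice through the cache and count how often fetch really ran."""
    calls = []

    def fake_fetch():
        # Placeholder for the slow download; returns nested DotDicts.
        calls.append(1)
        return DotDict(
            data=DotDict(features=[1, 2, 3]),
            metadata=DotDict(name="example"),
        )

    with tempfile.TemporaryDirectory() as cache_dir:
        path = os.path.join(cache_dir, "id_2.pkl")
        first = load_or_fetch(path, fake_fetch)   # miss: fetches and saves
        second = load_or_fetch(path, fake_fetch)  # hit: loads from disk
    return first, second, len(calls)
```

The demo writes into a temporary directory so it cleans up after itself. For a cache that survives between runs, you would pass a path in a persistent location instead.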