Cache run predictions

PGijsbers commented 1 year ago

Problem

Run predictions are not cached when they are downloaded. Note that predictions are only downloaded when get_metric_fn is called in the first place (this is the desired behavior, since the description file already contains precomputed evaluations).

MWE

CLI: ls ~/.openml/org/openml/www/runs/10591753/
Output: ls: /Users/pietergijsbers/.openml/org/openml/www/runs/10591753/: No such file or directory

Execute:

import openml
import logging
from sklearn.metrics import accuracy_score

logging.basicConfig(level=logging.DEBUG)
run = openml.runs.get_run(10591753)
run.get_metric_fn(accuracy_score)

Output:

>>> run = openml.runs.get_run(10591753)
INFO:root:Starting [get] request for the URL https://www.openml.org/api/v1/xml/run/10591753
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.openml.org:443
DEBUG:urllib3.connectionpool:https://www.openml.org:443 "GET /api/v1/xml/run/10591753 HTTP/1.1" 307 336
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.openml.org:443
DEBUG:urllib3.connectionpool:https://api.openml.org:443 "GET /api/v1/xml/run/10591753 HTTP/1.1" 200 5112
INFO:root:0.1340468s taken for [get] request for the URL https://www.openml.org/api/v1/xml/run/10591753

>>> run.get_metric_fn(accuracy_score)
INFO:root:Starting [get] request for the URL https://www.openml.org/data/download/22111640/predictions.arff
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.openml.org:443
DEBUG:urllib3.connectionpool:https://www.openml.org:443 "GET /data/download/22111640/predictions.arff HTTP/1.1" 307 352
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.openml.org:443
DEBUG:urllib3.connectionpool:https://api.openml.org:443 "GET /data/download/22111640/predictions.arff HTTP/1.1" 200 None
INFO:root:0.0710640s taken for [get] request for the URL https://www.openml.org/data/download/22111640/predictions.arff

array([0.76623377, 0.5974026 , 0.72727273, 0.68831169, 0.7012987 ,
       0.75324675, 0.77922078, 0.77922078, 0.73684211, 0.67105263])

CLI: ls ~/.openml/org/openml/www/runs/10591753/
Output: description.xml

Note that there is no sign of the predictions ARFF file on disk, as you would expect from reading the source code.
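
The same check can be scripted instead of running ls by hand. This is a minimal sketch that assumes the cache layout shown in the listing above (~/.openml/org/openml/www/runs/<run_id>/). The helper name run_cache_contents and its cache_root parameter are hypothetical and not part of openml-python.

```python
from pathlib import Path


def run_cache_contents(run_id, cache_root=None):
    """Return the sorted file names cached for a run, or [] if nothing is cached.

    The default root mirrors the path from the `ls` output in this issue; it is
    an assumption about the local setup, not a value queried from openml.
    """
    if cache_root is None:
        cache_root = Path.home() / ".openml" / "org" / "openml" / "www" / "runs"
    run_dir = Path(cache_root) / str(run_id)
    if not run_dir.is_dir():
        # Matches the "No such file or directory" case before get_run is called.
        return []
    return sorted(p.name for p in run_dir.iterdir() if p.is_file())
```

With the behavior reported here, calling run_cache_contents(10591753) after get_metric_fn would be expected to list only description.xml.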

mfeurer commented 1 year ago

After an offline discussion with @PGijsbers we agreed that this should be an optional feature, i.e. that caching is disabled by default, but can be enabled.

PGijsbers commented 1 year ago

There are definitely cases where this is useful (experimenting with evaluation metrics or ensembling), but the average user probably doesn't load the same runs many times. Because cached predictions would quickly occupy a lot of disk space, we think opt-in is better.
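
To make the opt-in idea concrete, here is a rough sketch of a download helper that only touches the disk when caching is explicitly enabled. It is not how openml-python implements or will implement this: the function name fetch_with_optional_cache, the use_cache flag, and the download callable (a stand-in for the real HTTP request) are all hypothetical.

```python
from pathlib import Path


def fetch_with_optional_cache(url, cache_file, download, use_cache=False):
    """Return the text at `url`, reusing `cache_file` only when caching is enabled.

    `download` is any callable mapping a URL to text. All names in this
    sketch are hypothetical and not part of openml-python.
    """
    cache_file = Path(cache_file)
    if use_cache and cache_file.is_file():
        # Cache hit: skip the download entirely.
        return cache_file.read_text(encoding="utf-8")

    content = download(url)
    if use_cache:
        # Opt-in only: nothing is written unless the caller asked for it.
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(content, encoding="utf-8")
    return content
```

With the default use_cache=False the helper behaves like the current code path (download every time, nothing stored), so disk usage only grows for users who turn caching on.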

openml / openml-python

Cache run predictions #1191
