Proposal to include `numberOfInstances` and `numberOfFeatures` qualities in the dataset description

openml / OpenML

Open Machine Learning

https://openml.org

BSD 3-Clause "New" or "Revised" License

Proposal to include `numberOfInstances` and `numberOfFeatures` qualities in the dataset description #1101

Open PGijsbers opened 3 years ago

PGijsbers commented 3 years ago

The dataset description.xml contains some of the most useful metadata of the dataset. I think the number of instances/rows and the number of features should be added here. These qualities tend to be of the most interest (e.g. they would be a natural inclusion in openml-python's dataset representation), but obtaining them currently requires an additional download, which incurs user wait time and strains the server. There is already a precedent for including fields that directly reference the data (e.g. `default_target_attribute`, `ignore_attribute` and `row_id_attribute`). At the same time, I realize we want to be careful about slowly creating one monolithic file. The specific use case that led me to consider this is that the automl benchmark downloads qualities only to obtain the dataset dimensions. What do you think?

joaquinvanschoren commented 3 years ago

If it reduces the number of API calls this would be useful (even if the API call is a tiny bit slower). If we do this it doesn't really matter if we add 2 fields or a few more. E.g. the number of classes may also be useful? What should the return value be? Something like a parent tag 'qualities' and below that name-value pairs as children?

@sahithyaravi1493 could you check what the speed impact is? @janvanrijn any comments on this?
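For illustration, the response shape floated above (a parent `qualities` tag with name-value pairs as children) might look roughly like the sample below, together with a small parser for it. This is only a sketch under stated assumptions: the `<oml:qualities>` block is not part of the current description.xml, the tag names inside it are guesses, the sample values are illustrative, and `parse_embedded_qualities` is a hypothetical helper rather than anything in openml-python.

```python
import xml.etree.ElementTree as ET

# Hypothetical response: the <oml:qualities> block is NOT part of the current
# description.xml. It only illustrates the parent-tag / name-value shape
# suggested in the comment above. Values are illustrative.
SAMPLE_DESCRIPTION = """\
<oml:data_set_description xmlns:oml="http://openml.org/openml">
  <oml:id>61</oml:id>
  <oml:name>iris</oml:name>
  <oml:default_target_attribute>class</oml:default_target_attribute>
  <oml:qualities>
    <oml:quality>
      <oml:name>NumberOfInstances</oml:name>
      <oml:value>150</oml:value>
    </oml:quality>
    <oml:quality>
      <oml:name>NumberOfFeatures</oml:name>
      <oml:value>5</oml:value>
    </oml:quality>
    <oml:quality>
      <oml:name>NumberOfClasses</oml:name>
      <oml:value>3</oml:value>
    </oml:quality>
  </oml:qualities>
</oml:data_set_description>
"""

NS = {"oml": "http://openml.org/openml"}


def parse_embedded_qualities(xml_text):
    """Return {quality name: float value} from a description document.

    Returns an empty dict when the (hypothetical) qualities block is absent,
    so callers can fall back to a separate qualities request.
    """
    root = ET.fromstring(xml_text)
    qualities = {}
    for quality in root.findall("oml:qualities/oml:quality", NS):
        name = quality.findtext("oml:name", namespaces=NS)
        value = quality.findtext("oml:value", namespaces=NS)
        if name is not None and value is not None:
            qualities[name] = float(value)
    return qualities


if __name__ == "__main__":
    dims = parse_embedded_qualities(SAMPLE_DESCRIPTION)
    print(dims["NumberOfInstances"], dims["NumberOfFeatures"])
```

With this shape, adding a few more qualities costs one more child element each, which matches the point that it matters little whether 2 fields or a few more are added.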

janvanrijn commented 3 years ago

Why don't you use the dataset list function?

There you can get all tasks/datasets attached to a study, and the listing contains some important qualities (if available).
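As a rough sketch of this suggestion, the snippet below pulls dataset dimensions out of listing rows. It assumes each row is a mapping with a `did` key plus `NumberOfInstances` and `NumberOfFeatures` keys when those qualities are available; the row layout and the helper name `dimensions_from_listing` are assumptions made for illustration, and the sample rows are made up rather than fetched from the server.

```python
def dimensions_from_listing(rows):
    """Map dataset id -> (n_instances, n_features), or None if unavailable.

    `rows` is any iterable of dict-like records, assumed to be shaped like
    entries of a dataset listing. Qualities may be missing for datasets that
    have not been processed yet, hence the None fallback.
    """
    dims = {}
    for row in rows:
        n_instances = row.get("NumberOfInstances")
        n_features = row.get("NumberOfFeatures")
        if n_instances is None or n_features is None:
            dims[row["did"]] = None
        else:
            dims[row["did"]] = (int(n_instances), int(n_features))
    return dims


if __name__ == "__main__":
    # Made-up sample rows; a real listing would come from the server.
    sample_rows = [
        {"did": 1, "name": "a", "NumberOfInstances": 100.0, "NumberOfFeatures": 4.0},
        {"did": 2, "name": "b"},  # qualities not (yet) available
    ]
    print(dimensions_from_listing(sample_rows))
```

One listing request can cover every dataset in a study, so the dimensions arrive without a per-dataset qualities download.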

PGijsbers commented 3 years ago

In the case of the automl benchmark, we actually approach the dataset through its task (we know the task id). So both using `list_datasets` and getting the qualities directly require an extra query. That said, it looks like I had misunderstood some code in the benchmark, and I think we can work around this limitation now. It does require an update to openml-python, as it is still downloading data too eagerly.

Thanks everyone for the insight/discussion. I think it would still be interesting to know the effect this has on query time, but I see no reason to actually go forward with this proposal at this point.