openml / meta

Repository for issues which are not for any one specific repository (e.g., governance, data models)

Build a meta-feature (evaluation) engine in Python #2

Open PGijsbers opened 2 months ago

PGijsbers commented 2 months ago

The evaluation engine is a server component that handles multiple tasks. It is currently implemented in Java, and we want to rebuild it in Python, split into one component per function, so that it is easier to maintain and more accessible to new contributors. One of its tasks is calculating meta-features over tabular datasets.

The engine should take tabular datasets and calculate a set of meta-features for them. Meta-features with an existing name should produce identical results, and as many of the currently available meta-features as possible should remain available. We probably want to work with PyMFE.
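To make the "identical results" requirement concrete, here is a minimal sketch of a check that compares stored meta-feature values against recomputed ones by name. The function name `compare_meta_features`, the tolerances, and the example values are all assumptions for illustration; none of them come from the existing engine or from PyMFE.

```python
import math


def compare_meta_features(existing, recomputed, rel_tol=1e-9, abs_tol=1e-12):
    """Compare stored meta-feature values against recomputed ones by name.

    Both arguments are plain dicts mapping a meta-feature name to a float.
    Returns a report with names that are no longer produced ("missing")
    and names whose value changed beyond the tolerance ("mismatched").
    """
    report = {"missing": [], "mismatched": {}}
    for name, old_value in existing.items():
        if name not in recomputed:
            report["missing"].append(name)
            continue
        new_value = recomputed[name]
        # Treat NaN on both sides as agreement: an undefined meta-feature
        # that stays undefined has not changed.
        both_nan = math.isnan(old_value) and math.isnan(new_value)
        if not both_nan and not math.isclose(
            old_value, new_value, rel_tol=rel_tol, abs_tol=abs_tol
        ):
            report["mismatched"][name] = (old_value, new_value)
    return report


# Made-up values, only to show the shape of the report.
existing = {
    "NumberOfInstances": 150.0,
    "NumberOfFeatures": 5.0,
    "ClassEntropy": 1.584962500721156,
}
recomputed = {"NumberOfInstances": 150.0, "NumberOfFeatures": 4.0}

report = compare_meta_features(existing, recomputed)
print(report)
```

A check along these lines could run per dataset during the migration, so that any meta-feature that disappears or changes value is flagged rather than silently replaced.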

PGijsbers commented 2 months ago

@joaquinvanschoren you are assigned and the issue is listed as "in progress". Could you write down what progress there is, if any? Then unassign yourself (assuming you are not working on this).

joaquinvanschoren commented 2 months ago

@NathanFCarvalho worked on this from March to June. He has written a script to compute meta-features with PyMFE which works on almost all datasets (tested on about 5000 datasets, but slow on the very large ones). It is only a script because PyMFE does most of the work.

All code and documentation is here: https://github.com/NathanFCarvalho/OpenML_Metafeature_Extraction

The remaining task would be to store the computed meta-features in OpenML, and rework the code so it can run as a cronjob. Side note: PyMFE uses different names for the meta-features, and they can be quite cryptic. Nathan made a mapping to more understandable names. However, these are not 100% the same as the existing meta-features. We need to decide whether we want to keep the old meta-features, or exclusively use the new ones for consistency.
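As a rough illustration of the renaming step, the sketch below applies a name mapping to extracted meta-features, assuming the extraction yields parallel lists of names and values (which is what PyMFE's `MFE.extract()` returns). The mapping `PYMFE_TO_READABLE` and the function `rename_meta_features` are hypothetical and are not taken from Nathan's repository; the name pairs are illustrative only, and whether each pair is computed identically would still need to be checked.

```python
# Hypothetical mapping from PyMFE names to more readable names.
# The pairs are illustrative; equivalence of the values is NOT verified here.
PYMFE_TO_READABLE = {
    "nr_inst": "NumberOfInstances",
    "nr_class": "NumberOfClasses",
}


def rename_meta_features(names, values, mapping):
    """Rename extracted meta-features and report names without a mapping.

    `names` and `values` are parallel lists. Returns a dict keyed by the
    readable name, plus the list of names that had no entry in `mapping`.
    """
    renamed = {}
    unmapped = []
    for name, value in zip(names, values):
        if name in mapping:
            renamed[mapping[name]] = value
        else:
            unmapped.append(name)
    return renamed, unmapped


# Made-up extraction result, only to show the behaviour.
names = ["nr_inst", "nr_class", "attr_ent.mean"]
values = [150, 3, 2.1]

renamed, unmapped = rename_meta_features(names, values, PYMFE_TO_READABLE)
print(renamed)
print(unmapped)
```

Returning the unmapped names separately keeps the decision open: they can be stored under their PyMFE names, or dropped, depending on whether the old or the new meta-features are kept.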

I unassigned myself since I have a lot on my plate already, but this should be a very doable and well-contained task.