openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Any documentation on OpenmlDataset and TaskConfig? #592

Open sedol1339 opened 9 months ago

sedol1339 commented 9 months ago

Hello! When looking at the implementations of custom frameworks in the examples folder, I noticed that they accept amlb.datasets.openml.OpenmlDataset and amlb.benchmark.TaskConfig. The first class differs from OpenMLDataset in the openml package, since it has different methods. Where can I find documentation for these classes?

Or if there is no documentation, how do I write a custom framework?

For example, dataset.features seems to be a list of features, and each feature has a data_type field. I noticed that it may have the values category and number. Can it have other values? The same question applies to all other fields.

sedol1339 commented 9 months ago

So far, I have tried to write my own documentation:

Class amlb.datasets.openml.OpenmlDataset:
    .features - list of amlb.data.Feature
    .fold - int
    .inference_subsample_files(fmt) - some unclear method
    .nrows - some field that raises an error
    .predictors - seems like list of amlb.data.Feature that are predictors
    .target - seems like amlb.data.Feature that is not target
    .release() - some unclear method
    .train - object of class amlb.datasets.openml.OpenmlDatasplit
    .test - object of class amlb.datasets.openml.OpenmlDatasplit
    .type - instance of enum amlb.data.DatasetType
      (binary = 1, multiclass = 2, regression = 3)

Class amlb.data.Feature:
    .data_type - string, may be 'category', 'number' (maybe also other values?)
    .has_missing_values - boolean
    .index - int
    .is_categorical() - boolean
    .is_numerical() - boolean
    .is_target - boolean
    .label_encoder - object of class amlb.datautils.Encoder
    .name - string name of feature
    .normalize(arr) - some unclear method
    .one_hot_encoder - object of class amlb.datautils.Encoder
    .values - looks like list of classes for categorical features, or None for numerical

Class amlb.datautils.Encoder:
    .classes - looks like list of classes for categorical features, or None for numerical
    Has also fields and methods: 'delegate', 'encoded_type', 'fit', 'fit_transform',
      'for_target', 'inverse_transform', 'missing_encoded_value', 'missing_policy',
      'missing_replaced_by', 'missing_values', 'normalize_fn', 'set_output', 'transform'

Class amlb.datasets.openml.OpenmlDatasplit:
    .X - some field that throws error: 'NoneType' object has no attribute 'config'
    .y - some field that throws error: 'NoneType' object has no attribute 'config'
    .data - some field that throws error: 'NoneType' object has no attribute 'config'
    .X_enc - looks like 2-dimensional np.ndarray of predictors (label-encoded for categorical)
    .y_enc - looks like 1-dimensional np.ndarray of target (label-encoded for classification)
    .data_enc - looks like .X_enc and .y_enc combined (the index of the target column is probably not always -1)
    .data_path(format) - some unclear method
    .dataset - looks like backward link to the dataset object
    .format - 'arff', may take other values
    .path - some field that throws error: 'NoneType' object has no attribute 'config'
    .release() - some unclear method

As you can see, there are still many fields and methods that are unclear or throw errors.

PGijsbers commented 9 months ago

Hi! Unfortunately, there is currently no documentation for these objects other than comments in their source code. Some, but not all, of the code has additional docstrings which clarify some of the questions you have. We recognize this makes for a bad experience and refactoring this is on our radar, as is updating our integration documentation.

Since most frameworks resemble ones that are already integrated, using existing integrations as documentation is generally the easiest way to achieve a functional integration. Additionally, you may use debugging tools (breakpoints or print statements) to inspect the objects at runtime, though that is far from ideal.

The main purpose of the run function in __init__.py is to prepare the data for the framework. Most of the time this means, if anything, unsparsifying potentially sparse data and/or encoding categorical features. The documentation you provided is largely correct. Below I list only the fields for which I have clarifications and/or questions.

Class amlb.datasets.openml.OpenmlDataset:
    .inference_subsample_files(fmt) - an experimental feature which is used to generate subsamples from the test data, used solely to measure inference times.
    .nrows - this field should not raise an error, if you can find a reproducible example please share it as a separate issue.
    .target - this *should* be the target feature to predict, if you can find a reproducible example where it provides a feature which is not the target, please share it as a separate issue.
    .release() - it releases a bunch of references to cached data in order to free up memory.

Class amlb.data.Feature:
    .data_type - a pandas compatible type
    .has_missing_values - whether or not the data has missing values
    .index - the column index of the feature in the original data frame
    .normalize(arr) - normalizes the feature name, should have been private
    .values - looks like list of classes for categorical features, or None for numerical => correct

Class amlb.datautils.Encoder: Used to encode features for frameworks which cannot deal with categorical data natively. You shouldn't need to use this directly if you are only writing a framework integration.
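For readers unfamiliar with the idea, label encoding in general replaces each category with an integer code. The sketch below is a generic illustration in plain Python with hypothetical helper names. It is not amlb.datautils.Encoder, and the ordering of codes and the handling of missing values in amlb may differ.

```python
# Generic illustration of label encoding; NOT the amlb Encoder implementation.

def fit_label_encoder(values):
    """Map each distinct category to an integer code (assumption: sorted order)."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


def transform(mapping, values):
    """Replace every category with its integer code."""
    return [mapping[v] for v in values]


def inverse_transform(mapping, codes):
    """Recover the original categories from integer codes."""
    reverse = {code: label for label, code in mapping.items()}
    return [reverse[c] for c in codes]


column = ["red", "green", "red", "blue"]
mapping = fit_label_encoder(column)
codes = transform(mapping, column)
restored = inverse_transform(mapping, codes)
print(mapping, codes, restored)
```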

Class amlb.datasets.openml.OpenmlDatasplit:
    .X - some field that throws error: 'NoneType' object has no attribute 'config' => how do you access this? can you create a new issue with MWE? It should be the <train|test> data without labels.
    .y - some field that throws error: 'NoneType' object has no attribute 'config' => same, but should be just the target column
    .data - some field that throws error: 'NoneType' object has no attribute 'config' => should be X+y
    .X_enc - looks like 2-dimensional np.ndarray of predictors (label-encoded for categorical)
    .y_enc - looks like 1-dimensional np.ndarray of target (label-encoded for classification)
    .data_enc - looks like .X_enc and .y_enc combined => correct
    .data_path(format) - retrieves the name for the file of the data in the requested format
    .dataset - looks like backward link to the dataset object => correct
    .format - 'arff', may take other values => also 'csv' and 'parquet' for non-openml datasets
    .path - some field that throws error: 'NoneType' object has no attribute 'config' => Should be the path to the file on disk in the default format
    .release() - again releases resources
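To make the role of the encoded splits more concrete, here is a small sketch of code that consumes .X_enc and .y_enc from the train and test splits. FakeSplit and FakeDataset are hypothetical stand-ins that mirror only the fields discussed in this thread (plain lists instead of np.ndarray). They are not the amlb classes, and the return value of this run function is an assumption for illustration, not what amlb expects from an integration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


# Hypothetical stand-ins mirroring only the fields discussed in this thread.
@dataclass
class FakeSplit:
    X_enc: List[List[float]]  # predictors, categorical columns label-encoded
    y_enc: List[int]          # target, label-encoded for classification


@dataclass
class FakeDataset:
    train: FakeSplit
    test: FakeSplit


def run(dataset):
    """Fit a majority-class baseline on the train split and score it on the test split."""
    majority = Counter(dataset.train.y_enc).most_common(1)[0][0]
    predictions = [majority for _ in dataset.test.X_enc]
    truth = dataset.test.y_enc
    accuracy = sum(p == t for p, t in zip(predictions, truth)) / len(truth)
    return predictions, accuracy


dataset = FakeDataset(
    train=FakeSplit(X_enc=[[0, 1.5], [1, 2.0], [0, 0.3]], y_enc=[1, 1, 0]),
    test=FakeSplit(X_enc=[[1, 0.7], [0, 1.1]], y_enc=[1, 0]),
)
predictions, accuracy = run(dataset)
print(predictions, accuracy)
```

For a real integration, the existing framework folders in the repository remain the reference for the actual signature and result format.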

Hope that helps a little. If you have any questions, don't hesitate to ask.