openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Add support for sparse data #186

Open sebhrusen opened 3 years ago

sebhrusen commented 3 years ago

Currently, sparse datasets are automatically converted into dense data, producing extremely large in-memory datasets that can lead to OOM errors. OpenML provides some datasets in sparse ARFF format: see for example https://www.openml.org/t/317613
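To give a rough idea of the blow-up, here is a back-of-the-envelope sketch. The shape and the number of non-zeros per row are hypothetical (not taken from the task linked above), and the helper names `dense_bytes` and `coo_bytes` are made up for illustration. It assumes float64 values and 32-bit indices in a COO-style layout.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed to store every cell of the matrix (float64 by default)."""
    return n_rows * n_cols * itemsize


def coo_bytes(nnz, itemsize=8, index_itemsize=4):
    """Rough bytes for a COO layout: one value plus a (row, col) index pair per stored entry."""
    return nnz * (itemsize + 2 * index_itemsize)


# Hypothetical dataset: 100k rows, 50k features, about 50 non-zero values per row.
n_rows, n_cols, nnz_per_row = 100_000, 50_000, 50
nnz = n_rows * nnz_per_row

dense_gb = dense_bytes(n_rows, n_cols) / 1e9
sparse_gb = coo_bytes(nnz) / 1e9
print(f"dense:  {dense_gb:.1f} GB")
print(f"sparse: {sparse_gb:.2f} GB")
```

Under these assumptions the dense representation is several hundred times larger than the sparse one, which is the kind of gap that leads to OOM.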

The benchmark app needs to be able to load sparse data and pass it to frameworks without converting it to dense data; handling the sparse data is then the frameworks' responsibility. If they don't support it, we can provide a utility function to convert it into dense data, knowing that this may lead to OOM in some situations.
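A minimal sketch of what such a utility could look like, assuming the data reaches the framework integration as a `scipy.sparse` matrix. The function name `to_dense` and the `max_bytes` guard are hypothetical and not part of the current benchmark code; the guard simply fails early instead of letting the conversion exhaust memory.

```python
import numpy as np
import scipy.sparse as sp


def to_dense(X, max_bytes=2 * 1024**3):
    """Return X as a dense ndarray, refusing when the result would exceed max_bytes."""
    if not sp.issparse(X):
        # Already dense (or array-like): nothing to convert.
        return np.asarray(X)
    n_rows, n_cols = X.shape
    required = n_rows * n_cols * X.dtype.itemsize
    if required > max_bytes:
        raise MemoryError(
            f"dense conversion needs {required} bytes, limit is {max_bytes}"
        )
    return X.toarray()


# Small random sparse matrix, for illustration only.
X_sparse = sp.random(10, 5, density=0.2, format="csr", random_state=0)
X_dense = to_dense(X_sparse)
print(type(X_dense).__name__, X_dense.shape)
```

Frameworks that accept sparse input would skip this call entirely; the others could call it explicitly, so the memory cost is opt-in rather than automatic.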

We may want to implement https://github.com/openml/automlbenchmark/issues/116 first and then use pandas sparse dataframes.
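For reference, a small sketch of what the pandas route could look like, using the `DataFrame.sparse` accessor. The matrix here is randomly generated and the column names are placeholders; how this would plug into the benchmark's dataset loading is left open.

```python
import pandas as pd
import scipy.sparse as sp

# Random sparse matrix standing in for a dataset loaded from sparse ARFF.
X = sp.random(1000, 200, density=0.01, format="csr", random_state=0)

# Wrap it in a DataFrame whose columns are SparseArrays (no densification).
df = pd.DataFrame.sparse.from_spmatrix(
    X, columns=[f"f{i}" for i in range(X.shape[1])]
)
print(df.sparse.density)  # fraction of explicitly stored values

# Frameworks that want a scipy matrix can get one back without going dense.
X_back = df.sparse.to_coo().tocsr()
```

This would keep column names and a DataFrame interface for frameworks that expect one, while still allowing a cheap round trip to a `scipy.sparse` matrix.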

sebhrusen commented 3 years ago

We need to verify sparse data handling now that https://github.com/openml/automlbenchmark/pull/293 is merged.