Instead of reading the ARFF files as Weka `Instances`, this PR switches to Weka's `ArffReader`. This way, the complete file is not read into memory; instead, it is evaluated row by row, reducing the memory footprint.

I changed only the behavior of `ProcessDataset`. Other places where ARFF files are read still load the complete file into memory (albeit more explicitly, by creating the `Instances` inside the `EvaluationEngine` instead of inside openml-weka). My understanding is that `ProcessDataset` was the biggest bottleneck.
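For reference, this is a minimal sketch of the incremental `ArffReader` pattern the change relies on (file name and capacity are illustrative, not taken from this PR): the two-argument constructor parses only the header, and `readInstance()` then yields one row at a time.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;

import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader.ArffReader;

public class StreamingArffExample {
    public static void main(String[] args) throws Exception {
        try (Reader in = new BufferedReader(new FileReader("dataset.arff"))) {
            // Two-argument constructor: only the ARFF header is parsed here.
            ArffReader arff = new ArffReader(in, 1000);
            Instances header = arff.getStructure();
            header.setClassIndex(header.numAttributes() - 1);

            Instance row;
            // readInstance() returns null at end of file, so rows are
            // processed one at a time instead of being held in memory.
            while ((row = arff.readInstance(header)) != null) {
                // evaluate/process the current row here
            }
        }
    }
}
```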
See https://github.com/openml/openml-weka/pull/27 for the corresponding openml-weka PR.
Any feedback is appreciated!