This puts in place the basics to improve MLDB's dataset loading and ML setup
operations:
- ContentDescriptors, so that we can refer to a dataset in a way that allows us to cache and share intermediate results
- Block-aware compression (currently for lz4 only), which allows compressed datasets to be chunked across multiple threads (zstd should be possible too but is not implemented yet)
- The use of large, contiguous, file-backable memory blocks behind the temporary datasets created, allowing larger-than-core operation when backed by a suitable SSD or other secondary storage
- Parallelized feature analysis, bucketing, and packing into optimized data structures, reducing the memory usage and memory bandwidth requirements of the setup phases for classic ML algorithms
- Implementation of better column analysis on Tabular dataset loading, so that less work needs to be done in the setup phase
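The block-aware compression point can be illustrated with a minimal sketch: the input is split into independently compressed blocks, so a reader can hand each block to a different thread. This is not MLDB's implementation; zlib stands in for lz4 here purely to keep the example dependency-free, and all names are hypothetical:

```python
# Illustrative sketch only (not MLDB code). Block-aware compression splits the
# input into independently compressed blocks, so decompression can be spread
# across threads. zlib stands in for lz4 to avoid third-party dependencies.
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024  # hypothetical block size

def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Compress each block independently so each can be decoded on its own."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def decompress_parallel(blocks: list[bytes], workers: int = 4) -> bytes:
    """Decompress blocks concurrently; each block is self-contained."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))

if __name__ == "__main__":
    data = b"row,col,value\n" * 200_000
    blocks = compress_blocks(data)
    assert decompress_parallel(blocks) == data
```

A whole-stream codec would force single-threaded decoding; keeping per-block independence (at a small ratio cost) is what lets a CSV loader feed every core.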
It allows the airlines CSV dataset to be loaded at around 2.3 million rows
per second (230MB/second) on an M1 Mac Mini, and at over 10 million rows
per second on a server-class machine (in particular, multicore scaling is
significantly better than before).