#683 was solved by switching a single line from polars df.explode(columns), which utilizes multiprocessing, to a pure numpy-based solution. The numpy-based solution gave a significant speed-up to ParquetDataset.__getitem__. In #677 I found ParquetDataset.__getitem__ to be ~1.8 times slower than its SQLite counterpart on a 1 million event sample. Following this PR, ParquetDataset.__getitem__ is ~1.2 times slower than its SQLite counterpart on the same sample.
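For readers unfamiliar with the idea, here is a minimal sketch of what a numpy-only replacement for an explode over list columns can look like. It is not the code changed in this PR: the function name `explode_list_columns`, the dict-of-arrays input, and the column names `charge` and `time` are hypothetical and chosen only for illustration.

```python
import numpy as np


def explode_list_columns(columns):
    """Flatten list-valued columns and return a matching row index.

    `columns` (hypothetical input format) maps a column name to a sequence
    of per-row arrays. In any given row, all columns must hold lists of the
    same length, which mirrors the requirement when exploding several
    columns at once.
    """
    if not columns:
        raise ValueError("at least one column is required")

    names = list(columns)
    # Number of list elements in each row, taken from the first column.
    lengths = np.array([len(row) for row in columns[names[0]]], dtype=np.int64)

    # Every other column must have the same per-row lengths.
    for name in names[1:]:
        other = np.array([len(row) for row in columns[name]], dtype=np.int64)
        if not np.array_equal(lengths, other):
            raise ValueError(f"column {name!r} has mismatched list lengths")

    # Row i is repeated lengths[i] times, so each flattened value can be
    # traced back to the row it came from.
    row_index = np.repeat(np.arange(len(lengths)), lengths)

    flat = {}
    for name in names:
        rows = [np.asarray(row) for row in columns[name]]
        # np.concatenate rejects an empty sequence, so handle zero rows.
        flat[name] = np.concatenate(rows) if rows else np.empty(0)
    return row_index, flat


# Illustrative data: two rows holding 2 and 1 list elements respectively.
row_index, flat = explode_list_columns(
    {
        "charge": [np.array([1.0, 2.0]), np.array([3.0])],
        "time": [np.array([10, 20]), np.array([30])],
    }
)
print(row_index)       # [0 0 1]
print(flat["charge"])  # [1. 2. 3.]
```

Everything in this sketch runs in the calling process using plain numpy calls. Note that a row with an empty list contributes no output rows here.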
Closes #683, closes #685.