spiraldb / vortex

An extensible, state-of-the-art columnar file format
https://vortex.dev
Apache License 2.0

Reading a Vortex array in Polars is slower than Parquet #1071

Open danking opened 1 month ago

danking commented 1 month ago

Polars reads the Parquet-encoded dataset in 800ms, whereas Vortex takes 4s.

Reproduction

Assuming you have the PBI parquet dataset downloaded, write a Vortex file:

import vortex
import pyarrow as pa
import pandas as pd
import polars as pl

# Read the Parquet file into an Arrow table and convert it to a Vortex array.
vtx = vortex.array(
    pa.Table.from_pandas(
        pd.read_parquet('bench-vortex/data/PBI/CMSprovider/parquet/CMSprovider_1.parquet')))
# Compress the array and write it out as a Vortex file.
compressed = vortex.encoding.compress(vtx)
vortex.io.write(compressed, 'bench-vortex/data/PBI/CMSprovider/vortex/CMSprovider_1.vortex')

Parquet takes 800-900ms:

%%time
import pyarrow.dataset  # required: `import pyarrow as pa` alone does not load pa.dataset

ds = pa.dataset.dataset(
    'bench-vortex/data/PBI/CMSprovider/parquet/CMSprovider_1.parquet'
)
lf = pl.scan_pyarrow_dataset(ds)
lf.collect()

Once the VortexDataset PR lands, you'll see that Vortex takes around 3s:

%%time
import vortex
import polars as pl

ds = vortex.arrow.VortexDataset(
    'bench-vortex/data/PBI/CMSprovider/vortex/CMSprovider_1.vortex'
)
lf = pl.scan_pyarrow_dataset(ds)
lf.collect()
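
To see where the ~3s goes, one rough check is to time the Vortex-to-Arrow conversion separately from the Polars scan. This is only a sketch: it assumes VortexDataset implements the standard pyarrow dataset interface (in particular to_table()), which pl.scan_pyarrow_dataset relies on, and then wraps the already-materialized table in Polars with rechunking disabled:

import time

import polars as pl
import vortex

ds = vortex.arrow.VortexDataset(
    'bench-vortex/data/PBI/CMSprovider/vortex/CMSprovider_1.vortex'
)

# Time the Vortex -> Arrow materialization on its own (assumes a
# pyarrow-Dataset-style to_table() method).
start = time.perf_counter()
tbl = ds.to_table()
print(f"to_table: {time.perf_counter() - start:.2f}s")

# Wrap the already-materialized Arrow table in Polars without rechunking,
# to see how much of the remaining time is Polars-side cost.
start = time.perf_counter()
df = pl.from_arrow(tbl, rechunk=False)
print(f"from_arrow(rechunk=False): {time.perf_counter() - start:.2f}s")
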
danking commented 1 month ago

It seems like most of the added time over converting to Arrow is just Polars cost, but Polars does perform a rechunk, which accounts for around 12% of total time. I'm not sure why Polars is rechunking our array. Maybe it's coming from here or here?
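
Polars generally rechunks when an incoming Arrow table's columns are split across many chunks, so one way to test that hypothesis is to inspect the chunk counts of the table the dataset hands back. A minimal sketch, again assuming VortexDataset exposes a pyarrow-Dataset-style to_table():

import vortex

ds = vortex.arrow.VortexDataset(
    'bench-vortex/data/PBI/CMSprovider/vortex/CMSprovider_1.vortex'
)
tbl = ds.to_table()

# If columns arrive in many chunks, Polars copies them into contiguous
# buffers on rechunk, which could account for the ~12% observed above.
for name in tbl.schema.names:
    print(name, tbl.column(name).num_chunks)
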