Optimize cell.parquet?

Not sure how cells.parquet is being generated, but we could try to be smart about the underlying data types used in the file to reduce its size. Since we're avoiding text files, we have a lot more knobs to tune.

Whether this matters depends on whether the file itself is a bottleneck, but as an example, I used polars (you could do the same with pandas) to specify more efficient data types (primarily casting strings to categoricals and integers to smaller bit widths):

import polars as pl

(
    pl
    .read_parquet("cells.parquet")
    .with_columns(
        counts_min=pl.col("counts_min").cast(pl.UInt32),
        counts_max=pl.col("counts_max").cast(pl.UInt32),
        counts_sum=pl.col("counts_sum").cast(pl.UInt32),
        counts_nnzero=pl.col("counts_nnzero").cast(pl.UInt32),
        observation_id=pl.col("observation_id").cast(pl.Categorical),
        label=pl.col("label").cast(pl.Categorical),
        label_id=pl.col("label_id").cast(pl.UInt16),
    )
    .write_parquet("cells-optimized.parquet")
)
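
For anyone who prefers pandas, here is a rough sketch of the same idea. The column names and target types come from the polars snippet above; the helper name `optimize_dtypes` and the explicit range check are my own additions, not something from the project. The check is there because I would not rely on a pandas integer downcast to complain about out-of-range values.

```python
import numpy as np
import pandas as pd

# Target dtypes mirror the polars snippet above.
INT_DTYPES = {
    "counts_min": "uint32",
    "counts_max": "uint32",
    "counts_sum": "uint32",
    "counts_nnzero": "uint32",
    "label_id": "uint16",
}
CATEGORICAL_COLUMNS = ["observation_id", "label"]


def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with narrower integer and categorical dtypes.

    Hypothetical helper: raises ValueError if a value would not fit
    in the narrower integer type.
    """
    for name, dtype in INT_DTYPES.items():
        info = np.iinfo(dtype)
        if df[name].min() < info.min or df[name].max() > info.max:
            raise ValueError(f"column {name!r} does not fit in {dtype}")
    return df.astype(
        {**INT_DTYPES, **{name: "category" for name in CATEGORICAL_COLUMNS}}
    )


# Usage (needs a parquet engine such as pyarrow installed):
# optimize_dtypes(pd.read_parquet("cells.parquet")).to_parquet("cells-optimized.parquet")
```

I haven't measured the pandas output, so the resulting file size may differ from the polars result.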

You can try out this script with uv 0.3.4 by copying it to your clipboard and piping it from stdin (pbpaste is macOS-only; substitute your platform's clipboard command elsewhere):

pbpaste | uv run --python 3.12 --with 'polars>=1.5' -

I'd probably put this in the category of "potential optimizations", but it may be worth thinking about before considering something like partitioning the cells.parquet file. The optimized file above is ~1/3 smaller (19M vs. 28M):

❯ ll | grep cells
.rw-r--r--  19M manzt 27 Aug 16:55 cells-optimized.parquet
.rw-r--r--  28M manzt 27 Aug 16:45 cells.parquet
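
If you want a number rather than eyeballing the listing, a minimal sketch like the following computes the relative reduction from the two files on disk. The function name `size_reduction` is hypothetical, and the file names are just the ones used above.

```python
from pathlib import Path


def size_reduction(original: str, optimized: str) -> float:
    """Fraction by which `optimized` is smaller than `original`."""
    original_bytes = Path(original).stat().st_size
    optimized_bytes = Path(optimized).stat().st_size
    return 1 - optimized_bytes / original_bytes


# Usage, assuming both files are in the current directory:
# print(f"{size_reduction('cells.parquet', 'cells-optimized.parquet'):.0%}")
```

With the rounded sizes from the listing, that works out to 1 - 19/28, or roughly 32%.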

manzt / quaklas

Optimize cell.parquet? #3