manzt / quaklas


Optimize cells.parquet? #3

Open manzt opened 2 months ago

manzt commented 2 months ago

Not sure how cells.parquet is being generated, but we could try to be smart about the underlying data types used in the file to reduce its size. Since Parquet is a typed binary format rather than a text file, we have a lot more knobs to tune.

Whether this matters depends on whether the file itself is a bottleneck, but as an example, I used polars (pandas could do the same) to specify more efficient data types, primarily casting strings to categoricals and integers to smaller bit widths:

import polars as pl

(
    pl
    .read_parquet("cells.parquet")
    .with_columns(
        counts_min=pl.col("counts_min").cast(pl.UInt32),
        counts_max=pl.col("counts_max").cast(pl.UInt32),
        counts_sum=pl.col("counts_sum").cast(pl.UInt32),
        counts_nnzero=pl.col("counts_nnzero").cast(pl.UInt32),
        observation_id=pl.col("observation_id").cast(pl.Categorical),
        label=pl.col("label").cast(pl.Categorical),
        label_id=pl.col("label_id").cast(pl.UInt16),
    )
    .write_parquet("cells-optimized.parquet")
)

You can try out this script with uv 0.3.4 by copying it to your clipboard and piping it to stdin (pbpaste is the macOS clipboard command):

pbpaste | uv run --python 3.12 --with 'polars>=1.5' -

I'd probably put this in the category of "potential optimizations", but it may be worth thinking about before considering something like partitioning the cells.parquet file. The optimized file is ~1/3 smaller (19M vs. 28M):

❯ ll | grep cells
.rw-r--r--  19M manzt 27 Aug 16:55 cells-optimized.parquet
.rw-r--r--  28M manzt 27 Aug 16:45 cells.parquet
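
One way to choose the integer casts above, rather than hard-coding them, is to pick the narrowest unsigned type that still holds a column's observed range. The following is a minimal standard-library sketch, not part of the original script: the helper name `smallest_unsigned_dtype` is hypothetical, and it returns dtype names as strings that happen to mirror polars' `UInt8` through `UInt64`.

```python
# Upper bounds of the unsigned integer widths; names mirror polars' dtypes.
UNSIGNED_LIMITS = [
    ("UInt8", 2**8 - 1),
    ("UInt16", 2**16 - 1),
    ("UInt32", 2**32 - 1),
    ("UInt64", 2**64 - 1),
]


def smallest_unsigned_dtype(min_value: int, max_value: int) -> str:
    """Return the name of the narrowest unsigned dtype that holds the range.

    Hypothetical helper for illustration; not part of polars.
    """
    if min_value < 0:
        raise ValueError("negative values need a signed dtype")
    for name, limit in UNSIGNED_LIMITS:
        if max_value <= limit:
            return name
    raise ValueError("value does not fit in 64 bits")


print(smallest_unsigned_dtype(0, 300))     # UInt16
print(smallest_unsigned_dtype(0, 70_000))  # UInt32
```

In practice the min and max would come from the data itself, and it may be worth leaving headroom if future files could contain larger counts.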

sbooeshaghi commented 2 months ago

cells.parquet is generated by taking the tsv file output by the mx_inspect command, which writes out a dataframe:

https://github.com/cellatlas/mx/blob/1ce644679b95006ed3584baaa28652a94c48243f/mx/mx_inspect.py#L151-L163

I then append the observation_id to each tsv and "join" the csv with celltype labels

Then I concatenate all tsv files and convert to parquet with duckdb.
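
The tagging, joining and concatenating steps described above can be sketched with the standard library alone. Everything concrete here is assumed for illustration: the `barcode` join key, the toy columns, the in-memory inputs and the helper names are hypothetical, and the final conversion to Parquet (done with duckdb in the actual pipeline) is left out.

```python
import csv
import io


def load_observation(tsv_text, observation_id, labels):
    """Read one TSV, tag each row with its observation_id, and left-join labels.

    `labels` maps (observation_id, barcode) -> celltype label; the barcode
    key is an assumption, not taken from the mx_inspect output.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        row["observation_id"] = observation_id
        # Left join: cells without a label keep an empty string.
        row["label"] = labels.get((observation_id, row["barcode"]), "")
        rows.append(row)
    return rows


def concatenate(tables):
    """Concatenate per-observation row lists into one table."""
    return [row for table in tables for row in table]


# Toy inputs; real mx_inspect output will have different columns.
tsv_a = "barcode\tcounts_sum\nAAAC\t10\nAAAG\t7\n"
tsv_b = "barcode\tcounts_sum\nAAAC\t3\n"
labels = {("obs_a", "AAAC"): "T cell", ("obs_b", "AAAC"): "B cell"}

cells = concatenate([
    load_observation(tsv_a, "obs_a", labels),
    load_observation(tsv_b, "obs_b", labels),
])
print(len(cells))  # 3
```

Note that `csv.DictReader` yields every value as a string, which is the same loss of type information that makes a typed format attractive for the final file.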

Good idea to think about optimization. Let's keep it in mind as we think about long-term representations for large numbers of cells.