**Open** — manzt opened this issue 2 months ago
`cells.parquet` is generated by taking the TSV output of the `mx_inspect` command, which outputs a dataframe.
I then append the observation_id to each TSV and "join" it with the CSV of cell-type labels.
Then I concatenate all the TSV files and convert them to Parquet with DuckDB.
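The exact commands aren't shown in the thread, but the final concatenate-and-convert step can be sketched in DuckDB SQL along these lines (file names and options here are placeholders, not the actual pipeline paths):

```sql
-- Glob all per-sample TSVs and write one Parquet file in a single statement.
COPY (
    SELECT *
    FROM read_csv_auto('*.tsv', delim='\t', header=true)
) TO 'cells.parquet' (FORMAT PARQUET);
```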
Good idea to think about optimization. Let's keep it in mind as we think about long-term representations for large cell counts.
Not sure how `cells.parquet` is being generated... but we could try to be smart about the underlying data types used in the file to improve the file size. Once we avoid text files, we have a lot more knobs to tune. It depends on whether the file itself is a bottleneck, but for example I used polars (pandas would work similarly) to specify more efficient data types (primarily, casting strings to categoricals and integers to smaller bit widths):
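The script itself isn't reproduced in this excerpt; here is a minimal sketch of the same idea using pandas (the comment notes either library works), with hypothetical column names standing in for the real `cells.parquet` schema:

```python
import pandas as pd

def shrink_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Cast string columns to categoricals and downcast integer columns
    to the smallest bit width that holds their values."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].astype("category")
        elif pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

# Hypothetical columns; the real file would have the mx_inspect output
# joined with cell-type labels.
cells = pd.DataFrame({
    "cell_type": ["T cell", "B cell", "T cell"] * 1000,
    "x": list(range(3000)),
})
small = shrink_dtypes(cells)
# small.to_parquet("cells.parquet")  # requires pyarrow or fastparquet
```

Categoricals store each distinct string once plus small integer codes, which is where most of the savings come from on label-heavy columns like cell type.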
You can try out this script with uv 0.3.4 by copying it to your clipboard and piping it in on stdin:
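For example, on macOS (`pbpaste` prints the clipboard; on Linux something like `xclip -o` plays the same role):

```shell
# uv reads a script from stdin when given `-` as the script argument.
pbpaste | uv run -
```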
I'd probably put this in the category of "potential optimizations", but it's maybe something to think about before considering something heavier like partitioning the `cells.parquet` file. With the optimization above, the file is roughly one third smaller: