hubmapconsortium / portal-containers

Docker containers to pre-process data for visualization in the portal
MIT License

check dtypes in h5ad #15

Open mccalluc opened 4 years ago

mccalluc commented 4 years ago

Trevor notes:

@mccalluc: it would be good to check the dtypes in umap = ann_data.obsm['X_umap']. I would be surprised if the HDF5 data used unnecessarily large dtypes, but pandas defaults to float64 for numeric columns read from CSV. This was a headache I ran into with Arrow early on.
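A minimal sketch of the check described above, not the pipeline's actual code. It uses a random NumPy array as a stand-in for ann_data.obsm['X_umap'] so it runs without an h5ad file, and the helper name downcast_floats is hypothetical. It also assumes float32 precision is sufficient for plotting coordinates, which should be confirmed for the real data.

```python
import io

import numpy as np
import pandas as pd


def downcast_floats(array, target=np.float32):
    """Hypothetical helper: cast to `target` only if the array is a wider float."""
    array = np.asarray(array)
    is_wider_float = (
        np.issubdtype(array.dtype, np.floating)
        and array.dtype.itemsize > np.dtype(target).itemsize
    )
    return array.astype(target) if is_wider_float else array


# Stand-in for ann_data.obsm['X_umap'] (assumption: an (n_cells, 2) float array).
umap = np.random.default_rng(0).normal(size=(1000, 2))
print(umap.dtype, umap.nbytes)

umap32 = downcast_floats(umap)
print(umap32.dtype, umap32.nbytes)

# pandas parses CSV numerics as float64 unless a dtype is requested.
csv_text = "x,y\n0.5,1.5\n2.5,3.5\n"
df_default = pd.read_csv(io.StringIO(csv_text))
df_small = pd.read_csv(io.StringIO(csv_text), dtype=np.float32)
print(df_default.dtypes.tolist(), df_small.dtypes.tolist())
```

With a real file, the same check would start from the dtype attribute of the array stored under obsm['X_umap'].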

manzt commented 4 years ago

I would be surprised if numeric dtypes were huge (but good to check!). However, in my experience people forget that casting a pandas column with many repeated entries (e.g. cell type) to categorical can lead to a much lower memory footprint. When saving Arrow, I found that converting categorical columns had some nice benefits for the resulting Arrow size. I found it easiest to convert these types on the pandas.DataFrame and then let pyarrow take care of mapping them to Arrow-specific dtypes.
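A rough sketch of the categorical conversion described above, using made-up data. The column names cell_type and umap_x and the label values are assumptions for illustration, not the portal's schema. The pyarrow step is skipped if pyarrow is not installed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
labels = ["B cell", "T cell", "Macrophage"]  # illustrative values only

df = pd.DataFrame(
    {
        "cell_type": rng.choice(labels, size=10_000),
        "umap_x": rng.normal(size=10_000).astype(np.float32),
    }
)

# Memory used by the repeated strings before and after the cast.
before = df["cell_type"].memory_usage(deep=True)
df["cell_type"] = df["cell_type"].astype("category")
after = df["cell_type"].memory_usage(deep=True)
print(f"cell_type column: {before} bytes -> {after} bytes")

# Let pyarrow map the pandas dtypes; a categorical column becomes
# a dictionary-encoded Arrow column.
try:
    import pyarrow as pa

    table = pa.Table.from_pandas(df, preserve_index=False)
    is_dict = pa.types.is_dictionary(table.schema.field("cell_type").type)
    print(table.schema)
except ImportError:
    table = None  # pyarrow not available; the pandas part above still applies
```

Doing the cast on the DataFrame keeps the type decisions in one place, so no Arrow schema has to be written by hand.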

manzt commented 4 years ago

potentially useful: https://github.com/manzt/arrow-loader-demo/blob/c682ea7132830e45d7867e7d4a928fa063db6867/data/json2arrow.py#L20-L30