ncats / multiplex-analysis-web-apps

https://ncats.github.io/multiplex-analysis-web-apps/

Port to anndata if that makes sense #96

Open andrew-weisman opened 3 months ago

andrew-weisman commented 3 months ago

We really need to see what the benefits are and whether they'll actually help.

andrew-weisman commented 3 months ago

Note: I checked on a 130k-row dataset that storing adata.X in the sparse format, in the most efficient way I could find, makes adata take up 83 MB, whereas the memory-efficient pandas DataFrame takes up 45 MB. That's probably because converting the numerical values to a NumPy array forces a single dtype that can hold every column, which here is float32. Even columns containing only 0's and 1's get cast to float32, which is inefficient.
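To illustrate the dtype effect described above, here is a minimal sketch using only NumPy and pandas. The column names, the number of flag columns, and the random data are made up for illustration and are not taken from our dataset, so the sizes it prints will not match the 83 MB and 45 MB figures. It only shows the mechanism: per-column dtypes versus one dtype for the whole array.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 130_000  # same row count as the dataset mentioned above

# Hypothetical columns: one float32 measurement plus eight 0/1 flags stored as uint8.
columns = {"intensity": rng.random(n_rows).astype(np.float32)}
for i in range(8):
    columns[f"flag_{i}"] = rng.integers(0, 2, size=n_rows).astype(np.uint8)
df = pd.DataFrame(columns)

# pandas keeps a dtype per column: 4 bytes per float value, 1 byte per flag.
df_bytes = int(df.memory_usage(index=False).sum())

# A single dense array needs one dtype for every column, here float32 (4 bytes each).
dense = df.to_numpy(dtype=np.float32)
dense_bytes = dense.nbytes

print(f"pandas, per-column dtypes: {df_bytes / 1e6:.2f} MB")
print(f"dense float32 array:       {dense_bytes / 1e6:.2f} MB")
```

In this toy case the flags cost 1 byte each in the DataFrame and 4 bytes each in the array. A sparse matrix avoids storing zeros, but it stores index information alongside each nonzero value, so it does not necessarily recover the difference.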

We could probably still use the memory-efficient pandas DataFrame as adata.X and roughly maintain our efficiency. But this shows that the only thing we might gain is a reorganization of our data (and possibly some compatibility benefits) rather than performance improvements. I'm still intrigued by how adata integrates with HDF5, though.

See anndata.ipynb in OneDrive for my notes on this.