Load data from sparse matrix format

ljjh20 commented 3 years ago

[ ] Load data from sparse matrix format. The .h5ad output of the Scanpy library is built around a sparse matrix with additional metadata columns. The sparse matrix stores the cell counts and is more memory efficient, allowing larger datasets to be loaded. Also, since most users of dimensionally reduced single-cell sequencing data will have their data in h5ad format, reading directly from it would be convenient. Scanpy provides a conversion to .zarr, so reading from .zarr into TSNEPlot.kt and visualizing from there may be a good approach.

Could restructure the data to just have cells, genes, and main metadata.

ljjh20 commented 3 years ago

Update 11.03.21: A roadblock with jhdf5 means I cannot read directly from the source files at https://figshare.com/articles/dataset/Tabula_Muris_Senis_Data_Objects/12654728. This is especially the case for annotations and observations, as each of these is packed into a compound dataset (consisting of various primitive types) that I was not able to read using jhdf5.

Writing the data to Zarr format has the advantage of not concatenating obs and var. An attempt using jZarr was also unsuccessful: it flattens the data and cannot output it in a useful form without additional libraries, which would add too much clutter. Another attempt, using the n5-Zarr library from Stephan Saalfeld, was more successful and could read most of the data effectively. However, because of the way Scanpy works, its HDF5-based h5ad format uses pointers as indices for obs and var, relating these annotations to the correct gene expression in X. Writing these indices as strings or reindexing in pandas would allow them to be read by n5-Zarr, but would break the file for Scanpy.

ljjh20 commented 3 years ago

Python Java interaction

A Python class I have written converts files containing a 2D UMAP into files that also contain a 3D UMAP. I write the annotations to CSVs; however, X is sparse, and making it dense and unchunked would severely hurt performance. As it contains only float values, it could be read by n5-Zarr, but that would require passing both the CSVs and the h5ad file converted to Zarr for XtraDimensionVR to handle, which is not very elegant.

ljjh20 commented 3 years ago

X

Sparse matrices in CSC and CSR format each consist of three arrays that, if read properly (such as with https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/mllib/linalg/SparseMatrix.html#apply(int,%20int) .apply(i, j), or by writing a reader manually), allow any non-zero value to be fetched. The arrays store a value only if it is non-zero, so zero values have to be inferred. The only dedicated way scipy offers to save a sparse matrix to disk is its own function (scipy.sparse.save_npz), which exists only in Python. It may be possible to save the three arrays as CSV, but their length may be prohibitive.
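To make the "writing a reader manually" option concrete, here is a minimal sketch of a lookup over the three CSR arrays. The names `data`, `indices`, and `indptr` follow scipy's naming convention; the function `csr_get` and the small example matrix are hypothetical and only for illustration. The same logic should carry over to Kotlin once the three arrays have been read.

```python
def csr_get(data, indices, indptr, row, col):
    """Return the value at (row, col) of a CSR matrix, or 0.0 if it is not stored.

    data    -- the non-zero values, row by row
    indices -- the column index of each entry in data
    indptr  -- indptr[r]..indptr[r + 1] is the slice of data/indices for row r
    """
    start, end = indptr[row], indptr[row + 1]
    # Scan only the entries stored for this row (hypothetical helper;
    # a linear scan avoids assuming the column indices are sorted).
    for pos in range(start, end):
        if indices[pos] == col:
            return data[pos]
    # Nothing stored at this position, so the value is an implicit zero.
    return 0.0


# Example 3 x 4 matrix:
# [[0, 2, 0, 0],
#  [1, 0, 0, 3],
#  [0, 0, 0, 0]]
data = [2.0, 1.0, 3.0]
indices = [1, 0, 3]
indptr = [0, 1, 3, 3]

value = csr_get(data, indices, indptr, 1, 3)   # stored entry
missing = csr_get(data, indices, indptr, 2, 0) # empty row, inferred zero
```

For CSC the roles of rows and columns are swapped: `indptr` delimits columns and `indices` holds row numbers.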

ljjh20 commented 3 years ago

ToDo: display a scale bar for gene expression, as expression is currently normalized to fit the color map, which goes from 0 to 10.
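One possible way to get scale bar labels is sketched below. It assumes the normalization is a linear min-max rescale onto the 0 to 10 color map range, which may not match what the viewer actually does. The function names and the `cmap_max` and `n_ticks` parameters are hypothetical.

```python
def normalize_expression(values, cmap_max=10.0):
    """Linearly rescale expression values to [0, cmap_max] (assumed min-max scaling).

    Returns the rescaled values plus the original min and max, which are
    what a scale bar needs in order to show real expression values.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All cells have the same expression; map everything to 0.
        return [0.0 for _ in values], lo, hi
    scaled = [(v - lo) / span * cmap_max for v in values]
    return scaled, lo, hi


def scale_bar_ticks(lo, hi, n_ticks=5, cmap_max=10.0):
    """Return (color map position, original expression value) pairs for the scale bar."""
    ticks = []
    for i in range(n_ticks):
        frac = i / (n_ticks - 1)
        ticks.append((frac * cmap_max, lo + frac * (hi - lo)))
    return ticks


scaled, lo, hi = normalize_expression([0.0, 2.0, 4.0])
ticks = scale_bar_ticks(lo, hi, n_ticks=3)
```

Keeping the original minimum and maximum alongside the rescaled values is the key point: the color map position can then always be translated back into an expression value for the label.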

ljjh20 commented 3 years ago

If I read and write an AnnData object (a dataset), it deconcatenates the observations out of the compound dataset form, but seems to compress repeating entries by replacing them with integers. However, the translation is not stored anywhere viewable and would need to be mapped manually for each dataset.

ljjh20 commented 3 years ago

Read directly from h5ad using jHDF5. Reading with Scanpy and writing to a new HDF5 file deconcatenates the compound datasets into groups, allowing me to use jHDF5. Unfortunately, this also calls the strings_to_categoricals() function, requiring the maps to be stored in 'uns' during the file conversion so that they can then be read in Kotlin.
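The kind of map that would need to be stored can be sketched in plain Python. This is not the AnnData or pandas implementation, only an illustration of the idea: each distinct string gets an integer code, and the ordered list of categories is what has to be saved (for example in 'uns') so the reader can translate the codes back. The function names and the example labels are hypothetical.

```python
def encode_categorical(labels):
    """Replace repeated strings with integer codes.

    Returns (codes, categories), where categories[code] is the original label.
    Sorting the categories is an assumption made here for reproducibility.
    """
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    codes = [index[label] for label in labels]
    return codes, categories


def decode_categorical(codes, categories):
    """Translate integer codes back into the original strings."""
    return [categories[code] for code in codes]


cell_types = ["T cell", "B cell", "T cell"]
codes, categories = encode_categorical(cell_types)
restored = decode_categorical(codes, categories)
```

On the Kotlin side only the decode step is needed: read the integer column and the stored category list, then index into the list.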

scenerygraphics / corvo-core

Load data from sparse matrix format #2