Closed ljjh20 closed 3 years ago
Update 11.03.21: A roadblock with jhdf5 means I cannot read directly from the source files at https://figshare.com/articles/dataset/Tabula_Muris_Senis_Data_Objects/12654728. This is especially the case for the annotations and observations, as these are each packed into compound datasets (consisting of various primitives) that I was not able to read using jhdf5.
Writing the data to Zarr format has the advantage of not concatenating obs and var. An attempt using jZarr was also unsuccessful, as it flattens the data and cannot output it in a useful form without additional libraries, which would add too much clutter. Another attempt, using the n5-Zarr library from Stephan Saalfeld, was more successful and could read most of the data quite effectively. However, due to the way scanPy works, its HDF5-based format h5ad uses pointers as indices for obs and var, relating these annotations to the correct gene expression in X. Writing these indices as strings or reindexing in pandas would allow the file to be read by n5-Zarr, but would break it for scanPy.
Python Java interaction
A Python class I have written converts the files containing a 2D UMAP to ones that also contain a 3D UMAP. I write the annotations to CSVs; however, X is sparse, and making it dense and unchunked would severely hurt performance. As it just contains float values, it could be read by n5-Zarr, but that would require passing both the CSVs and the h5ad file converted to Zarr for XtraDimensionVR to handle, which is not very elegant.
X
The sparse matrix formats CSC and CSR each consist of three arrays (data, indices, and indptr) that, if read properly (such as with https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/mllib/linalg/SparseMatrix.html#apply(int,%20int) .apply(i, j), or by writing a reader manually), make it possible to fetch any non-zero value. The arrays only store a value if it is non-zero, so zero values have to be inferred from their absence. As far as I can tell, the only way to save a sparse matrix to disk from scipy is its own save function, whose format is only meant to be read back in Python. It may be possible to save the three arrays as CSV, but their length may be prohibitive.
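A manual reader for the three arrays could look roughly like the sketch below. It is written in plain Python for illustration only (the real reader would live on the Java/Kotlin side), the function name `csr_get` and the small example matrix are made up, and it assumes the CSR layout with no duplicate entries per row. For CSC the same logic applies with rows and columns swapped.

```python
def csr_get(data, indices, indptr, row, col):
    """Return the value at (row, col) of a CSR matrix, or 0.0 if it is not stored.

    data    -- the non-zero values, row after row
    indices -- the column index of each entry in `data`
    indptr  -- indptr[r]:indptr[r + 1] is the slice of `data` belonging to row r
    """
    start, end = indptr[row], indptr[row + 1]
    # Linear scan over the stored entries of this row; no sorting assumed.
    for k in range(start, end):
        if indices[k] == col:
            return data[k]
    # Nothing stored at this position, so the value is an implicit zero.
    return 0.0


# Hypothetical 3 x 3 example:
# [[1.0, 0.0, 2.0],
#  [0.0, 0.0, 3.0],
#  [4.0, 5.0, 0.0]]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
indices = [0, 2, 2, 0, 1]
indptr = [0, 2, 3, 5]

value = csr_get(data, indices, indptr, 2, 1)   # stored entry
missing = csr_get(data, indices, indptr, 1, 0)  # inferred zero
```

Only one row slice is touched per lookup, so the matrix never has to be made dense.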
ToDo: display scale bar for gene expression, as it is currently normalized to fit the color map that goes from 0 to 10.
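For the scale bar, one option is to keep the original minimum and maximum around when normalizing, so that positions on the color map can be labelled with real expression values. The sketch below assumes a simple linear min-max normalization onto 0 to 10, which may not be exactly what the current code does; `normalize` and `scale_bar_ticks` are hypothetical helper names.

```python
COLOR_MAX = 10.0  # upper end of the color map (it goes from 0 to 10)


def normalize(values, color_max=COLOR_MAX):
    """Linearly map values onto [0, color_max]; also return the original range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values identical: put everything at the bottom of the color map.
        return [0.0] * len(values), lo, hi
    return [(v - lo) / span * color_max for v in values], lo, hi


def scale_bar_ticks(lo, hi, n_ticks=5, color_max=COLOR_MAX):
    """Return (color map position, original expression value) pairs for the bar."""
    ticks = []
    for k in range(n_ticks):
        frac = k / (n_ticks - 1)
        ticks.append((frac * color_max, lo + frac * (hi - lo)))
    return ticks


scaled, lo, hi = normalize([0.0, 2.0, 4.0])
ticks = scale_bar_ticks(lo, hi, n_ticks=3)
```

The second element of each tick is the label to draw next to the color at the first element.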
If I read and write an AnnData object (a dataset), it deconcatenates the observations out of their compound dataset form, but seems to compress repeating entries by replacing them with integers. However, the translation is not stored anywhere viewable and would need to be manually mapped for each dataset.
Read directly from h5ad using jHDF5: reading with scanPy and writing to a new HDF5 file deconcatenates the compound datasets into groups, allowing me to use jHDF5. Unfortunately, this also calls the strings_to_categoricals() function, requiring the maps to be stored in 'uns' during the file conversion, to then be read in Kotlin.
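The integer replacement amounts to a categorical encoding: each column becomes a list of integer codes plus a list of category names. The plain-Python sketch below shows the kind of map that would have to be saved and then applied on the reading side. The function names, the column name `cell_ontology_class`, and the `uns_maps` dict are hypothetical, and the sorted category order is an assumption rather than something taken from the files.

```python
def encode_categorical(labels):
    """Replace repeated strings by integer codes; return (codes, categories)."""
    categories = sorted(set(labels))  # assumed ordering of the categories
    code_of = {name: code for code, name in enumerate(categories)}
    codes = [code_of[label] for label in labels]
    return codes, categories


def decode_categorical(codes, categories):
    """Translate integer codes back into the original strings."""
    return [categories[code] for code in codes]


labels = ["T cell", "B cell", "T cell", "NK cell"]
codes, categories = encode_categorical(labels)

# Hypothetical stand-in for what would be written to 'uns' during conversion:
# one list of category names per annotation column.
uns_maps = {"cell_ontology_class": categories}

restored = decode_categorical(codes, uns_maps["cell_ontology_class"])
```

With the category list stored per column, the reader only needs an array lookup to get the strings back.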
Could restructure the data to just have cells, genes, and main metadata.

![image](https://user-images.githubusercontent.com/45041058/106532072-5a758f00-64a4-11eb-9850-4a595668bd81.png)