Open ilan-gold opened 4 years ago
For genes/clusters/heatmap, we should also support conversion to Zarr, as that is what I have used for the Satija data, and we already have a "loader" for that data type in Vitessce. I would vote to no longer support the genes.json and clusters.json formats, since they do not scale well and a Zarr-based format can replace them: https://github.com/hubmapconsortium/vitessce-data/blob/master/snakemake/satija/src/convert_h5ad_to_zarr.py#L41
If we want to support Arrow as well, then a loader will need to be written for it: https://github.com/hubmapconsortium/vitessce/tree/master/src/loaders
For cell sets, as far as what conversions to support, my thought would be:
support conversion from a dataframe that matches the CSV format that we receive from the Satija lab (flat cell type annotations that we expand into a hierarchy using the EBI Cell Ontology)
if someone wants a full hierarchy consisting of something else, they will need to define the full JSON themselves, but something else that this package could offer is validation (in Python, against the same JSON schemas that we use in the client)
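To make the first conversion concrete, here is a minimal sketch of expanding flat annotations into a hierarchy. Everything in it is an assumption for illustration: `PARENT_OF` is a hand-written stand-in for lookups against the EBI Cell Ontology, `build_hierarchy` and `lineage` are hypothetical names, and the `name`/`children`/`set` node layout may not match the real cell-sets JSON schema.

```python
# Hypothetical stand-in for ontology lookups: child cell type -> parent cell type.
PARENT_OF = {
    "CD4 T cell": "T cell",
    "CD8 T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
}


def lineage(cell_type, parent_of):
    """Return the path from the root ancestor down to cell_type."""
    path = [cell_type]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return list(reversed(path))


def build_hierarchy(annotations, parent_of):
    """Expand flat {cell_id: cell_type} annotations into nested nodes.

    The node layout (name/children/set) is an assumed format, not the
    actual Vitessce schema.
    """
    root = {"name": "root", "children": []}
    for cell_id, cell_type in annotations.items():
        node = root
        for name in lineage(cell_type, parent_of):
            # Reuse an existing child with this name, or create it.
            child = next((c for c in node["children"] if c["name"] == name), None)
            if child is None:
                child = {"name": name, "children": [], "set": []}
                node["children"].append(child)
            node = child
        # Only the most specific node records the cell id.
        node["set"].append(cell_id)
    return root["children"]


# Flat annotations, as they might come from one column of a dataframe.
annotations = {"c1": "CD4 T cell", "c2": "CD8 T cell", "c3": "B cell"}
tree = build_hierarchy(annotations, PARENT_OF)
```

With a dataframe indexed by cell id, the `annotations` dict could come from `df["annotation"].to_dict()`, where the column name is again just an example.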
Overview
Right now we have code all over the place for creating Vitessce data/configs:
https://github.com/hubmapconsortium/portal-containers
https://github.com/hubmapconsortium/vitessce-data
https://github.com/hubmapconsortium/portal-ui/blob/master/context/app/api/vitessce.py
This is problematic, as it makes launching new Vitessce configs difficult and hard to communicate to people not familiar with our code. This problem is only going to grow, and as we gain users (probably other data portals), it would be good to have not only schemas for validating the data, but also a way of reliably generating the data.
The overarching goal here is to take in a Pandas DataFrame and output compliant Arrow (in the future), Zarr, OME-TIFF, and JSON data for Vitessce. A secondary goal could be to also create Vitessce configurations based on what data has been generated, basically pre-defined view configurations based on certain standard inputs (i.e., a genes/clusters + raster + cells/cell-sets without scatterplot gives what we have for CODEX, and with scatterplot gives Linnarsson minus one of the scatterplots).
I'll organize this issue by data type.
Genes/Clusters (Heatmap)
Our `genes` and `clusters` schemas convey very similar information, i.e., data per observation and a `max` for rendering. We should think about merging these, if possible, since if we can show one, we can show the other:
https://github.com/hubmapconsortium/portal-containers/blob/fb1910324fc796ff4b7d4e643de27ff2861e7d8c/containers/sprm-to-json/context/main.py#L125-L160
https://github.com/hubmapconsortium/vitessce-data/blob/master/python/cluster.py
https://github.com/hubmapconsortium/vitessce-data/blob/master/snakemake/satija/src/convert_h5ad_to_zarr.py
This might require an Arrow loader if it's too hard to parse out data properly using only one schema in the client across the two use cases, since they are used differently.
In any case, I think a function that takes in a Pandas DataFrame containing a Cell x Gene matrix and outputs JSON/Arrow should be the goal here. The index of such a DataFrame would be cell names and the column names genes. This will help with `Cells`/`Cell-Sets`.
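A rough sketch of such a function, for the JSON case only. The function name `expression_df_to_genes_dict` is hypothetical, and the output layout (a per-gene `max` plus per-cell values) is an assumption based on the description above, not a confirmed copy of the existing schema.

```python
import json

import pandas as pd


def expression_df_to_genes_dict(df):
    """Convert a Cell x Gene DataFrame (index: cell names, columns: genes)
    into a plain dict with a per-gene max and per-cell values.

    Hypothetical helper; the output layout is assumed, not the official schema.
    """
    result = {}
    for gene in df.columns:
        column = df[gene]
        result[str(gene)] = {
            "max": float(column.max()),
            "cells": {str(cell): float(value) for cell, value in column.items()},
        }
    return result


# Tiny example matrix: two cells, two genes.
example = pd.DataFrame(
    {"GeneA": [0.0, 2.5], "GeneB": [1.0, 3.0]},
    index=["cell_1", "cell_2"],
)
genes = expression_df_to_genes_dict(example)
genes_json = json.dumps(genes)
```

A per-gene JSON like this is also the kind of output that would not scale to large matrices, which is part of the argument for a Zarr or Arrow output.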
Cell-Sets/Cells
@keller-mark knows best (feel free to comment/edit this issue!) but this is a little bit more complicated since the two are intertwined, but not necessary/sufficient in both directions (like the above); that is, one could have "Cells" without "Cell-sets" but not really "Cell-Sets" without "Cells."
Like the above, we want a function that takes in a Pandas DataFrame and outputs JSON/Arrow, but the structure for the DataFrame is a little bit hairier (not just a labeled Cell x Gene matrix where the labels are basically unchecked). I foresee us needing to either strongly define an API or rely on a properly named DataFrame (i.e., each column has a specific name like `poly` or `xy`). I think we should probably go the route of an API, so that we have something like a function in which each string argument names a column in the dataframe `df` to be put into the JSON portion corresponding roughly to the `arg` key. The index of this dataframe will be cell ids, just like the above.
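One way the API route could look, as a sketch only. The function name `cells_df_to_dict`, its signature, and the example column names `centroid` and `outline` are all hypothetical; only the `xy` and `poly` keys come from the discussion here, and the exact output layout is an assumption.

```python
import pandas as pd


def cells_df_to_dict(df, xy=None, poly=None):
    """Build a {cell_id: {...}} dict from a DataFrame indexed by cell id.

    Each keyword argument names the DataFrame column whose values go under
    the JSON key of the same name. Hypothetical API sketch.
    """
    cells = {}
    for cell_id, row in df.iterrows():
        entry = {}
        if xy is not None:
            # A single (x, y) coordinate pair per cell.
            entry["xy"] = [float(v) for v in row[xy]]
        if poly is not None:
            # A list of (x, y) vertices per cell.
            entry["poly"] = [[float(x), float(y)] for x, y in row[poly]]
        cells[str(cell_id)] = entry
    return cells


# Column names are arbitrary; the caller maps them to JSON keys.
df = pd.DataFrame(
    {
        "centroid": [(1.0, 2.0), (3.0, 4.0)],
        "outline": [
            [(0, 0), (2, 0), (2, 4)],
            [(2, 2), (4, 2), (4, 6)],
        ],
    },
    index=["cell_1", "cell_2"],
)
cells = cells_df_to_dict(df, xy="centroid", poly="outline")
```

Passing column names explicitly means callers do not have to rename their columns, at the cost of a wider function signature as more keys are supported.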
`Cell_sets` is probably going to be a little harder. Maybe you could add something about this here, @keller-mark, in terms of what input data could look like.
Raster
This one is tricky as well. We should probably support both `tiff` and `zarr` via a flag. We'll need to set up the Docker container for `bioformats2raw`/`raw2ometiff` as a dependency (which I think can be done via the `setup.py` file). Beyond that, the other major pain point will be input data. Are we expecting `numpy` arrays? `dask` arrays? `zarr` stores? File paths? Perhaps all 4 can be possible?
@manzt can probably comment on this as well. I imagine most people will input `OME-TIFF` to `bioformats2raw`, but I think we can also handle other inputs and use our custom pyramid generator or something Python-specific (in contrast to `bioformats2raw`) that Glencoe writes.
Molecules
I think this will be relatively straightforward like the genes data - I think an input data frame with the index being molecule names plugged into an API is what we will use:
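As one possible shape for that API (a sketch under assumptions: the function name `molecules_df_to_dict`, the `x`/`y` column arguments, and the output layout of one coordinate list per molecule name are all hypothetical, not an existing Vitessce interface):

```python
import pandas as pd


def molecules_df_to_dict(df, x="x", y="y"):
    """Group rows by molecule name (the index) and collect [x, y] pairs.

    Hypothetical helper; the output layout is assumed for illustration.
    """
    molecules = {}
    # level=0 groups on the index, which holds the molecule names.
    for name, group in df.groupby(level=0, sort=False):
        molecules[str(name)] = group[[x, y]].astype(float).values.tolist()
    return molecules


# One row per detected molecule; the index repeats for repeated molecules.
molecule_df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4, 5, 6]},
    index=["Acta2", "Acta2", "Gad1"],
)
molecules = molecules_df_to_dict(molecule_df)
```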