vitessce / vitessce-data

Utils for loading HuBMAP data formats
MIT License

Refactor for Packaging #110

Open ilan-gold opened 4 years ago

ilan-gold commented 4 years ago

Overview

Right now we have code all over the place for creating Vitessce data/configs:

https://github.com/hubmapconsortium/portal-containers https://github.com/hubmapconsortium/vitessce-data https://github.com/hubmapconsortium/portal-ui/blob/master/context/app/api/vitessce.py

This is problematic because it makes launching new Vitessce configs difficult, and the process is hard to communicate to people not familiar with our code. The problem will only grow, and as we gain users (probably other data portals), it would be good to have not only schemas for validating the data but also a way of reliably generating the data.

The overarching goal here is to take in a Pandas DataFrame and output compliant Arrow (in the future), Zarr, OME-TIFF, and JSON data for Vitessce. A secondary goal could be to also create Vitessce configurations based on what data has been generated: basically pre-defined view configurations based on certain standard inputs (e.g., genes/clusters + raster + cells/cell-sets without a scatterplot gives what we have for CODEX, and with a scatterplot gives Linnarsson minus one of the scatterplots).

I'll organize this issue by data type.

Genes/Clusters (Heatmap)

Our genes and clusters schemas convey very similar information, i.e., data per observation and a max for rendering. We should think about merging them, if possible, since if we can show one, we can show the other:

https://github.com/hubmapconsortium/portal-containers/blob/fb1910324fc796ff4b7d4e643de27ff2861e7d8c/containers/sprm-to-json/context/main.py#L125-L160

https://github.com/hubmapconsortium/vitessce-data/blob/master/python/cluster.py

https://github.com/hubmapconsortium/vitessce-data/blob/master/snakemake/satija/src/convert_h5ad_to_zarr.py

This might require an Arrow loader if it's too hard to parse the data properly in the client using a single schema across the two use cases, since they are used differently.

In any case, I think a function that takes in a Pandas DataFrame containing a Cell x Gene matrix and outputs JSON/Arrow should be the goal here. The index of such a DataFrame would be cell names and the column names genes. This will help with Cells/Cell-Sets.

>>> df_genes
            Actin       CD107a        CD11c       CD20          CD21  CD31         CD3e          CD4         CD45        CD45RO         CD68           CD8       DAPI_2         E_CAD   Histone_H3          Ki67  Pan_CK    Podoplanin
Unnamed: 0                                                                                                                                                                                                                            
1             0.0  3825.083089  2172.038856   0.000000  13118.704545   0.0  2619.149560  2258.743646  3018.150782  13766.025415  2475.430352  17811.810362  2472.491447  13831.021750  2155.434995  12023.281769     0.0  12854.526882
2             0.0  3158.566135  1905.015101   6.866331   9662.850531   0.0  2279.843261  2059.656600  2866.507131   9865.706096  2220.703160  10513.558166  1972.618289  10445.596337  1802.067673   8310.784396     0.0   9166.099972
3             0.0  2112.107533  1464.033661   0.935408   8152.397926   0.0  1778.593705  1477.261827  2401.413574   7463.324054  1703.527838   6728.968341  2594.646470   8001.948144  1467.260735   6173.303675     0.0   7050.821325
4             0.0  2409.139601  1568.258547  30.035613  12435.782407   0.0  1835.470442  1643.249288  2789.540598   7843.279558  1962.359687   7357.050570  2328.332977  11190.447293  1503.501068   6625.033120     0.0   8061.569801
5             0.0  1789.038279  1165.606538  23.199695   6595.104505   0.0  1401.826389  1163.010501  1994.819783   5216.277778  1378.526423   4899.289804  1745.914973   6385.073679  1220.704268   4540.830454     0.0   4463.399051
...           ...          ...          ...        ...           ...   ...          ...          ...          ...           ...          ...           ...          ...           ...          ...           ...     ...           ...
2653          0.0  1528.167373  1040.252119  71.731638   9857.117232   0.0  1133.142655  1081.707627  2482.951977   5863.394068  1245.564972   6276.619350  2695.375000   7168.248588  1072.548729   5214.332627     0.0   5677.270480
2654          0.0   866.767553   579.135481   7.370484   3924.449898   0.0   698.100375   555.293286  1207.978357   2482.735515   713.964724   1805.677062  1886.900818   2124.561350   615.980061   1431.171097     0.0   1684.441207
2655          0.0  1534.898357   949.947653   1.008920   6614.136854   0.0  1718.979343  1471.665023  1850.167840   6816.869014  1180.052113   4810.176761  1911.350939   5107.615493   918.007746   4728.398592     0.0   5064.655399
2656          0.0  1643.330193  1080.667150  23.054348   6832.027778   0.0  1456.217874  1124.606763  2271.074879   5281.138406  1362.480193   5671.768116  1566.910870   5627.569565   986.648792   4990.973913     0.0   5253.209420
2657          0.0  2407.073093  2120.567444   2.307910  12124.994703   0.0  4122.323093  3009.756356  3979.926907  14120.478814  2581.693856  12566.961511  2934.979520  11720.578390  1956.343220  11260.825212     0.0  12085.653249

[2657 rows x 18 columns]
>>> generate_cell_by_gene(df_genes)
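As a rough sketch of what `generate_cell_by_gene` could do for the JSON case: the output keys (`cells`, `genes`, `max`, `matrix`) are placeholders I made up for illustration, not the current genes/clusters schemas, and the sketch assumes expression values are non-negative. It keeps a per-gene max for rendering next to a normalized matrix, so one output could serve both uses.

```python
import json

import pandas as pd


def generate_cell_by_gene(df: pd.DataFrame) -> dict:
    """Hypothetical sketch: turn a Cell x Gene DataFrame into a JSON-ready dict."""
    if not df.index.is_unique:
        raise ValueError("cell ids (the index) must be unique")
    if not df.columns.is_unique:
        raise ValueError("gene names (the columns) must be unique")

    # Per-gene maximum, kept so a client can scale colors when rendering.
    gene_max = df.max(axis=0)
    # Genes that are zero in every cell (like Actin above) would divide by zero,
    # so normalize those by 1 instead.
    safe_max = gene_max.where(gene_max > 0, 1.0)
    normalized = df / safe_max

    return {
        "cells": [str(cell) for cell in df.index],
        "genes": [str(gene) for gene in df.columns],
        "max": [float(value) for value in gene_max],
        # One row per gene, one value per cell, scaled to [0, 1].
        "matrix": normalized.T.to_numpy().tolist(),
    }


# Tiny stand-in for the real df_genes shown above.
df_genes = pd.DataFrame(
    {"Actin": [0.0, 0.0, 0.0], "CD20": [0.0, 6.0, 3.0]},
    index=pd.Index([1, 2, 3], name="cell"),
)
result = generate_cell_by_gene(df_genes)
print(json.dumps(result))
```

An Arrow or Zarr writer could sit behind the same function signature later, with the output format chosen by an argument.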

Cell-Sets/Cells

@keller-mark knows best (feel free to comment on or edit this issue!), but this is a little more complicated because the two are intertwined without being interchangeable in the way genes and clusters above are: one could have "Cells" without "Cell-Sets", but not really "Cell-Sets" without "Cells".

Like the above, we want a function that takes in a Pandas DataFrame and outputs JSON/Arrow, but the structure of the DataFrame is a little hairier (not just a labeled Cell x Gene matrix where the labels are basically unchecked). I foresee us needing to either strongly define an API or rely on a properly named DataFrame (i.e., each column has a specific name like poly or xy). I think we should probably go the API route so we have something like:

>>> df
                                                        Shape  Actin       CD107a        CD11c       CD20          CD21  CD31         CD3e  ...          Ki67  Pan_CK    Podoplanin  Mean  Covariance  Total  Mean All  Shape Vectors
id                                                                                                                                  ...                                                                                      
1           [[0.0, 100.5], [1.0232, 100.5232], [1.7536, 10...    0.0  3825.083089  2172.038856   0.000000  13118.704545   0.0  2619.149560  ...  12023.281769     0.0  12854.526882     4           6      6         2              3
2           [[0.0, 130.5], [1.0798, 130.5798], [1.8667, 13...    0.0  3158.566135  1905.015101   6.866331   9662.850531   0.0  2279.843261  ...   8310.784396     0.0   9166.099972     2           2      3         3              3
3           [[0.0, 647.5], [0.6596, 646.8404], [1.4515, 64...    0.0  2112.107533  1464.033661   0.935408   8152.397926   0.0  1778.593705  ...   6173.303675     0.0   7050.821325     6           2      6         4              1
4           [[0.4782, 736.0218], [0.4782, 736.0218], [0.95...    0.0  2409.139601  1568.258547  30.035613  12435.782407   0.0  1835.470442  ...   6625.033120     0.0   8061.569801     6           2      1         4              2
5           [[0.9636, 890.5], [0.9636, 890.5], [1.6556, 89...    0.0  1789.038279  1165.606538  23.199695   6595.104505   0.0  1401.826389  ...   4540.830454     0.0   4463.399051     3           2      1         1              1
...                                                       ...    ...          ...          ...        ...           ...   ...          ...  ...           ...     ...           ...   ...         ...    ...       ...            ...
2653        [[1005.0357, 298.5], [1005.5179, 298.5], [1005...    0.0  1528.167373  1040.252119  71.731638   9857.117232   0.0  1133.142655  ...   5214.332627     0.0   5677.270480     6           1      2         4              2
2654        [[1006.0, 531.5], [1004.9692, 531.4692], [1004...    0.0   866.767553   579.135481   7.370484   3924.449898   0.0   698.100375  ...   1431.171097     0.0   1684.441207     1           1      2         6              3
2655        [[1005.193, 599.5], [1005.193, 599.5], [1004.5...    0.0  1534.898357   949.947653   1.008920   6614.136854   0.0  1718.979343  ...   4728.398592     0.0   5064.655399     3           2      1         1              3
2656        [[1005.233, 754.5], [1005.233, 754.5], [1004.4...    0.0  1643.330193  1080.667150  23.054348   6832.027778   0.0  1456.217874  ...   4990.973913     0.0   5253.209420     3           2      1         1              3
2657        [[1006.0, 389.5], [1005.4694, 390.0306], [1004...    0.0  2407.073093  2120.567444   2.307910  12124.994703   0.0  4122.323093  ...  11260.825212     0.0  12085.653249     2           4      1         2              3

[2657 rows x 24 columns]

>>> generate_cells(df, poly="Shape", genes=["CD11c", "CD20", ...], factors=["Mean", "Mean All", ...], ...)

where each string argument names a column in the DataFrame df to be put into the part of the JSON that roughly corresponds to the argument key. The index of this DataFrame will be cell ids, just like the above.
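A minimal sketch of how such a column-mapping API might work. The output record layout (`poly`, `genes`, `factors` per cell id) is an assumption for illustration, not the existing cells schema, and treating factors as string labels is also my assumption.

```python
import pandas as pd


def generate_cells(df, poly, genes=(), factors=()):
    """Hypothetical sketch: build one JSON-ready record per cell id."""
    if not df.index.is_unique:
        raise ValueError("cell ids (the index) must be unique")
    # Fail early if any argument names a column that is not in the DataFrame.
    requested = [poly, *genes, *factors]
    missing = [name for name in requested if name not in df.columns]
    if missing:
        raise KeyError(f"columns not found in DataFrame: {missing}")

    cells = {}
    for cell_id in df.index:
        cells[str(cell_id)] = {
            # Polygon outline as a list of [x, y] pairs.
            "poly": [[float(x), float(y)] for x, y in df.at[cell_id, poly]],
            "genes": {g: float(df.at[cell_id, g]) for g in genes},
            # Assumption: factors are categorical labels, so store them as strings.
            "factors": {f: str(df.at[cell_id, f]) for f in factors},
        }
    return cells


# Tiny stand-in for the real DataFrame shown above.
df = pd.DataFrame(
    {
        "Shape": [
            [[0.0, 100.5], [1.0232, 100.5232], [1.7536, 101.0]],
            [[0.0, 130.5], [1.0798, 130.5798]],
        ],
        "CD20": [0.0, 6.5],
        "Mean": [4, 2],
    },
    index=pd.Index([1, 2], name="id"),
)
cells = generate_cells(df, poly="Shape", genes=["CD20"], factors=["Mean"])
print(cells["1"])
```

Validating the column names up front is the main benefit of the API route over relying on a specially named DataFrame: a typo fails loudly instead of silently dropping data.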

I think Cell-Sets is going to be a little harder. @keller-mark, maybe you could add something here about what the input data could look like.

Raster

This one is tricky as well. We should probably support both TIFF and Zarr via a flag. We'll need to set up the Docker container for bioformats2raw/raw2ometiff as a dependency (which I think can be done via the setup.py file). Beyond that, the other major pain point will be input data. Are we expecting NumPy arrays? Dask arrays? Zarr stores? File paths? Perhaps all four can be possible?

generate_raster(ome_tiff="/path/to/my_file.ome.tif", output_tiff=True)
# or
generate_raster(np_array=my_image, output_zarr=True)

@manzt can probably comment on this as well. I imagine most people will input OME-TIFF to bioformats2raw, but I think we can also handle other inputs and use our custom pyramid generator or something Python-specific (in contrast to bioformats2raw) that Glencoe writes.
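To make the input question concrete, here is a sketch of only the argument handling for the two calls above. It does not run any conversion or call bioformats2raw; the returned "plan" dict and its keys are invented for illustration, and Dask arrays and Zarr stores are left out.

```python
from pathlib import Path

import numpy as np


def generate_raster(ome_tiff=None, np_array=None, output_tiff=False, output_zarr=False):
    """Hypothetical sketch: validate the arguments and describe the work to do."""
    inputs = {"ome_tiff": ome_tiff, "np_array": np_array}
    given = [name for name, value in inputs.items() if value is not None]
    if len(given) != 1:
        raise ValueError(f"pass exactly one input, got: {given}")
    if output_tiff == output_zarr:
        raise ValueError("set exactly one of output_tiff or output_zarr")

    if ome_tiff is not None:
        # A file path would be handed to an external converter.
        source = {"kind": "file", "path": Path(ome_tiff)}
    else:
        array = np.asarray(np_array)
        if array.ndim < 2:
            raise ValueError("an image array needs at least two dimensions")
        # An in-memory array would go through a Python-side pyramid generator.
        source = {"kind": "array", "shape": array.shape, "dtype": str(array.dtype)}

    return {"source": source, "output": "ome-tiff" if output_tiff else "zarr"}


my_image = np.zeros((3, 4, 5), dtype="uint16")
plan = generate_raster(np_array=my_image, output_zarr=True)
print(plan)
```

A single `source` argument with type-based dispatch would be an alternative to one keyword per input type, and might scale better if all four input kinds end up supported.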

Molecules

I think this will be relatively straightforward, like the genes data: an input DataFrame whose index is molecule names, plugged into an API, is what we will use:

>>> df
             x_um         y_um
gene                          
Gad2  1278.683956  6020.642260
Gad2  1326.970330  6023.884788
Gad2  1292.026844  6059.337093
Gad2  1300.886241  6097.786264
Gad2  1232.410068  6102.884182
...           ...          ...
Mup5  3161.427603  5192.594981
Mup5  3099.698528  5221.596008
Mup5  3084.582240  5297.234605
Mup5  3054.192051  5342.142346
Mup5  3058.963217  5348.150185

[3841412 rows x 2 columns]

>>> generate_molecules(df, x="x_um", y="y_um")
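
A minimal sketch of `generate_molecules`, assuming the output maps each molecule name to a list of [x, y] coordinate pairs (that layout is my assumption here, not a confirmed schema):

```python
import pandas as pd


def generate_molecules(df, x, y):
    """Hypothetical sketch: group coordinates by the molecule name in the index."""
    for name in (x, y):
        if name not in df.columns:
            raise KeyError(f"column not found in DataFrame: {name}")

    molecules = {}
    # level=0 groups on the index (molecule names); sort=False keeps input order.
    for molecule, group in df.groupby(level=0, sort=False):
        molecules[str(molecule)] = group[[x, y]].to_numpy(dtype=float).tolist()
    return molecules


# Tiny stand-in for the real DataFrame shown above.
df = pd.DataFrame(
    {"x_um": [1278.5, 1326.25, 3161.5], "y_um": [6020.5, 6023.75, 5192.5]},
    index=pd.Index(["Gad2", "Gad2", "Mup5"], name="gene"),
)
molecules = generate_molecules(df, x="x_um", y="y_um")
print(molecules)
```

With millions of rows, as in the real data, a plain JSON dict like this may get large, which is another argument for offering a binary output format.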

keller-mark commented 4 years ago

For genes/clusters/heatmap, we should also support conversion to Zarr, as that is what I have used for the Satija data, and we already have a "loader" for that data type in Vitessce. I would vote to no longer support the genes.json and clusters.json formats, since they do not scale well and a Zarr-based format can replace them (see https://github.com/hubmapconsortium/vitessce-data/blob/master/snakemake/satija/src/convert_h5ad_to_zarr.py#L41 ). If we want to support Arrow as well, then a loader will need to be written for it (see https://github.com/hubmapconsortium/vitessce/tree/master/src/loaders ).


For cell sets, as far as what conversions to support, my thought would be:

[Screenshot: Screen Shot 2020-07-28 at 4 35 42 PM]