scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Visualise/compare Anndata object structure #671

Open chris-rands opened 2 years ago

chris-rands commented 2 years ago

With large Anndata objects, I sometimes want to check which attributes are present. The print/repr call does this okay, but it becomes hard to read for highly populated objects, and it is not possible to compare two objects' attributes programmatically.

Currently my workaround is a function that outputs a dict, which can then be serialized to JSON, printed, etc., but I don't know if there is a better option. Could such a function be built into Anndata? For example:

import scanpy as sc
import pprint

def repr_dict(adata):
    """Return a dict mapping each populated attribute to its size or its keys."""
    d = {}
    for attr in (
        "n_obs",
        "n_vars",
        "obs",
        "var",
        "uns",
        "obsm",
        "varm",
        "layers",
        "obsp",
        "varp",
    ):
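        got_attr = getattr(adata, attr)
        if isinstance(got_attr, int):
            # n_obs and n_vars are plain integers
            d[attr] = got_attr
        else:
            # obs/var are DataFrames, the rest are mappings; all expose .keys()
            keys = list(got_attr.keys())
            if keys:  # skip empty attributes, as the repr does
                d[attr] = keys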
    return d

adata = sc.datasets.pbmc68k_reduced()

print(adata)
pprint.pprint(repr_dict(adata))

Outputs:

AnnData object with n_obs × n_vars = 700 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
{'n_obs': 700,
 'n_vars': 765,
 'obs': ['bulk_labels',
         'n_genes',
         'percent_mito',
         'n_counts',
         'S_score',
         'G2M_score',
         'phase',
         'louvain'],
 'obsm': ['X_pca', 'X_umap'],
 'obsp': ['distances', 'connectivities'],
 'uns': ['bulk_labels_colors',
         'louvain',
         'louvain_colors',
         'neighbors',
         'pca',
         'rank_genes_groups'],
 'var': ['n_counts',
         'means',
         'dispersions',
         'dispersions_norm',
         'highly_variable'],
 'varm': ['PCs']}

ivirshup commented 2 years ago

For viewing the object, this has been discussed a bit before in #346 and #521. I think this would be quite useful.

Currently, I think the best way forward here is just to steal how mudata does this. We do have permission from @gtca for this. I just haven't gotten around to it.


It could also be nice to compare anndata structures. I'm wondering whether it would be useful to get "diffed" structures as well (otherwise this is a bit like adata1 == adata2, #644). Some of the code behind the merging functionality in merge.py could be reused here.

ivirshup commented 2 years ago

I've opened #675 specifically for the visualization part.

chris-rands commented 2 years ago

Ok thanks, I didn't see those previous threads. Yes I think there are two different points. First, the viz part, which you've already covered (the mudata way looks good to me).

Second, programmatically comparing the anndata object structures. If there is a dict or similar holding the attributes, then a simple "diff" implementation could compare the dicts via set operations, or stringify them and compare the strings using difflib.
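To make the two options concrete, here is a minimal sketch that works on plain dicts shaped like the output of repr_dict above. The helper names diff_structure and text_diff, and the two sample dicts, are made up for illustration; none of this is part of anndata.

```python
import difflib
import pprint


def diff_structure(left, right):
    """Set-based diff of two structure dicts (attribute -> int or list of keys).

    Hypothetical helper: returns only the attributes that differ.
    """
    diff = {}
    for attr in sorted(set(left) | set(right)):
        a, b = left.get(attr), right.get(attr)
        if isinstance(a, int) or isinstance(b, int):
            # Scalar attributes such as n_obs / n_vars: compare directly
            if a != b:
                diff[attr] = {"left": a, "right": b}
            continue
        # Key lists: a missing attribute counts as an empty set of keys
        a_keys, b_keys = set(a or ()), set(b or ())
        only_left = sorted(a_keys - b_keys)
        only_right = sorted(b_keys - a_keys)
        if only_left or only_right:
            diff[attr] = {"only_left": only_left, "only_right": only_right}
    return diff


def text_diff(left, right):
    """Stringify both dicts with pprint and compare the lines with difflib."""
    a = pprint.pformat(left, width=60).splitlines()
    b = pprint.pformat(right, width=60).splitlines()
    return "\n".join(difflib.unified_diff(a, b, "left", "right", lineterm=""))


# Illustrative inputs only, loosely based on the example output above
struct_a = {"n_obs": 700, "n_vars": 765, "obsm": ["X_pca", "X_umap"], "varm": ["PCs"]}
struct_b = {"n_obs": 700, "n_vars": 765, "obsm": ["X_pca"], "layers": ["counts"]}

structure_diff = diff_structure(struct_a, struct_b)
pprint.pprint(structure_diff)
print(text_diff(struct_a, struct_b))
```

The set-based version gives a result that is easy to inspect in code, while the difflib version is closer to what one would read in a terminal. Both ignore the order of keys within an attribute only in the set-based case; the text diff is order-sensitive.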

ivirshup commented 2 years ago

Would you like to compare just structures or values as well?

For values, I think this gets a little complicated due to the behavior of broadcasting and arrays. Our internal assert_equal function contains much of the logic required for these comparisons, but it can be quite slow.
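As a rough illustration of why value comparison is harder than it looks, the sketch below shows two NumPy pitfalls: == broadcasts arrays of different shapes, and NaN never equals itself. The values_equal helper is hypothetical and much simpler than anndata's internal assert_equal; it only handles dense NumPy arrays.

```python
import numpy as np


def values_equal(a, b):
    """Strict comparison: same shape and same values, with NaNs treated as equal.

    Hypothetical helper for dense arrays only.
    """
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        # Refuse to broadcast: different shapes are never equal
        return False
    if a.dtype.kind in "fc" and b.dtype.kind in "fc":
        # Float or complex data: let NaN match NaN
        return bool(np.array_equal(a, b, equal_nan=True))
    return bool(np.array_equal(a, b))


col = np.zeros((3, 1))
row = np.zeros((1, 3))

# Pitfall 1: == broadcasts (3, 1) against (1, 3) into a (3, 3) result
broadcast_result = col == row
print(broadcast_result.shape, broadcast_result.all())
print(values_equal(col, row))

# Pitfall 2: NaN != NaN, so an array does not compare equal to its own copy
nan_arr = np.array([1.0, np.nan])
print((nan_arr == nan_arr).all())
print(values_equal(nan_arr, nan_arr.copy()))
```

A real implementation would also need to handle sparse matrices, DataFrames, and nested uns entries, which is presumably part of why a thorough comparison can be slow.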