scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
575 stars 152 forks source link

Concatenate enhancements #295

Closed chris-rands closed 2 years ago

chris-rands commented 4 years ago

I know that the AnnData.concatenate() docstring says "The uns, varm and obsm attributes are ignored." , but I still wondered if there is any light-weight way to do this?

My use case here is I'm running bbknn on several datasets separately and then I want to merge the neighbourhood graphs after. Thanks!

flying-sheep commented 4 years ago

Well, we’ll migrate that to .obsp very soon, and therefore we might* concatenate .obsp but definitely not stuff in .uns. We might merge .uns instead…

*might because graphs should probably be recalculated anyway: If you concatenate graphs, you get a graphs with disconnected parts, represented by a block diagonal matrix:

ivirshup commented 4 years ago

I think we'll allow merging obsp, since there are cases where it's valid, but we should make it easy to exclude items from merges.

uns is more difficult. I think it'll be difficult to tell which things are okay to merge, so I think I'd like to just drop it all/ have the user select a canonical version. Colors may be an exception to this.

flying-sheep commented 4 years ago

I think we could offer

def concatenate(
    ...,
    merge_uns: Literal[
        'if_clean',  # the default, throws an error on key collision, else just merges
        'always',  # ignores key collision, just overwrite earlier with later items
        'never',  # to avoid unwanted scanpy behavior, just return it empty
    ] = None,
):
    ...

Maybe you can come up with more.

flying-sheep commented 4 years ago

Hi @giovp, would such an option in concatenate work for your spatial use case?

Similarly: You need to concat obsm['X_spatial'], so we should just generally concatenate .obsm entries

ivirshup commented 4 years ago

@flying-sheep, re - your last comment. I'm thinking of just providing the same options as xarray.

@giovp, could you go over how you store images in the Anndata object currently?

flying-sheep commented 4 years ago

Link please

ivirshup commented 4 years ago

Here's the narrative docs, but the mainly the options for xr.concat are what I'm thinking of.

flying-sheep commented 4 years ago

Ha, that’s exactly the same idea as I had independently. Seems like indication that it’s a good one.

giovp commented 4 years ago

@ivirshup , images and scale factors are stored in Anndata as:

Both of them are dictionaries that contains other elements, e.g.

In both cases, those elements needs to be retained after concatenation (e.g. with a list)

@flying-sheep yes that would work, one thing that needs to be taken into account is the shape of X_spatial in case heterogeneous anndatas (spatial and not spatial) are concatenated. The easiest solution I believe would be to add zeros to concatenated_adata.obsm["X_spatial"] if the original anndata did not have spatial coordinates

flying-sheep commented 4 years ago

Yeah, I’d say we merge dictionaries recursively based on the selected strategy.

I’m not sure if concatenating lists in uns is a good idea, so we should rethink if adata.uns["images"]["hires"] = [foo, bar] should be legal or only adata.uns["images"]["hires"] = {"foo": foo, "bar": bar}.

The easiest solution I believe would be to add zeros to concatenated_adata.obsm["X_spatial"] if the original anndata did not have spatial coordinates

Or NaN. Or make it a masked array.

giovp commented 4 years ago

How are we gonna proceed with this? I would like to have this working rather soon so I'm happy to work on it and submit a PR but some more indications would be useful:

Thank you!

flying-sheep commented 4 years ago

@ivirshup linked xarray.concat, which has a few parameter which control how merging works in there. I think it makes sense to figure out what options they provide and use the same ones if applicable.

I’m not familiar with how “data variables”, “coordinate variables” and “non-concatenated variables” match with our stuff (I assume the latter is like .uns though), but maybe @ivirshup can help.

Can we decide on how to go for the adata.uns concatenation? Do we want a list? A dictionary?

I think we should support merging dictionaries. Depending on above-mentioned merge control parameters, we should recursively merge them until we hit some non-dict value that isn’t identical to the non-dict value at the same position. So if

adata1.uns = dict(images=dict(hires=dict(blah=image_blah)))
adata2.uns = dict(images=dict(hires=dict(geh=image_geh)))

then the merge succeeds in any case because the two non-dict elements ['images']['hires']['blah'] and ['images']['hires']['geh'] don’t collide. But if we have

adata1.uns = dict(images=dict(hires=dict(blah=image_blah)))
adata2.uns = dict(images=dict(hires=dict(blah=image_geh)))

and image_blah != image_geh, it would fail if the merge parameter isn’t e.g. 'override'.

Does that make sense?

giovp commented 4 years ago

I agree wrt merging dictionaries. I can try to implement something soon and submit a PR. wrt xarray I am not sure how you want to proceed with that. I'll wait for @ivirshup for more inputs.

flying-sheep commented 4 years ago

@ivirshup most likely meant the data_vars/coords parameters. Their docs say:

’minimal’: Only data variables in which the dimension already appears are included. ’different’: Data variables which are not equal (ignoring attributes) across all datasets are also concatenated (as well as the “minimal” ones) […] ’all’: All data variables will be concatenated. list of str: The listed data variables will be concatenated, in addition to the ‘minimal’ data variables.

Can’t say I understand the “different” strategy.

ivirshup commented 4 years ago

I'll be able to give a bit more feedback on this over the weekend. A few points for now:

giovp commented 4 years ago
  • What is type(adata.uns['images']['hires'])?

For a single adata object, the value would be an image stored as ndarray. For a concatenated adata object, the value would be another dictionary with keys corresponding to concat index (e.g. value in batch column) and values still ndarrays

  • After two objects containing images have been concatenated, how do I tell which obs go with which image?

I would just use the concat index as key for the dict. That would then be same value stored in adata.obs['batch']

giovp commented 4 years ago

wrt to concatenating uns, would something like this works?

new_adata.uns = dict()
for idx, _adata in enumerate(all_adatas):
            for k in _adata.uns.keys():
                if k not in new_adata.uns.keys():
                    new_adata.uns[k] = dict()
                for k1, v1 in _adata.uns[k].items():
                    if k1 not in new_adata.uns[k].keys():
                        new_adata.uns[k][k1] = dict()
                    new_adata.uns[k][k1][str(idx)] = v1

I though about about this @flying-sheep

adata1.uns = dict(images=dict(hires=dict(blah=image_blah)))
adata2.uns = dict(images=dict(hires=dict(blah=image_geh)))

and I don't think this will ever be the case, because blah will always be the index of the adata object. There should never be conflicting keys. The only conflict would be if two different values are named with the same key in two different adatas, but why should this be the case?

wrt to xarray.concat, @ivirshup shall we really go for this instead of masked arrays? Do you think is necessary to rewrite the whole concatenate using xarray.concat or just an additional if statement where adata.obs do not overlap but you still want them concatenated and filled them. in with e.g. nan entries?

flying-sheep commented 4 years ago

Looks good, except you should do it recursive for infinite levels, and we do empty dictionaries as {}. The dict constructor is used because it fits best when you have dict(string_key_with_underscores=value)

ivirshup commented 4 years ago

I don't think you need to rewrite this using xarray concat, we should have functionality like that eventually, but we can't rely on xarray.concat because it doesn't do sparse data. I've been attempting to basically copy the functionality, but it's been quite a bit of work and is out of scope for this problem.

I'm looking at how spatial is being stored, and think this could be rearranged to make some of these problems easier. I think this should be like:

uns = {
    "spatial": {
        "library1":
            {
                "images": {...},
                "scalefactors": {...},  # What was the plan for these with concatenation?
                "metadata": {...},  # This should include what version of visium was read
            },
        }
    }
}

When reading in a visium dataset, the library key can be set by the file.attrs["library_id"] value from the visium h5 file. This also gives us a natural batch_key value. Another advantage is that I think this will help us future proof for the next version of visium, by not assuming a shared set of metadata.

I don't think we should go with what you've outlined above. We shouldn't return a tree with a different structure than what went into it. The structure should be preserved. With the structure I've outlined, concatenation will just be adding new entries to the spatial dict.

giovp commented 4 years ago

Following theislab/scanpy#1105, this is a proposed function to concatenate uns objects in anndata.

def _concatenate_uns(
    adata_uns_list, 
    merge_uns: Literal[
        'if_clean',  # the default, throws an error on key collision, else just merges
        'always',  # ignores key collision, just overwrite earlier with later items
        'never',  # to avoid unwanted scanpy behavior, just return it empty
        ] = None,
    ):

    uns = {}
    if merge_uns != 'never' and merge_uns is not None:
        for adata_uns in adata_uns_list:
            for k in adata_uns.keys():
                if k not in uns.keys():
                    uns[k] = dict()
                if k in uns.keys() and isinstance(adata_uns[k],dict):
                    for k1 in adata_uns[k].keys():
                        if k1 not in uns[k].keys():
                            uns[k][k1] = adata_uns[k][k1]
                        elif k1 in uns[k].keys():
                            if merge_uns == "always":
                                uns[k][k1] = adata_uns[k][k1]
                elif k in uns.keys():
                    if merge_uns == "always":
                        uns[k] = adata_uns[k]

        if merge_uns=="if_clean":
            uns = { k : v for k,v in uns.items() if len(v.keys()) >= 2}

    return uns

This is how it behaves:

visium_1_uns = {"spatial":{"library1":{"images":{"hires":0,"lowres":1}, "scalefactors":{"hires_scale":0,"lowres_scale":1}}}, "pca":0}
visium_2_uns = {"spatial":{"library2":{"images":{"hires":2}, "scalefactors":{"hires_scale":2}}}, "pca":2}
chromium_1_uns = {"pca":3}
all_adata_uns = [visium_1_uns,visium_2_uns,chromium_1_uns]
_concatenate_uns(all_adata_uns, merge_uns = 'if_clean')
{'spatial': {'library1': {'images': {'hires': 0, 'lowres': 1},
   'scalefactors': {'hires_scale': 0, 'lowres_scale': 1}},
  'library2': {'images': {'hires': 2}, 'scalefactors': {'hires_scale': 2}}}}

_concatenate_uns(all_adata_uns, merge_uns = 'always')
{'spatial': {'library1': {'images': {'hires': 0, 'lowres': 1},
   'scalefactors': {'hires_scale': 0, 'lowres_scale': 1}},
  'library2': {'images': {'hires': 2}, 'scalefactors': {'hires_scale': 2}}},
 'pca': 3}

_concatenate_uns(all_adata_uns, merge_uns = 'never')
{}

The reason for just looking at the first 2 levels of the dictionary keys is because the library_id sits on the second level in spatial and that's the only key that needs to be concatenated (for now).

The merge_uns = 'always' arg is useful when we have both visium and chromium datasets

Looking forward to hear what you think @ivirshup @flying-sheep

chris-rands commented 4 years ago

without commenting on the logic, i think you can define an empty dict thorough a literal {} rather than dict() and also omit all the .keys()

ivirshup commented 4 years ago

Update on this:

What still doesn't get merged:

flying-sheep commented 4 years ago

AnnDatas along the variable axis.

Well, it’s called “concatenate”, which implies that it’s along the first (obs) axis

ivirshup commented 4 years ago

Well, it’s called “concatenate”, which implies that it’s along the first (obs) axis

Is it? I think it's meant to be along one axis. I'm thinking we should allow concatenation along var as well as obs. E.g. something like: concatenate([a.T for a in adatas]).T, but in an efficient way.