Closed chris-rands closed 2 years ago
Well, we’ll migrate that to .obsp
very soon, and therefore we might* concatenate .obsp
but definitely not stuff in .uns
. We might merge .uns
instead…
*might because graphs should probably be recalculated anyway: If you concatenate graphs, you get a graphs with disconnected parts, represented by a block diagonal matrix:
I think we'll allow merging obsp
, since there are cases where it's valid, but we should make it easy to exclude items from merges.
uns
is more difficult. I think it'll be difficult to tell which things are okay to merge, so I think I'd like to just drop it all/ have the user select a canonical version. Colors may be an exception to this.
I think we could offer
def concatenate(
...,
merge_uns: Literal[
'if_clean', # the default, throws an error on key collision, else just merges
'always', # ignores key collision, just overwrite earlier with later items
'never', # to avoid unwanted scanpy behavior, just return it empty
] = None,
):
...
Maybe you can come up with more.
Hi @giovp, would such an option in concatenate work for your spatial use case?
Similarly: You need to concat obsm['X_spatial']
, so we should just generally concatenate .obsm
entries
@flying-sheep, re - your last comment. I'm thinking of just providing the same options as xarray.
@giovp, could you go over how you store images in the Anndata object currently?
Link please
Here's the narrative docs, but the mainly the options for xr.concat
are what I'm thinking of.
Ha, that’s exactly the same idea as I had independently. Seems like indication that it’s a good one.
@ivirshup , images and scale factors are stored in Anndata as:
adata.uns['images']
adata.uns['scalefactors']
. Both of them are dictionaries that contains other elements, e.g.
adata.uns['images']['hires']
for high resolution images (stored as numpy arrays)adata.uns['scalefactors']['spot_diameter_fullres']
for spot diamteres (stored as float)In both cases, those elements needs to be retained after concatenation (e.g. with a list)
@flying-sheep yes that would work, one thing that needs to be taken into account is the shape of X_spatial
in case heterogeneous anndatas (spatial and not spatial) are concatenated. The easiest solution I believe would be to add zeros to concatenated_adata.obsm["X_spatial"]
if the original anndata did not have spatial coordinates
Yeah, I’d say we merge dictionaries recursively based on the selected strategy.
I’m not sure if concatenating lists in uns
is a good idea, so we should rethink if adata.uns["images"]["hires"] = [foo, bar]
should be legal or only adata.uns["images"]["hires"] = {"foo": foo, "bar": bar}
.
The easiest solution I believe would be to add zeros to
concatenated_adata.obsm["X_spatial"]
if the original anndata did not have spatial coordinates
Or NaN
. Or make it a masked array.
How are we gonna proceed with this? I would like to have this working rather soon so I'm happy to work on it and submit a PR but some more indications would be useful:
Thank you!
@ivirshup linked xarray.concat
, which has a few parameter which control how merging works in there. I think it makes sense to figure out what options they provide and use the same ones if applicable.
I’m not familiar with how “data variables”, “coordinate variables” and “non-concatenated variables” match with our stuff (I assume the latter is like .uns though), but maybe @ivirshup can help.
Can we decide on how to go for the adata.uns concatenation? Do we want a list? A dictionary?
I think we should support merging dictionaries. Depending on above-mentioned merge control parameters, we should recursively merge them until we hit some non-dict value that isn’t identical to the non-dict value at the same position. So if
adata1.uns = dict(images=dict(hires=dict(blah=image_blah)))
adata2.uns = dict(images=dict(hires=dict(geh=image_geh)))
then the merge succeeds in any case because the two non-dict elements ['images']['hires']['blah']
and ['images']['hires']['geh']
don’t collide. But if we have
adata1.uns = dict(images=dict(hires=dict(blah=image_blah)))
adata2.uns = dict(images=dict(hires=dict(blah=image_geh)))
and image_blah != image_geh
, it would fail if the merge parameter isn’t e.g. 'override'
.
Does that make sense?
I agree wrt merging dictionaries. I can try to implement something soon and submit a PR. wrt xarray I am not sure how you want to proceed with that. I'll wait for @ivirshup for more inputs.
@ivirshup most likely meant the data_vars/coords parameters. Their docs say:
’minimal’: Only data variables in which the dimension already appears are included. ’different’: Data variables which are not equal (ignoring attributes) across all datasets are also concatenated (as well as the “minimal” ones) […] ’all’: All data variables will be concatenated. list of str: The listed data variables will be concatenated, in addition to the ‘minimal’ data variables.
Can’t say I understand the “different” strategy.
I'll be able to give a bit more feedback on this over the weekend. A few points for now:
different
: If you split an object which had PCA computed on it, all the variable loading will be the same. You don't want to concatenate those, but you might want to concatenate other values in varm
that were added to each subset.type(adata.uns['images']['hires'])
?obs
go with which image?
- What is
type(adata.uns['images']['hires'])
?
For a single adata object, the value would be an image stored as ndarray. For a concatenated adata object, the value would be another dictionary with keys corresponding to concat index (e.g. value in batch column) and values still ndarrays
- After two objects containing images have been concatenated, how do I tell which
obs
go with which image?
I would just use the concat index as key for the dict. That would then be same value stored in adata.obs['batch']
wrt to concatenating uns
, would something like this works?
new_adata.uns = dict()
for idx, _adata in enumerate(all_adatas):
for k in _adata.uns.keys():
if k not in new_adata.uns.keys():
new_adata.uns[k] = dict()
for k1, v1 in _adata.uns[k].items():
if k1 not in new_adata.uns[k].keys():
new_adata.uns[k][k1] = dict()
new_adata.uns[k][k1][str(idx)] = v1
I though about about this @flying-sheep
adata1.uns = dict(images=dict(hires=dict(blah=image_blah)))
adata2.uns = dict(images=dict(hires=dict(blah=image_geh)))
and I don't think this will ever be the case, because blah
will always be the index of the adata object. There should never be conflicting keys. The only conflict would be if two different values are named with the same key in two different adatas, but why should this be the case?
wrt to xarray.concat
, @ivirshup shall we really go for this instead of masked arrays? Do you think is necessary to rewrite the whole concatenate using xarray.concat
or just an additional if statement where adata.obs
do not overlap but you still want them concatenated and filled them. in with e.g. nan
entries?
Looks good, except you should do it recursive for infinite levels, and we do empty dictionaries as {}
. The dict constructor is used because it fits best when you have dict(string_key_with_underscores=value)
I don't think you need to rewrite this using xarray concat, we should have functionality like that eventually, but we can't rely on xarray.concat because it doesn't do sparse data. I've been attempting to basically copy the functionality, but it's been quite a bit of work and is out of scope for this problem.
I'm looking at how spatial is being stored, and think this could be rearranged to make some of these problems easier. I think this should be like:
uns = {
"spatial": {
"library1":
{
"images": {...},
"scalefactors": {...}, # What was the plan for these with concatenation?
"metadata": {...}, # This should include what version of visium was read
},
}
}
}
When reading in a visium dataset, the library key can be set by the file.attrs["library_id"]
value from the visium h5 file. This also gives us a natural batch_key
value. Another advantage is that I think this will help us future proof for the next version of visium, by not assuming a shared set of metadata.
I don't think we should go with what you've outlined above. We shouldn't return a tree with a different structure than what went into it. The structure should be preserved. With the structure I've outlined, concatenation will just be adding new entries to the spatial dict.
Following theislab/scanpy#1105, this is a proposed function to concatenate uns
objects in anndata.
def _concatenate_uns(
adata_uns_list,
merge_uns: Literal[
'if_clean', # the default, throws an error on key collision, else just merges
'always', # ignores key collision, just overwrite earlier with later items
'never', # to avoid unwanted scanpy behavior, just return it empty
] = None,
):
uns = {}
if merge_uns != 'never' and merge_uns is not None:
for adata_uns in adata_uns_list:
for k in adata_uns.keys():
if k not in uns.keys():
uns[k] = dict()
if k in uns.keys() and isinstance(adata_uns[k],dict):
for k1 in adata_uns[k].keys():
if k1 not in uns[k].keys():
uns[k][k1] = adata_uns[k][k1]
elif k1 in uns[k].keys():
if merge_uns == "always":
uns[k][k1] = adata_uns[k][k1]
elif k in uns.keys():
if merge_uns == "always":
uns[k] = adata_uns[k]
if merge_uns=="if_clean":
uns = { k : v for k,v in uns.items() if len(v.keys()) >= 2}
return uns
This is how it behaves:
visium_1_uns = {"spatial":{"library1":{"images":{"hires":0,"lowres":1}, "scalefactors":{"hires_scale":0,"lowres_scale":1}}}, "pca":0}
visium_2_uns = {"spatial":{"library2":{"images":{"hires":2}, "scalefactors":{"hires_scale":2}}}, "pca":2}
chromium_1_uns = {"pca":3}
all_adata_uns = [visium_1_uns,visium_2_uns,chromium_1_uns]
_concatenate_uns(all_adata_uns, merge_uns = 'if_clean')
{'spatial': {'library1': {'images': {'hires': 0, 'lowres': 1},
'scalefactors': {'hires_scale': 0, 'lowres_scale': 1}},
'library2': {'images': {'hires': 2}, 'scalefactors': {'hires_scale': 2}}}}
_concatenate_uns(all_adata_uns, merge_uns = 'always')
{'spatial': {'library1': {'images': {'hires': 0, 'lowres': 1},
'scalefactors': {'hires_scale': 0, 'lowres_scale': 1}},
'library2': {'images': {'hires': 2}, 'scalefactors': {'hires_scale': 2}}},
'pca': 3}
_concatenate_uns(all_adata_uns, merge_uns = 'never')
{}
The reason for just looking at the first 2 levels of the dictionary keys is because the library_id
sits on the second level in spatial
and that's the only key that needs to be concatenated (for now).
The merge_uns = 'always'
arg is useful when we have both visium and chromium datasets
Looking forward to hear what you think @ivirshup @flying-sheep
without commenting on the logic, i think you can define an empty dict thorough a literal {}
rather than dict()
and also omit all the .keys()
Update on this:
uns
can now be merged with a few different strategiesobsm
entries now respect the join
argumentlayers
now can be merged with join="outer"
What still doesn't get merged:
obsp
varp
varm
AnnData
s along the variable axis.
AnnData
s along the variable axis.
Well, it’s called “concatenate”, which implies that it’s along the first (obs) axis
Well, it’s called “concatenate”, which implies that it’s along the first (obs) axis
Is it? I think it's meant to be along one axis. I'm thinking we should allow concatenation along var
as well as obs
. E.g. something like: concatenate([a.T for a in adatas]).T
, but in an efficient way.
I know that the
AnnData.concatenate()
docstring says "The uns, varm and obsm attributes are ignored." , but I still wondered if there is any light-weight way to do this?My use case here is I'm running bbknn on several datasets separately and then I want to merge the neighbourhood graphs after. Thanks!