Add method to calculate embeddings for variables by distance aggregation

LLehner commented 4 months ago

Description

Adds a method to the tools module (sq.tl) that calculates embeddings of variables from their counts aggregated by distance to an anchor point.

Example usage

```python
import squidpy as sq

# Load the example dataset.
adata = sq.datasets.seqfish()

# Calculate the distance of each observation to a specified anchor point
# (e.g. a cell type or tissue location). Here we use the cell type
# "Endothelium" in the annotation column "celltype_mapped_refined".
sq.tl.var_by_distance(adata, groups="Endothelium", cluster_key="celltype_mapped_refined")

# The resulting distances are stored in adata.obsm["design_matrix"].
# Now we can calculate the embeddings.
sq.tl.var_embeddings(adata, group="Endothelium", design_matrix_key="design_matrix")
```

Note that by default the bin at distance 0, which holds the counts belonging to the anchor point itself, is excluded. This can be changed by setting include_anchor=True in sq.tl.var_embeddings().

By default 100 bins are used. The resulting embeddings are stored in adata.uns["100_bins_distance_embeddings"].
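To make the aggregation step concrete, here is a minimal NumPy sketch of binning observations by distance and summing counts per variable. It is an illustration under assumptions, not the code in this PR: the function name `aggregate_by_distance`, the equal-width bins, the use of a sum, and the rule that a distance of exactly 0 marks an anchor observation are all hypothetical choices.

```python
import numpy as np


def aggregate_by_distance(counts, distances, n_bins=100, include_anchor=False):
    """Sum counts per variable within equal-width distance bins.

    Hypothetical helper for illustration only (not part of squidpy).
    counts:    (n_obs, n_vars) array of counts.
    distances: (n_obs,) array of non-negative distances to the anchor;
               a distance of exactly 0 is assumed to mark anchor observations.
    Returns a (n_vars, n_bins) array with one distance profile per variable.
    """
    counts = np.asarray(counts, dtype=float)
    distances = np.asarray(distances, dtype=float)

    if not include_anchor:
        # Drop observations that belong to the anchor itself.
        keep = distances > 0
        counts, distances = counts[keep], distances[keep]

    # Equal-width bin edges over the observed distance range.
    edges = np.linspace(distances.min(), distances.max(), n_bins + 1)
    # Comparing against the inner edges yields bin indices 0 .. n_bins - 1.
    bin_idx = np.digitize(distances, edges[1:-1])

    # Accumulate the counts of every observation into its distance bin.
    binned = np.zeros((n_bins, counts.shape[1]))
    np.add.at(binned, bin_idx, counts)
    return binned.T
```

Each row of the returned matrix is one variable's profile over distance, which is the kind of input a dimensionality reduction such as UMAP would then embed.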

We can plot the embedding (umap) as follows:

```python
import matplotlib.pyplot as plt

embedding = adata.uns["100_bins_distance_embeddings"]
plt.scatter(embedding[0], embedding[1], c="grey")
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP', fontsize=20)
```

which results in:

TODO

[ ] Add a plotting function so this doesn't need to be done manually.
[ ] Cluster the data

codecov-commenter commented 4 months ago

Codecov Report

Attention: Patch coverage is 33.33333%, with 24 lines in your changes missing coverage. Please review.

Project coverage is 69.75%. Comparing base (df8e042) to head (8ee07ba).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #807      +/-   ##
==========================================
- Coverage   69.99%   69.75%   -0.24%
==========================================
  Files          39       40       +1
  Lines        5525     5561      +36
  Branches     1029     1037       +8
==========================================
+ Hits         3867     3879      +12
- Misses       1363     1387      +24
  Partials      295      295
```

| [Files](https://app.codecov.io/gh/scverse/squidpy/pull/807?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scverse) | Coverage Δ | |
|---|---|---|
| [src/squidpy/tl/\_var\_embeddings.py](https://app.codecov.io/gh/scverse/squidpy/pull/807?src=pr&el=tree&filepath=src%2Fsquidpy%2Ftl%2F_var_embeddings.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scverse#diff-c3JjL3NxdWlkcHkvdGwvX3Zhcl9lbWJlZGRpbmdzLnB5) | `33.33% <33.33%> (ø)` | |

giovp commented 2 months ago

Hi @LLehner, thank you for this. Would you mind elaborating a bit on when this would be used? Also, what if the embeddings are pre-calculated, or the user would like to use something other than UMAP? Should that be an option? Finally, I think a test would be required before we get this in. Thanks!
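As a rough illustration of the flexibility asked about here, the sketch below accepts either a pre-calculated embedding or any callable reducer. It is not an API in this PR: `embed_profiles`, `pca_embedding`, and the `reducer` and `precomputed` parameters are hypothetical names, and PCA via SVD stands in as one example of "something other than UMAP".

```python
import numpy as np


def pca_embedding(profiles, n_components=2):
    """Project variable profiles onto their leading principal components."""
    X = np.asarray(profiles, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)  # center each distance bin
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def embed_profiles(profiles, reducer=pca_embedding, precomputed=None):
    """Return an embedding with one row per variable.

    Hypothetical helper: uses `precomputed` if given, otherwise applies
    `reducer`, which can be any callable mapping (n_vars, n_bins) to
    (n_vars, n_components).
    """
    profiles = np.asarray(profiles, dtype=float)
    if precomputed is not None:
        embedding = np.asarray(precomputed, dtype=float)
        if embedding.shape[0] != profiles.shape[0]:
            raise ValueError("precomputed embedding must have one row per variable")
        return embedding
    return reducer(profiles)
```

A UMAP-based reducer could be passed in the same way, for example a small wrapper around a fitted UMAP object's `fit_transform`, without changing the calling code.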

timtreis commented 2 months ago

Hey @giovp, this feature came out of a discussion with @maiiashulman. We ran into a situation in which the "literature-curated" signature for hypoxia was either 20 or 4000 genes, the latter obviously being useless. So we wondered which other genes might show the same spatially variable pattern as a function of distance to a certain cell type (e.g. epithelial). This is essentially a graphical method to check whether a given set of genes (e.g. the 20-gene signature) even varies in a similar pattern.

But I agree with your points; if we see that it's actually doing something useful, we should make it a bit more flexible.
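The use case described above, finding genes whose distance profile resembles that of a small signature, can also be sketched numerically. This is a complementary illustration, not the graphical method in this PR: the function `rank_by_signature_similarity` and its inputs are hypothetical, and Pearson correlation against the mean signature profile is just one possible similarity measure.

```python
import numpy as np


def rank_by_signature_similarity(profiles, var_names, signature):
    """Rank non-signature genes by correlation with the signature's profile.

    Hypothetical helper for illustration only.
    profiles:  (n_vars, n_bins) array, one distance profile per gene.
    var_names: gene names matching the rows of `profiles`.
    signature: gene names that define the reference pattern.
    Returns a list of (gene, correlation) sorted from most to least similar.
    """
    profiles = np.asarray(profiles, dtype=float)
    names = list(var_names)
    sig_idx = [names.index(g) for g in signature]

    # Z-score every profile across bins so only the shape matters.
    centered = profiles - profiles.mean(axis=1, keepdims=True)
    std = profiles.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # flat profiles become all zeros (correlation 0)
    z = centered / std

    # Reference pattern: mean z-scored profile of the signature genes.
    reference = z[sig_idx].mean(axis=0)
    reference = reference - reference.mean()
    ref_norm = np.linalg.norm(reference)
    if ref_norm == 0:
        corr = np.zeros(len(names))
    else:
        # Pearson correlation; each z-scored row has norm sqrt(n_bins).
        corr = z @ reference / (np.sqrt(profiles.shape[1]) * ref_norm)

    order = np.argsort(-corr)
    sig = set(signature)
    return [(names[i], float(corr[i])) for i in order if names[i] not in sig]
```

Genes near the top of the returned list would be candidates for extending the signature, while a signature whose own genes correlate poorly with each other would suggest that it does not vary coherently with distance.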
