Visualization gene expression by distance

giovp commented 3 years ago

Type of the feature

[ ] New function in squidpy.im?
[x] New function in squidpy.gr?
[x] New function in squidpy.pl?
[ ] Change an existing functionality, such as default behavior?
[ ] Other?

Description

Pinging @jwrth (recent) and @almaan (not so recent) following discussions on exactly the same analysis idea, which could be added to squidpy.

It can be described very clearly by Figure 3 from Alma's paper on liver zonation:

b_top: mean/sum(?) expression by distance between two discrete annotations (C/P)
b_bottom: "differential" log mean/sum(?) expression by distance between two discrete annotations (CV/PV)
d: mean/sum(?) expression by distance from a discrete annotation (essentially what we do in sq.gr.co_occurrence, but continuous)

Implementation details:

I wonder if this could be a single squidpy function with different behaviour? or should it be separate?
this function could just stay in squidpy.pl as there is not much computation going on (unless you disagree) or not?

almaan commented 3 years ago

Hello @giovp! Cool to see that you're working on adding this feature, I think a lot of people might be interested in it! The liver zonation patterns are a really good medium to illustrate this concept of "feature by distance". We are just about to submit the revisions for this paper, but the data is already available at the GitHub repo: https://github.com/almaan/ST-mLiver if you want to use it as an example.

In said repo, all the functions related to this analysis can be found in a Python package called hepaquery, though this project was initialized before I knew of squidpy so it's more built on custom data classes that I wrote from scratch, but if you want inspo, the visual.py file should contain the code you're looking for.

To perhaps add some explanation to the images:

b_top: (ignore the p's and c's, they are a bit confusing imo) the curves in the top row show the expression as a function of the distance - for a set of portal vein marker genes - to any portal vein (blue) and any central vein (red). The curves in the bottom row display the same information but for central vein marker genes. I say "any" here because each tissue section had multiple veins. Also, it's actually not the mean or sum but rather a smoothed approximation of the function; for this I used loess from LINK. I prefer to plot the true data observations in the background, but my co-author thought this looked a bit messy.
b_bottom: These are conceptually similar to the b_top ones, but rather than the distance to either central or portal veins, we used the log ratio between the two to account for the influence of either vein type. This has now been exchanged for a bivariate model, which better captures the vein synergies and is more informative, so I would not focus on these. I actually used a parametric model here (multivariate linear model y ~ b0 + b1*d_c + b2*d_p, for easy model testing) but one could also use a 2D KDE (https://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation) if a non-parametric model is desirable.
d_top/bottom: contain the same information as b_top, but just not joined into the same plot. It's sort of repetitive information, but is placed there (I believe) to put the expression-based vein type classification we do into context.
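
To make the bivariate model mentioned for b_bottom concrete, here is a minimal sketch of fitting y ~ b0 + b1*d_c + b2*d_p by ordinary least squares with NumPy. The helper name `fit_bivariate_linear` and the synthetic, noise-free data are assumptions for illustration only; this is not the code from hepaquery.

```python
import numpy as np


def fit_bivariate_linear(d_c, d_p, y):
    """Least-squares fit of y ~ b0 + b1*d_c + b2*d_p (hypothetical helper)."""
    d_c = np.asarray(d_c, dtype=float)
    d_p = np.asarray(d_p, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: intercept, distance to central vein, distance to portal vein.
    X = np.column_stack([np.ones_like(d_c), d_c, d_p])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Synthetic example (assumed values): expression rises away from central
# veins and falls away from portal veins.
rng = np.random.default_rng(0)
d_c = rng.uniform(0, 100, size=50)
d_p = rng.uniform(0, 100, size=50)
y = 1.0 + 0.02 * d_c - 0.03 * d_p
b0, b1, b2 = fit_bivariate_linear(d_c, d_p, y)
```

With real data the fit would of course not be exact, and the coefficients b1 and b2 can then be tested for significance with any standard linear-model tooling.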

My advice would perhaps be to implement a standalone function to measure the distance between every spot and a given reference. If the reference consists of multiple spots (e.g., a cluster), or several pixels (e.g., vein annotation) that aren't necessarily spatially connected, I defined the distance between a spot and the reference as the minimal distance to any spot/pixel contained within said reference. The distances are not only useful for visualization - we used them in the classifier to build "neighborhoods" around each vein. These distances could maybe go in the obs slot?
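
The distance definition above (the minimum over all spots/pixels of the reference) could be sketched roughly as follows. The name `min_distance_to_reference` is hypothetical, coordinates are assumed to be 2D tuples, and the brute-force loop is only for clarity; for large datasets a KD-tree would be the usual choice.

```python
import math


def min_distance_to_reference(spots, reference):
    """For each spot, return the Euclidean distance to the closest
    reference point (hypothetical helper, brute force)."""
    return [min(math.dist(s, r) for r in reference) for s in spots]


# Toy coordinates (assumed): two spots, and a reference made of two
# points that are not spatially connected.
spots = [(0.0, 0.0), (3.0, 4.0)]
reference = [(0.0, 1.0), (3.0, 0.0)]
distances = min_distance_to_reference(spots, reference)
```

The resulting list has one value per spot, so it could be stored as a column of the obs slot, as suggested above.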

Then, perhaps some plot functions, for easy visualization (especially if you want to apply smoothing) could be useful! Of course, these are just suggestions! But happy to discuss and contribute if there's anything I can do! :)

jwrth commented 3 years ago

Hi @almaan, first of all, great that you already published the repo for the analyses! I'm also working on spatial data from the liver and ended up with very similar types of plots to look into the gene expression along the CV/PV axis. Since I am a biologist I cannot comment much on the mathematical points, but did I understand correctly that you used the multivariate linear model to determine the position of PV/CV based on the marker gene expression? And you suggest that one could use the KDE to do the same, right?

I think in general that might be something interesting to add in squidpy. To have a set of biological structures with corresponding marker genes and find the positions of these structures based on the marker gene expression. This could be interesting also for other organs (e.g. to identify islets in pancreas/glomeruli in kidney or accumulation of cancer cells).

For visualizations I have the following suggestions:

A function similar to the sq.pl.co_occurrence function that plots just the gene expression as a function of the distance to reference spots as Alma mentioned it. (Maybe with the possibility to plot the expression of multiple genes in one plot.)
A stacked version of sq.pl.co_occurrence to display e.g. the change of cell type compositions along a distance axis. I found this informative after deconvolution. Something like this:

If I have any other ideas, I'll let you guys know :)

almaan commented 3 years ago

Hi @jwrth,

first of all great that you published already the repo on the analyses! I'm also working on spatial data from the liver and ended up with very similar type plots to look into the gene expression along the CV/PV axis.

Glad you found it useful! If you're interested in the workflow of the analysis, you can check out the notebooks that I put up there; they contain less code and more results. Also, fun to see another liver person - although I'm only responsible for designing the analysis methods and not very well read-up on the biology.

[..] but did I understand that correctly that you used the multivariate linear model to determine the position of PV/CV based on the marker gene expression?

Hmm, not quite - but I was a bit unclear so sorry for that. For the univariate case (b_top) I employed a scatterplot smoothing strategy known as loess (non-parametric) to estimate the expression levels. We also - not included in the preprint but in a revised version - constructed a bivariate model (linear, parametric) to model the influence of both portal and central veins on the expression. Also, we actually didn't use marker genes to identify the veins, but had a liver expert mark up the images with different colors (representing each vein type) which I then used as a reference to calculate these distances.
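
For readers unfamiliar with loess, the idea can be illustrated with a simplified local linear regression using tricube weights. This is a stand-in written for this thread, not the loess implementation used in the paper; the function name `local_linear_smooth` and the `frac` parameter are assumptions.

```python
def local_linear_smooth(x, y, frac=0.5):
    """Simplified loess-style smoother: at each x0, fit a weighted
    straight line to the nearest `frac` share of the points."""
    n = len(x)
    k = min(n, max(2, round(frac * n)))  # number of neighbours per fit
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1] or 1e-12  # bandwidth = k-th smallest distance
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Toy data (assumed): distances 0..9 and a perfectly linear expression trend.
xs = [float(i) for i in range(10)]
ys = [2.0 * xi + 1.0 for xi in xs]
smoothed = local_linear_smooth(xs, ys)
```

On perfectly linear data the smoother returns the input trend unchanged; on noisy expression values it yields a smooth curve over distance, with `frac` controlling how local each fit is.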

To me both suggestions 1 and 2 make sense. I'm guessing you would then use the average, rather than a smoothed version, which I think is fine but could perhaps be a bit less robust to outliers, especially at smaller distances where fewer spots are present!

jwrth commented 3 years ago

Hmm, not quite - but I was a bit unclear so sorry for that. For the univariate case (b_top) I employed a scatterplot smoothing strategy known as loess (non-parametric) to estimate the expression levels. We also - not included in the preprint but in a revised version - constructed a bivariate model (linear, parametric) to model the influence of both portal and central veins on the expression.

Thanks for the further explanation, now I got it! This helps me a lot... I'll definitely look into the loess function! :)

Also, we actually didn't use marker genes to identify the veins, but had a liver expert mark up the images with different colors (representing each vein type) which I then used as a reference to calculate these distances.

Ok, so the position of the veins was manually annotated by an expert and then you used the gene expression to classify the veins as either PV or CV? Is that correct?

I'm guessing you would then use the average perhaps, rather than a smoothed version, which I think is fine but could perhaps be a bit less robust to outliers, especially at smaller distances where fewer spots are present!

In my case I binned the distance and averaged within each bin. But I did this only because I didn't know of any other methods. 😄
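
The binning approach described here could look roughly like the sketch below. The function name `binned_mean` and the fixed `bin_width` parameter are assumptions; the original analysis may have chosen its bins differently.

```python
from collections import defaultdict


def binned_mean(distances, values, bin_width):
    """Average `values` within fixed-width distance bins.
    Returns {left edge of bin: mean value}; empty bins are omitted."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for d, v in zip(distances, values):
        b = int(d // bin_width)  # index of the bin this distance falls into
        sums[b] += v
        counts[b] += 1
    return {b * bin_width: sums[b] / counts[b] for b in sorted(sums)}


# Toy data (assumed): four spots with their distance to the reference
# and the expression of one gene.
profile = binned_mean([0.5, 1.5, 1.8, 3.2], [2.0, 4.0, 6.0, 10.0], bin_width=1.0)
```

Bins with few spots give noisy averages, which is the robustness concern raised above for small distances.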

giovp commented 2 years ago

I think a first version is doable for 1.2; we can then extend it with added functionality if needed.

giovp commented 1 year ago

this is happening #591

LLehner commented 1 year ago

This functionality has been added with #591.

LLehner commented 1 year ago

Hi @jwrth and @almaan, the squidpy plotting module now allows gene expression (or any other variable) to be plotted against the distance to a user-defined anchor point. You can also specify covariates by which the expression trends are split.

In these example images, a small pre-processed dataset (originally from Hartmann et al.) was used, which can be accessed with squidpy.

[Image: var_by_distance_single_anchor_and_gene]

[Image: var_by_distance_single_anchor_one_gene_two_categories_without_scatter]

jwrth commented 1 year ago

Great! Thank you!! What's the name of the function?

LLehner commented 1 year ago

squidpy.tl.var_by_distance() for calculating the distances to an anchor point and squidpy.pl.var_by_distance() for visualization. An example will be made available soon with the next squidpy release.

giovp commented 1 year ago

It is now in the new release: https://squidpy.readthedocs.io/en/latest/notebooks/examples/tools/compute_var_by_distance.html

scverse / squidpy

Visualization gene expression by distance #349
