scverse / squidpy

Spatial Single Cell Analysis in Python
https://squidpy.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Multiple images integration #332

Closed YubinXie closed 2 years ago

YubinXie commented 3 years ago

Type of the feature

Description

With scRNA data, multiple datasets can be analyzed and visualized at the same time; this is not yet the case for imaging data. Imaging studies usually have multiple images from multiple patients/mice, and there can be multiple duplicates for one case. It would be nice if squidpy could account for these multiple FoVs in feature enrichment and spatial analysis.

mezwick commented 3 years ago

I would add my support for this request.

So far I work around the issue by merging segmented single-cell data from different images for grouped analysis, splitting into per-image adata subsets for individual image analysis (with cluster annotations from the large group analysis), then re-merging to collate the separate spatial analyses.

hspitzer commented 3 years ago

Hi @YubinXie and @mezwick Thanks for your comments! I agree that it is important to consider analysing multiple images at once.

We have just finished implementing first support of image z-stacks in the ImageContainer (#329, now merged to dev). This means that images can now have dimensions x,y,z,channels. This is particularly useful if you have registered images that you would like to e.g. interactively visualise (which you can now do using Napari). Processing and feature extraction functions have been updated to also consider the additional z dimension. As of now, we are limiting the feature extraction functions to only work on one z dimension at a time, but it is quite easy to extend this with a custom feature extraction function to also use multiple z dimensions. We are now working on updating our tutorials to show how to work with these new changes (see also here: https://github.com/theislab/squidpy_notebooks/pull/64). Please let me know if this z-stack extension already helps with your particular data, or if you need additional functionalities! As we don't have a lot of 3D / z-stack spatial datasets that contain images, we are keen to hear from others what their needs are!

For spatial analysis using the spatial graph; if you have true 3D data (e.g. consecutive tissue sections), the graph functions should be able to work with 3D coordinates in adata.obsm['spatial']. Again, we will be adding an example for this very soon.
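To illustrate why a neighbour graph carries over to 3D coordinates, here is a minimal sketch using only NumPy. It is not squidpy's implementation; the function name `radius_graph`, the radius, and the toy coordinates (cells on consecutive sections spaced 10 units apart in z) are assumptions made for the example.

```python
import numpy as np

def radius_graph(coords, radius):
    """Boolean adjacency matrix connecting points closer than `radius`.

    `coords` has shape (n_obs, n_dims); nothing here depends on n_dims,
    so 2D and 3D coordinates are handled the same way.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances, shape (n_obs, n_obs).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    adj = dist <= radius
    np.fill_diagonal(adj, False)  # a cell is not its own neighbour
    return adj

# Hypothetical cells: three stacked on consecutive sections, one far away.
coords_3d = np.array([
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 10.0],
    [0.0, 0.0, 20.0],
    [50.0, 0.0, 0.0],
])
adj = radius_graph(coords_3d, radius=12.0)
```

With these toy values, cells on adjacent sections are connected across z, while the distant cell has no neighbours.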

mezwick commented 3 years ago

@hspitzer thanks for the update!

I'm working mainly with imaging mass cytometry type data, where we usually have 20-50 channels per image depending on the experiment, so this should be useful there. As an addition, it would be useful to have a way to refer to these channels by channel names (for instance the isotope linked to that channel, or the marker the isotope-tag in that channel is targeted to). I'd imagine channel_names might even be stored in adata.var[channel_name]. I usually set this up for my own analysis with an xarray (perhaps there lies the answer).

A larger suggestion would be the 'multiple images' (and what i think @YubinXie was also getting at) mentioned above. For IMC and similar high plex studies where many individuals/conditions are imaged, we end up with multiple images from which we would segment cells before pooling to an adata object for 'single-cell' analysis on the whole cohort. The way squidpy is set-up at the moment (maybe i'm mistaken), it seems only a single image stack can be stored in spatial. It would be useful to store all images in a study (perhaps prohibitively large for some studies though). I don't think the answer to this problem is to add a new dimension as rarely do all images have the same x, y or even c dimensions. Perhaps they could be stored within spatial under some image_name_id type set up where that name is also matched to events in adata.obs['image_name_id'] . In that way, one could do things like refer to individual images to identify neighbours in a network, but across all your images in one go.

As i said, i'm semi-achieving this:

> by merging segmented single-cell data from different images for grouped analysis, splitting into per-image adata subsets for individual image analysis (with cluster annotations from the large group analysis), then re-merging to collate the separate spatial analyses

But it would be nice to have a way to store multiple images within a single squidpy adata object, and refer to them by some id.
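A minimal sketch of the "refer to images by id" idea, assuming plain NumPy arrays and a pandas table stand in for the images and for adata.obs. The image ids, shapes, and the helper `cells_and_image` are made up for illustration; this is not an existing squidpy data structure.

```python
import numpy as np
import pandas as pd

# Hypothetical images with different (y, x, channels) shapes, keyed by image id.
images = {
    "patient1_roi1": np.zeros((40, 50, 30), dtype=np.uint8),
    "patient1_roi2": np.zeros((35, 42, 30), dtype=np.uint8),
    "patient2_roi1": np.zeros((60, 60, 28), dtype=np.uint8),
}

# Per-cell table, analogous to adata.obs plus the spatial coordinates.
obs = pd.DataFrame({
    "image_name_id": ["patient1_roi1", "patient1_roi1", "patient1_roi2", "patient2_roi1"],
    "x": [10.5, 45.0, 33.0, 59.0],
    "y": [20.0, 38.0, 12.5, 5.0],
})

def cells_and_image(image_id):
    """Return the cells that belong to one image together with that image."""
    cells = obs[obs["image_name_id"] == image_id]
    return cells, images[image_id]

# Sanity check: every id used by the cells should have a matching image.
missing = set(obs["image_name_id"]) - set(images)

cells, img = cells_and_image("patient1_roi1")
```

Because the images live in a dictionary rather than a stacked array, they do not need to share x, y or channel dimensions.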

hspitzer commented 3 years ago

Thanks for the elaborations @mezwick!

> I'm working mainly with imaging mass cytometry type data, where we usually have 20-50 channels per image depending on the experiment, so this should be useful there. As an addition, it would be useful to have a way to refer to these channels by channel names (for instance the isotope linked to that channel, or the marker the isotope-tag in that channel is targeted to). I'd imagine channel_names might even be stored in adata.var[channel_name]. I usually set this up for my own analysis with an xarray (perhaps there lies the answer).

Yes, when using adata, storing channel_name in adata.var is the preferred way of doing things. #343 proposes a similar scheme for the ImageContainer - let's continue this discussion in that issue.

> A larger suggestion would be the 'multiple images' (and what i think @YubinXie was also getting at) mentioned above. For IMC and similar high plex studies where many individuals/conditions are imaged, we end up with multiple images from which we would segment cells before pooling to an adata object for 'single-cell' analysis on the whole cohort. The way squidpy is set-up at the moment (maybe i'm mistaken), it seems only a single image stack can be stored in spatial. It would be useful to store all images in a study (perhaps prohibitively large for some studies though). I don't think the answer to this problem is to add a new dimension as rarely do all images have the same x, y or even c dimensions. Perhaps they could be stored within spatial under some image_name_id type set up where that name is also matched to events in adata.obs['image_name_id'] . In that way, one could do things like refer to individual images to identify neighbours in a network, but across all your images in one go.

Ok, got it! Multiple images here means several images whose x and y coordinates are unrelated. We have designed the ImageContainer to only contain images with the same x,y coordinates. For utilising multiple images, we’d propose to use multiple ImageContainer objects. Theoretically, it should be no problem to extract “single-cell” measures from the images into the same adata object. For the z-stack extension I mentioned above, we are storing a library_id column in adata.obs that allows relating observations (+ spatial coordinates) to specific images. This is essentially what you describe as well with image_name_id.

In your case, however, things might be a bit tricky, as the spatial coordinates from different images can’t be related to each other - meaning that you’d always have to subset to only one library_id before calculating spatial statistics. It might be possible to make this a bit easier, by adding a library_id argument to the spatial graph function from Squidpy, which would subset the adata automatically, and save you the subsetting & merging that you currently do. Pinging @giovp here, as I am not sure how feasible / sensible such an extension is.
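A rough sketch of what such automatic subsetting could look like, written with NumPy only. The function names (`knn_indices`, `neighbors_per_library`) and the random toy data are assumptions for illustration and do not reflect how squidpy implements its graph functions. Neighbours are searched within each library_id separately and written into one shared adjacency matrix, so cells from different images are never connected.

```python
import numpy as np

def knn_indices(coords, n_neighs):
    """Indices of the n_neighs nearest other points for every point."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude self-matches
    return np.argsort(dist, axis=1)[:, :n_neighs]

def neighbors_per_library(coords, library_ids, n_neighs=4):
    """Symmetric kNN adjacency built separately for each library_id."""
    n_obs = coords.shape[0]
    adj = np.zeros((n_obs, n_obs), dtype=bool)
    for lib in np.unique(library_ids):
        idx = np.flatnonzero(library_ids == lib)  # global indices of this image
        k = min(n_neighs, len(idx) - 1)           # small images have fewer neighbours
        local = knn_indices(coords[idx], k)       # indices within the subset
        rows = np.repeat(idx, k)
        cols = idx[local.ravel()]                 # map back to global indices
        adj[rows, cols] = True
    return adj | adj.T

# Hypothetical data: two images whose coordinate ranges overlap.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(40, 2))
library_ids = np.array(["img_1"] * 25 + ["img_2"] * 15)

adj = neighbors_per_library(coords, library_ids, n_neighs=4)
cross = adj[np.ix_(library_ids == "img_1", library_ids == "img_2")]
```

The `cross` block holds all potential edges between the two images and stays empty, even though the raw coordinates of the two images overlap.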

YubinXie commented 3 years ago

Hi everyone! Thanks @mezwick for adding more explanation here. I am also analyzing imaging mass cytometry type data, and usually we have multiple FoVs from the same patient and from different patients. The analysis is usually done on multiple images, as a single FoV usually does not provide enough information. I believe that in general, in the context of most scientific papers, the need to merge multiple conditions/batches for analysis is huge. It would be great to have a data system for imaging built for this. Thanks!

giovp commented 3 years ago

hi everyone, sorry for jumping late on this, very interesting discussion!

I'll give my take on these two bits (imho related) @YubinXie @mezwick, which I believe are also somewhat related to #318.

> It might be possible to make this a bit easier, by adding a library_id argument to the spatial graph function from Squidpy, which would subset the adata automatically, and save you the subsetting & merging that you currently do. Pinging @giovp here, as I am not sure how feasible / sensible such an extension is.

> The analysis is usually done on multiple images, as a single FoV usually does not provide enough information. I believe that in general, in the context of most scientific papers, the need to merge multiple conditions/batches for analysis is huge. It would be great to have a data system for imaging built for this. Thanks!

As @hspitzer mentioned, so far the z-stack is only for images with the same y,x dimensions; for images with different y,x(,c) dimensions, the best option would be a list of ImageContainer objects.

For the spatial graph functions (everything in squidpy.gr) this is not the case: we treat "images" as "batches", so they can be stored in the same adata object. Note that the low-res images (to be visualized statically with matplotlib) can also be stored in the same adata in adata.uns["spatial"], e.g.

adata.uns["spatial"]["img_1"] = ...
adata.uns["spatial"]["img_2"] = ...
...

However, I must say that the visualization would not work as smoothly: to visualize only one image you'd have to do this (note to self to modify this behaviour in scanpy).

sc.pl.spatial(adata[adata.obs.library_id=="img_1"], color="clusters", library_id = "img_1")

For the analysis side (again, everything in squidpy.gr), the only solution for now is to iterate over the library ids and store the results outside of anndata, see https://github.com/theislab/squidpy/issues/318#issuecomment-824319306

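import squidpy as sq

nhood_enrich_results = []
for batch in adata.obs.batch.unique():
    # subset to the cells of a single batch (one image)
    adata_copy = adata[adata.obs.batch == ... ]
    sq.gr.spatial_neighbors(adata_copy)
    res = sq.gr.nhood_enrichment(adata_copy, ..., copy=True)
    # collect the per-batch results outside of anndata
    nhood_enrich_results.append(res)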

In principle, we could create a class/method that wraps this behaviour, i.e. one that takes as input an AnnData and some arguments, such as the analyses to be performed and the analysis-specific args, and returns aggregated results. Something like:

results = sq.gr.joint_spatial_analysis(adata, library_id = "library_id", tasks = [partial(sq.gr.spatial_neighbors), partial(sq.gr.nhood_enrichment, "clusters")], args)
results.shape
# (n_clusters, n_clusters, n_library_id)

and then users could decide how to aggregate them. But I am not entirely sold that it is really necessary. I wonder if maybe a tutorial showcasing this usage would be enough? Really curious to hear your input.
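For the sake of discussion, the wrapper idea can be mocked up without squidpy. In this sketch, `per_library_analysis`, `radius_graph` and `cluster_pair_counts` are hypothetical helpers, and the edge counts between clusters are a simplified stand-in for a neighbourhood enrichment score (there is no permutation test). The point is only the shape of the aggregated result, (n_clusters, n_clusters, n_library_id).

```python
from functools import partial

import numpy as np

def radius_graph(coords, radius):
    """Boolean adjacency connecting points closer than `radius`."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    adj = dist <= radius
    np.fill_diagonal(adj, False)
    return adj

def cluster_pair_counts(adj, clusters, n_clusters):
    """Count graph edges between every pair of clusters."""
    onehot = np.eye(n_clusters, dtype=int)[clusters]  # (n_obs, n_clusters)
    return onehot.T @ adj.astype(int) @ onehot

def per_library_analysis(coords, clusters, library_ids, graph_fn, stat_fn):
    """Run graph_fn then stat_fn on each library and stack the results."""
    results = []
    for lib in np.unique(library_ids):
        mask = library_ids == lib
        adj = graph_fn(coords[mask])
        results.append(stat_fn(adj, clusters[mask]))
    return np.stack(results, axis=-1)

# Hypothetical toy data: two images with three cells each.
coords = np.array([[0, 0], [1, 0], [5, 5], [0, 0], [0, 1], [0, 2]], dtype=float)
clusters = np.array([0, 1, 1, 0, 0, 1])
library_ids = np.array(["img_1"] * 3 + ["img_2"] * 3)

results = per_library_analysis(
    coords,
    clusters,
    library_ids,
    graph_fn=partial(radius_graph, radius=1.5),
    stat_fn=partial(cluster_pair_counts, n_clusters=2),
)
```

Each slice `results[:, :, i]` is the cluster-by-cluster matrix for one image, so aggregating across images (mean, sum, per-condition comparison) is left to the user.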

YubinXie commented 3 years ago

Hi @giovp Thanks for the thoughts! If I think about the ideal situation, I would have batch as a required column in obs for spatial data. Then, when we plot images or do analysis, users would have the option to work on all images or on one given batch. This is consistent with most mainstream count matrices for multi-image data, where we usually pool everything together in one count matrix with the batch annotated.

To link this to adata, I would assume we can make the batch info required in adata whenever it holds spatial info, then add this 'batch' as a key in uns, along with all the other information as you indicated. To make things easier, the default plot and computation can be on the first image, and the user will have the option to use a specific image or all images. For a single image, the batch name will be some default value, so this new feature will also work in the single-image case.

As for the plotting issue you mentioned, I usually plot them in an array. For example, if I have 30 images, I plot the results in 5*6 subplots. It allows me to see all the information. I am currently putting all the adata objects into a list and plotting them one by one (actually I found that a dictionary mapping a key to each adata works best). It works, but I don't think it is the most elegant way to do it.

The calculation of spatial_neighbors can be done on each image, but this loop can be hidden inside squidpy. Ideally the user only needs to give a list of the batch names that they want to run.

An additional thing for multiple images is that we could use a batch information table. For example, one could aggregate each image's immune cell count, average tumor KI67 expression, etc. But of course, this is easier to do outside scanpy/squidpy once adata has mature support for multiple images.
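Such a batch information table can be sketched with pandas alone. The column names (`batch`, `cell_type`, `KI67`) and the values are made up for the example; in practice they would come from adata.obs and the expression matrix.

```python
import pandas as pd

# Hypothetical per-cell table pooled over two images.
obs = pd.DataFrame({
    "batch": ["img_1", "img_1", "img_1", "img_2", "img_2"],
    "cell_type": ["immune", "tumor", "tumor", "immune", "immune"],
    "KI67": [0.1, 2.0, 4.0, 0.3, 0.2],
})

# Number of immune cells per image.
immune_count = (obs["cell_type"] == "immune").groupby(obs["batch"]).sum()

# Mean KI67 over tumor cells only, per image.
tumor_ki67 = obs.loc[obs["cell_type"] == "tumor"].groupby("batch")["KI67"].mean()

# One row per image; images without tumor cells get NaN for the mean.
batch_table = pd.DataFrame({
    "n_immune": immune_count,
    "tumor_mean_KI67": tumor_ki67,
})
```

The resulting table has one row per image and can be joined with patient or condition metadata for downstream comparisons.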

Thanks again for brainstorming, I really appreciate your input. We have something to publish very soon, and I have found sq to be great for quick spatial analysis!

giovp commented 3 years ago

hi @YubinXie ,

thanks a lot for the reply in detail. I think this is an interesting API suggestion that is worth considering. I agree that a "batch" key that would slice the anndata according to the "slides" would be very useful.

The biggest hurdle I see now is the plotting function: we are still relying on the Scanpy one, which makes it a bit complicated to allow this flexibility. I will try to draft a new one in the near future to allow for such behaviour and keep you posted here.

giovp commented 2 years ago

The plotting aspect of this will be available when #437 is merged. I won't link it though, because there is the larger issue of making all sq.gr methods work in multi-slide settings (probably not too difficult, but for sure long).