gustaveroussy / sopa

Technology-invariant pipeline for spatial omics analysis (Xenium / Visium HD / MERSCOPE / CosMx / PhenoCycler / MACSima / ...) that scales to millions of cells
https://gustaveroussy.github.io/sopa/
BSD 3-Clause "New" or "Revised" License
123 stars 14 forks source link

Advice on simple IF hystocytometry #81

Closed cstrlln closed 3 months ago

cstrlln commented 3 months ago

Similar I guess to issue #78, I'm trying to use sopa to do histocymetry-like protocol with a one-round classic IF tissue staining. I have confocal image with 4 channels, I want to use cellpose with one channel (e.g. CH3) and would like to get the average intensity info of the other channels based on that segmentation. Then use the different levels of fluorescence to assign cell labels, cells will have combinations of CH1+CH2 or CH4+CH2 or single CH4 and CH1 with no CH2 intensity.

What would your workflow advice on going through this?

I tested the following code, but I guess the way I'm doing it is trying to look for cells with either of the markers and not taking into account double positives.

I'm still troubleshooting the min_area in the segmentation method, do you have a recommendation for this based on the diameter passed to cellpose?

import spatialdata
import sopa.segmentation
import sopa.io
import aicsimageio

sdata = sopa.io.aicsimageio("data_if/Vh169 7dpt_4 spleen1 30x stitch.oir", z_stack=0, image_models_kwargs=None, aics_kwargs=None)

sdata

SpatialData object with: └── Images └── 'image': MultiscaleSpatialImage[cyx] (4, 2947, 3921), (4, 1473, 1960), (4, 736, 980), (4, 368, 490), (4, 184, 245) with coordinate systems: ▸ 'global', with elements: image (Images)

image_key = "image"

patches = sopa.segmentation.Patches2D(sdata, image_key, patch_width=1200, patch_overlap=50)
patches.write();

[INFO] (sopa.segmentation.patching) 12 patches were saved in sdata['sopa_patches']

from sopa._sdata import get_spatial_image

print(get_spatial_image(sdata, image_key).c.values)

['CH1' 'CH2' 'CH3' 'CH4']

channels = ["CH3"]

method = sopa.segmentation.methods.cellpose_patch(diameter=7, channels=channels, flow_threshold=0.8, cellprob_threshold=-12,  pretrained_model= "model_if/CP_20240502_095143")
segmentation = sopa.segmentation.StainingSegmentation(sdata, method, channels, min_area=2500)

# The cellpose boundaries will be temporary saved here. You can choose a different path
cellpose_temp_dir = "tuto.zarr/.sopa_cache/cellpose"

# parallelize this for loop yourself (or use the Snakemake pipeline)
for patch_index in range(len(sdata['sopa_patches'])):
    segmentation.write_patch_cells(cellpose_temp_dir, patch_index)

[INFO] (sopa.segmentation.shapes) Percentage of non-geometrized cells: 0.08% (usually due to segmentation artefacts) [INFO] (sopa.segmentation.shapes) Percentage of non-geometrized cells: 0.00% (usually due to segmentation artefacts) [INFO] (sopa.segmentation.shapes) Percentage of non-geometrized cells: 0.00% (usually due to segmentation artefacts) [INFO] (sopa.segmentation.shapes) Percentage of non-geometrized cells: 0.12% (usually due to segmentation artefacts) [INFO] (sopa.segmentation.shapes) Percentage of non-geometrized cells: 0.07% (usually due to segmentation artefacts) [INFO] (sopa.segmentation.shapes) Percentage of non-geometrized cells: 0.00% (usually due to segmentation artefacts)

cells = sopa.segmentation.StainingSegmentation.read_patches_cells(cellpose_temp_dir)
cells = sopa.segmentation.shapes.solve_conflicts(cells)

shapes_key = "cellpose_boundaries" # name of the key given to the cells in sdata.shapes

sopa.segmentation.StainingSegmentation.add_shapes(sdata, cells, image_key, shapes_key)

Reading patches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 251.48it/s] [INFO] (sopa.segmentation.stainings) Found 11295 total cells Resolving conflicts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2066/2066 [00:00<00:00, 6959.47it/s] [INFO] (sopa.segmentation.stainings) Added 10459 cell boundaries in sdata['cellpose_boundaries']

aggregator = sopa.segmentation.Aggregator(sdata, image_key=image_key, shapes_key=shapes_key)

aggregator.compute_table(average_intensities=True)

NFO] (sopa.segmentation.aggregate) Averaging channels intensity over 10459 cells with expansion 0.0 [########################################] | 100% Completed | 1.52 sms /Users/carlos/mambaforge/envs/sopa/lib/python3.10/site-packages/spatialdata/_core/_elements.py:92: UserWarning: Key cellpose_boundaries already exists. Overwriting it. self._check_key(key, self.keys(), self._shared_keys)

sdata

SpatialData object with: ├── Images │ └── 'image': MultiscaleSpatialImage[cyx] (4, 2947, 3921), (4, 1473, 1960), (4, 736, 980), (4, 368, 490), (4, 184, 245) ├── Shapes │ ├── 'cellpose_boundaries': GeoDataFrame shape: (10459, 1) (2D shapes) │ └── 'sopa_patches': GeoDataFrame shape: (6, 3) (2D shapes) └── Tables └── 'table': AnnData (10459, 4) with coordinate systems: ▸ 'global', with elements: image (Images), cellpose_boundaries (Shapes), sopa_patches (Shapes)

from sopa.annotation import higher_z_score

marker_cell_dict = {
    "CH1": "marker-1",
    "CH2": "marker-2",
    "CH3": "segment_marker",
    "CH4": "marker-3"
}

higher_z_score(sdata.tables["table"], marker_cell_dict)

[INFO] (sopa.annotation.fluorescence) Annotation counts: cell_type marker-1 3794 segment_marker 2617 marker-3 2218 marker-2 1830 Name: count, dtype: int64

sopa.io.write_report("report.html", sdata)

[INFO] (sopa.io.report.generate) Writing general_section [INFO] (sopa.io.report.generate) Writing cell_section [INFO] (sopa.io.report.generate) Writing channel_section [INFO] (sopa.io.report.generate) Writing transcripts_section [INFO] (sopa.io.report.generate) Writing representation_section [INFO] (sopa.io.report.generate) Computing UMAP on 10459 cells OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead. /Users/carlos/mambaforge/envs/sopa/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm [INFO] (sopa.io.report.generate) Writing report to report.html

sdata.tables["table"]

AnnData object with n_obs × n_vars = 10459 × 4 obs: 'region', 'slide', 'cell_id', 'area', 'cell_type' uns: 'sopa_attrs', 'spatialdata_attrs', 'pca', 'neighbors', 'umap', 'cell_type_colors' obsm: 'spatial', 'z_scores', 'X_pca', 'X_umap' varm: 'PCs' obsp: 'distances', 'connectivities'

quentinblampey commented 3 months ago

Hello @cstrlln, thanks for the very detailed issue, your code is looking good!

Indeed the higher_z_score function is very simple, and does not take into account double positives. Still, after running this function, it will save z-scores for each marker into adata.obsm["z_scores"], and you can probably use this with a simple rule to annotate your cell-types (by saying that the marker is positive if z_score > 0). You can also directly run sopa.annotate.preprocess_fluo to directly get the z-scores. What I just described is a very simple approach, but it might work. If it doesn't look convincing, you can try more complex approaches such as using scyan (in that case, since you have imaging data, I advise to set prior_std=0.8 or prior_std=1 when initializing the model).

Regarding the min_area variable, I think that you can start with something like diameter ** 2 / 2 or diameter ** 2 / 3. I may add this as a default value in the future

cstrlln commented 3 months ago

Thanks! Interesting, I will look into scyan. I also use spectral flow so could be pretty helpful.

I got into an error when trying to do sdata.write:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[20], line 1
----> 1 sdata.write("sdata2.zarr")

File ~/mambaforge/envs/sopa/lib/python3.10/site-packages/spatialdata/_core/spatialdata.py:1111, in SpatialData.write(self, file_path, storage_options, overwrite, consolidate_metadata)
   1109 except Exception as e:  # noqa: B902
   1110     self._path = None
-> 1111     raise e
   1113 if consolidate_metadata:
   1114     # consolidate metadata to more easily support remote reading
   1115     # bug in zarr, 'zmetadata' is written instead of '.zmetadata'
   1116     # see discussion https://github.com/zarr-developers/zarr-python/issues/1121
   1117     zarr.consolidate_metadata(store, metadata_key=".zmetadata")

File ~/mambaforge/envs/sopa/lib/python3.10/site-packages/spatialdata/_core/spatialdata.py:1040, in SpatialData.write(self, file_path, storage_options, overwrite, consolidate_metadata)
   1038     for name in keys:
   1039         elem_group = self._init_add_element(name=name, element_type="images", overwrite=overwrite)
-> 1040         write_image(
   1041             image=self.images[name],
   1042             group=elem_group,
   1043             name=name,
   1044             storage_options=storage_options,
   1045         )
   1047         # TODO(giovp): fix or remove
   1048         # reload the image from the Zarr storage so that now the element is lazy loaded,
   1049         # and most importantly, from the correct storage
   1050         # element_path = Path(self.path) / "images" / name
   1051         # _read_multiscale(element_path, raster_type="image")
   1053 if len(self.labels):

File ~/mambaforge/envs/sopa/lib/python3.10/site-packages/spatialdata/_io/io_raster.py:206, in write_image(image, group, name, fmt, storage_options, **metadata)
    198 def write_image(
    199     image: Union[SpatialImage, MultiscaleSpatialImage],
    200     group: zarr.Group,
   (...)
    204     **metadata: Union[str, JSONDict, list[JSONDict]],
    205 ) -> None:
--> 206     _write_raster(
    207         raster_type="image",
    208         raster_data=image,
    209         group=group,
    210         name=name,
    211         fmt=fmt,
    212         storage_options=storage_options,
    213         **metadata,
    214     )

File ~/mambaforge/envs/sopa/lib/python3.10/site-packages/spatialdata/_io/io_raster.py:166, in _write_raster(raster_type, raster_data, group, name, fmt, storage_options, label_metadata, channels_metadata, **metadata)
    162     overwrite_coordinate_transformations_raster(
    163         group=_get_group_for_writing_transformations(), transformations=transformations, axes=input_axes
    164     )
    165 elif isinstance(raster_data, MultiscaleSpatialImage):
--> 166     data = _iter_multiscale(raster_data, "data")
    167     list_of_input_axes: list[Any] = _iter_multiscale(raster_data, "dims")
    168     assert len(set(list_of_input_axes)) == 1

File ~/mambaforge/envs/sopa/lib/python3.10/site-packages/spatialdata/_io/_utils.py:139, in _iter_multiscale(data, attr)
    137     names: set[str] = variables.difference({"c", "z", "y", "x"})
    138     if len(names) != 1:
--> 139         raise ValueError(f"Invalid variable name: `{names}`.")
    140 name: str = next(iter(names))
    141 if attr is not None:

ValueError: Invalid variable name: `{'Z', 'image'}`.

I'm not that familiar with python structures but after checking with a chatbot, it looks like the issue was the coordinate being labelled as 'Z' and the function was looking for 'z'?

This is what the data looked like

print(sdata.images.keys())

KeysView({'image': DataTree('None', parent=None) ├── DataTree('scale0') │ Dimensions: (c: 4, y: 2947, x: 3921) │ Coordinates: │ Z float64 0.0 │ c (c) <U3 'CH1' 'CH2' 'CH3' 'CH4' │ y (y) float64 0.5 1.5 2.5 3.5 ... 2.944e+03 2.946e+03 2.946e+03 │ x (x) float64 0.5 1.5 2.5 3.5 ... 3.918e+03 3.92e+03 3.92e+03 │ Data variables: │ image (c, y, x) uint16 dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray> ├── DataTree('scale1') │ Dimensions: (c: 4, y: 1473, x: 1960) │ Coordinates: │ Z float64 0.0 │ c (c) <U3 'CH1' 'CH2' 'CH3' 'CH4' │ y (y) float64 1.0 3.001 5.002 7.002 ... 2.942e+03 2.944e+03 2.946e+03 │ x (x) float64 1.0 3.001 5.001 7.002 ... 3.916e+03 3.918e+03 3.92e+03 │ Data variables: │ image (c, y, x) uint16 dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray> ├── DataTree('scale2') │ Dimensions: (c: 4, y: 736, x: 980) │ Coordinates: │ Z float64 0.0 │ c (c) <U3 'CH1' 'CH2' 'CH3' 'CH4' │ y (y) float64 2.002 6.006 10.01 ... 2.937e+03 2.941e+03 2.945e+03 │ x (x) float64 2.001 6.002 10.0 14.0 ... 3.911e+03 3.915e+03 3.919e+03 │ Data variables: │ image (c, y, x) uint16 dask.array<chunksize=(1, 736, 980), meta=np.ndarray> ├── DataTree('scale3') │ Dimensions: (c: 4, y: 368, x: 490) │ Coordinates: │ Z float64 0.0 │ c (c) <U3 'CH1' 'CH2' 'CH3' 'CH4' │ y (y) float64 4.004 12.01 20.02 ... 2.927e+03 2.935e+03 2.943e+03 │ x (x) float64 4.001 12.0 20.01 ... 3.901e+03 3.909e+03 3.917e+03 │ Data variables: │ image (c, y, x) uint16 dask.array<chunksize=(1, 368, 490), meta=np.ndarray> └── DataTree('scale4') Dimensions: (c: 4, y: 184, x: 245) Coordinates: Z float64 0.0

So I did this:

# Access the DataTree 
image_data = sdata.images['image'] 

# Iterate over the scale levels and rename the 'Z' axis at each level
for scale_level in image_data:
    image_data[scale_level] = image_data[scale_level].rename({"Z": "z"})

and then it worked.

Is this issue because the way the image was imported? or the reader? Was there a way to avoid this issue?

cstrlln commented 3 months ago

also, just noticed the z_scores are empty:

import pandas as pd
# Access the AnnData object directly
table = sdata.tables['table']  # Get the table from sdata

# Access the z_scores from the table (which is AnnData)
z_scores = table.obsm['z_scores']

# Create a DataFrame with the correct indices (from the table)
z_scores_df = pd.DataFrame(z_scores, index=table.var.index, columns=table.obs.index)

# Display the DataFrame (you might need to adjust the number of rows/columns shown)
print(z_scores_df)

     aaaaaaaa-1  aaaaaaab-1  aaaaaaac-1  aaaaaaad-1  aaaaaaae-1  aaaaaaaf-1  \
CH1         NaN         NaN         NaN         NaN         NaN         NaN   
CH2         NaN         NaN         NaN         NaN         NaN         NaN   
CH3         NaN         NaN         NaN         NaN         NaN         NaN   
CH4         NaN         NaN         NaN         NaN         NaN         NaN   

     aaaaaaag-1  aaaaaaah-1  aaaaaaai-1  aaaaaaaj-1  ...  aaaacinb-1  \
CH1         NaN         NaN         NaN         NaN  ...         NaN   
CH2         NaN         NaN         NaN         NaN  ...         NaN   
CH3         NaN         NaN         NaN         NaN  ...         NaN   
CH4         NaN         NaN         NaN         NaN  ...         NaN   

     aaaacinc-1  aaaacind-1  aaaacine-1  aaaacinf-1  aaaacing-1  aaaacinh-1  \
CH1         NaN         NaN         NaN         NaN         NaN         NaN   
CH2         NaN         NaN         NaN         NaN         NaN         NaN   
CH3         NaN         NaN         NaN         NaN         NaN         NaN   
CH4         NaN         NaN         NaN         NaN         NaN         NaN   

     aaaacini-1  aaaacinj-1  aaaacink-1  
CH1         NaN         NaN         NaN  
CH2         NaN         NaN         NaN  
CH3         NaN         NaN         NaN  
CH4         NaN         NaN         NaN  

[4 rows x 10459 columns]

EDIT: actually no worries, seems like was accessing the wrong way, this worked:

sdata.tables['table'].obsm['z_scores']
CH1 CH2 CH3 CH4
-0.584222 -0.513755 -1.071071 -0.102799
-0.489893 -0.112673 0.724328 -0.221049
-1.074451 1.332169 0.887616 1.489851
0.380193 -0.065688 -0.765778 0.580289
-0.384839 -0.579623 -0.786052 0.335864
... ... ... ...
-0.808834 -0.761074 1.571223 0.281716
-0.955882 -1.096274 0.103397 -0.734074
-0.613158 -0.906637 0.662418 -0.497469
-1.093064 -1.056596 0.369210 -0.704209
-0.691784 -1.118102 0.942739 -0.692186

10459 rows × 4 columns

still getting familiar with python

quentinblampey commented 3 months ago

Regarding the issue about the "Z" axis, this is an issue of the reader. I'll update it to drop the Z axis (since it contains only one element), or at least rename it to "z".

Concerning the z-scores, can you check that table.obsm['z_scores'] is right? I think the error came when you created a DataFrame: can you try the opposite index/columns, i.e. pd.DataFrame(z_scores, index=table.obs.index, columns=table.var.index)?

cstrlln commented 3 months ago

Yep, it was that for z_scores, the transposed values were the issue when trying to use pd.DataFrame.

How can I access the raw intensity values? I guess I would need this for scyan?

and where is the cellpose mask or segmentation stored? would like to check the segmentation more closely

quentinblampey commented 3 months ago

This depends on the technology you have:

For cellpose, we save the segmentation as polygons instead of masks, because masks can be very heavy in term of memory/storage. They can be found in the spatialdata object, at sdata["cellpose_boundaries"], which is a GeoPandas Dataframe. If you want to visualize these boundaries in Python directly, you can use spatialdata-plot

Hope this helps!

cstrlln commented 3 months ago

Great, thanks for the responses.

Just to be sure, for the case of protein only, I can take the intensities from adata.X, export this as a csv and then input this in scyan?

Would it be possible to go back to sopa once I have obtained the cell label identities? for doing spatial analysis.

I know that this is out of the scope of sopa, but figuring out how to work with protein only cases would be useful. Could you suggest an outline how to proceed?

In any case I think this closes the original issue.

quentinblampey commented 3 months ago

Actually you don't even need to export adata.X as a csv, because Scyan already works on anndata objects, so you can simply follow the Scyan tutorials on your adata object.

In particular, you can skip the "loading a FCS/CSV" section, and directly start at this preprocessing step (I think asinh normalization should be fine for your type of data, but don't hesitate to try other transformations like sc.pp.log1p or others parameters). Then, you can create a "knowledge" table, this is basically a csv that looks like this (if I understood your annotation rule correctly):


ct,CH1,CH2,CH4
A,1,1,-1
B,-1,1,1
C,1,-1,-1
D,-1,-1,1

Then you can train Scyan according to this tutorial, but since you have fluorescence data I advise to try a high prior_std when creating a model, e.g. scyan.Scyan(adata, table, prior_std=1)

I'm closing the issue, but don't hesitate to let me know and ask other questions if you need more help!

cstrlln commented 3 months ago

Thanks, I'll give this a try!

cstrlln commented 3 months ago

Sorry going back to this I noticed this was not unswered: how best to go back to sopa once cell identities are obtained.

quentinblampey commented 3 months ago

Hi, usually sopa is used as a preprocessing step, and scyan can be used afterwards for cell annotation. Normally, you don't need sopa afterwards.

Still, note that both scyan and sopa works on the same anndata object, so you can seamlessly use both libraries one after the other. Here is some pseudo-code:

# run sopa
sdata = ...

adata = sdata["table"]
table = pd.read_csv("...", index_col=0) # read your knowledge table
model = scyan.Scyan(adata, table, prior_std=1)
model.fit()
model.predict()

# now, adata is updated, as well as sdata which contains the adata object inside sdata["table"]

But if you run Scyan separately and that you saved your anndata object somewhere else, you'll need to update the table from your spatialdata object (i.e. sdata["table"] = adata, where adata is the AnnData object on which you run your annotation), and then continue with sopa.

If you need to see the cell-types from Scyan inside the Xenium Explorer, then you can use this function, or follow this tutorial.