openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License
314 stars 78 forks source link

[spatial decomposition] synthetic data generation #259

Closed giovp closed 1 year ago

giovp commented 3 years ago

Here to discuss the data generation step, which will be added in datasets

https://github.com/giovp/SingleCellOpenProblems/tree/master/openproblems/tasks/spatial_decomposition/datasets

Following previous discussion @vitkl w/ @almaan are we gonna use cell2location strategy for now? if yes, @vitkl could you get the functions of the notebook in a new PR here https://github.com/giovp/SingleCellOpenProblems and then we work from that?

LuckyMD commented 3 years ago

Would you also have a non-simulated dataset to load? I imagine this is difficult for the cell counting, but maybe you have some idea of ground truth in cell type proportion calling for visium spots? You could even use simple metrics that just assess whether spots on the border between two cell type regions are judged to contain a mixture of these cell types?

almaan commented 3 years ago

To the best of my knowledge there's no annotated Visium data sets out there, i.e., where someone has looked at each and every spot (or a subset of them) to then assign a label to them. I'm also not fully sure how feasible that is to get for most of the capture-based spatial methods, the HE-images are not optimal for cell type identification and other methods do not even provide a paired image.

Perhaps the data with the highest degree of verisimilitude would be bulk RNA-seq data where the ground truth is known (as suggested by vitali), that would allow us to evaluate the decomposition aspect. Still, to me there are at least two issues with approach;

  1. Would probably not capture the correct platform effects, which are an important feature to include. See link
  2. If methods that include spatial coordinates, e.g., enforcing spatial homogeneity or similar are developed, bulk RNA-seq would not be compatible with these

My suggestion would perhaps be to still include bulk RNA-seq, but to be clear and be transparent with its flaws as a method for evaluation of spatial decomposition.

almaan commented 3 years ago

Another question is whether we want to include more synthetic data generation strategies than just one. The one I used in the 'stereoscope' paper is extremely simple and easy to implement (basically just taking single cell data and mixing them together in various proportions). One could also envision that we include the methods proposed in RCTD, SPOTLight, etc.

Ofc, we could start with one strategy but as a long time goal it would be nice - and fair - to at least try to implement all methods, especially since this is usually what each developer tries to optimize performance on.

vitkl commented 3 years ago

@giovp I will try to add the functions tomorrow morning but for now, you can download the data from here: https://cell2location.cog.sanger.ac.uk/browser.html?shared=paper/synthetic_with_tissue_zones/

The simulation we used is not much more complex than @almaan 's - it is also combining scRNA-seq cells - but a more complex procedure is used to generate ground truth cell abundance (see description attached 2_cell2location_hyperparameters_synthetic_data.pdf, but you can also get the overview from the notebook).

cell2location paper has Visium data with the number of nuclei per location (estimated by segmentation). You can get the Visium data here and nuclei counts here:

%%capture
# download positions
! mkdir segmentation
! cd segmentation && wget https://raw.githubusercontent.com/vitkl/cell2location_paper/master/notebooks/selected_results/mouse_visium_snrna/segmentation/144600.csv
! cd segmentation && wget https://raw.githubusercontent.com/vitkl/cell2location_paper/master/notebooks/selected_results/mouse_visium_snrna/segmentation/144601.csv
! cd segmentation && wget https://raw.githubusercontent.com/vitkl/cell2location_paper/master/notebooks/selected_results/mouse_visium_snrna/segmentation/144602.csv
! cd segmentation && wget https://raw.githubusercontent.com/vitkl/cell2location_paper/master/notebooks/selected_results/mouse_visium_snrna/segmentation/144603.csv
! cd segmentation && wget https://raw.githubusercontent.com/vitkl/cell2location_paper/master/notebooks/selected_results/mouse_visium_snrna/segmentation/144604.csv

# download images of segmented nuclei (don't do it now to preserve RAM memory)
#! mkdir segmentation/color
#! cd segmentation && wget https://raw.githubusercontent.com/vitkl/cell2location_paper/master/notebooks/selected_results/mouse_visium_snrna/segmentation/color/144600_color.jpg
#! cd segmentation && wget https://raw.githubusercontent.com/vitkl/cell2location_paper/master/notebooks/selected_results/mouse_visium_snrna/segmentation/color/144601_color.jpg
#! cd segmentation && wget https://raw.githubusercontent.com/vitkl/cell2location_paper/master/notebooks/selected_results/mouse_visium_snrna/segmentation/color/144602_color.jpg
#! cd segmentation && wget https://raw.githubusercontent.com/vitkl/cell2location_paper/master/notebooks/selected_results/mouse_visium_snrna/segmentation/color/144603_color.jpg
#! cd segmentation && wget https://raw.githubusercontent.com/vitkl/cell2location_paper/master/notebooks/selected_results/mouse_visium_snrna/segmentation/color/144604_color.jpg

You can add it to anndata with cell2location results like this:

# read in positions of segmented nuclei for each sample
adata_vis.obs['image_name_i'] = [i[41:-4] for i in adata_vis.obs['image_name']]

segm_res = {}
for i in adata_vis.obs['image_name_i'].unique():

    sample_name = adata_vis.obs.loc[adata_vis.obs['image_name_i'] == i,
                                    'sample'].unique()[0]

    res = pd.read_csv('segmentation/' + i + '.csv')
    res['image_name_i'] = i
    res['sample_name'] = sample_name
    segm_res[i] = res

    # don't load images now to preserve RAM
    #path_segm = 'segmentation/color/' + i + '_color.jpg'
    #from cv2 import imread
    #full_img_segm = imread(path_segm, -1)

    #spatial_names = np.array(list(adata_vis.uns['spatial'].keys()))
    #spatial_name = spatial_names[[sample_name in i for i in spatial_names]][0]
    #adata_vis.uns['spatial'][spatial_name]['scalefactors']['tissue_lowres_scalef'] = 1
    #adata_vis.uns['spatial'][spatial_name]['images']['lowres'] = full_img_segm

segm_res = pd.concat(list(segm_res.values()), axis=0)

# filter segmented nuclei by size
segm_res = segm_res.loc[segm_res['size'] < 10**3.3,:]

def map_nuclei2spots(nuclei_df, adata, sample_ind, radius=None):

    from sklearn.neighbors import KDTree

    if radius is None:

        spatial_names = np.array(list(adata.uns['spatial'].keys()))
        spatial_name = spatial_names[[sample in i for i in spatial_names]][0]
        radius = adata.uns['spatial'][spatial_name]['scalefactors']['spot_diameter_fullres'] / 2

    # count nuclei within spots
    tree = KDTree(nuclei_df[['x', 'y']].values)

    adata.obs.loc[sample_ind, 'nuclei_count'] = tree.query_radius(adata.obsm['spatial'][sample_ind,:],
                                                                  radius, count_only=True)
    # measure averages size and std
    spotind = tree.query_radius(adata.obsm['spatial'][sample_ind,:], radius, count_only=False)
    adata.obs.loc[sample_ind, 'nuclei_size_sum'] = [nuclei_df['size'].values[spot].sum() for spot in spotind]
    adata.obs.loc[sample_ind, 'nuclei_size_mean'] = [nuclei_df['size'].values[spot].mean() for spot in spotind]
    adata.obs.loc[sample_ind, 'nuclei_size_std'] = [nuclei_df['size'].values[spot].std() for spot in spotind]

for s in adata_vis.obs['sample'].unique():

    spatial_names = np.array(list(adata_vis.uns['spatial'].keys()))
    spatial_name = spatial_names[[s in i for i in spatial_names]][0]

    map_nuclei2spots(segm_res.loc[segm_res['sample_name'].isin([str(s)]), :], 
                     adata_vis, adata_vis.obs['sample'].isin([s]), 
                     radius=adata_vis.uns['spatial'][spatial_name]['scalefactors']['spot_diameter_fullres'] / 2 * 1.1)

# compute total number of nuclei per section
segm_res['sample_name'].value_counts()
LuckyMD commented 3 years ago

My suggestion would perhaps be to still include bulk RNA-seq, but to be clear and be transparent with its flaws as a method for evaluation of spatial decomposition.

I would consider bulk deconvolution a separate task i think. Even if the methods would overlap. If a user goes to the platform and wants to find out which bulk deconvolution method looks good based on various metrics, then they may not initially look for "spatial decomposition".

I'm also a bit surprised that there is no annotated visium dataset publicly available yet. I would have assumed that a lot of people are generating these at the moment. Are even the publicly available ones from 10X not annotated?

giovp commented 3 years ago

I'm also a bit surprised that there is no annotated visium dataset publicly available yet. I would have assumed that a lot of people are generating these at the moment. Are even the publicly available ones from 10X not annotated?

We have 2 visium annotated in squidpy (same tissue, but hne and fluorescence). Can be called with

import squidpy as sq
adata = sq.datasets.visium_hne_adata() # or
adata = sq.datasets.visium_fluo_adata()

if by annotated you mean proportions, then yes that's a whole other problem.

@vitkl for now we have a baseline data generation step from @almaan but would be nice to add the GP based tissue-region generation, so whenever you have time would be happy to take a look at the code. I think that this step is quite crucial and so will require extensive testing. Still didn't get a reply from organizers on where these tests should be written, but let's keep it in mind!

LuckyMD commented 3 years ago

I initially meant spots... but that would only be useful if you use a metric along the lines of: spots that are called as having shared cell types (50-50?) should have neighbors that have higher purity of these cell types.

vitkl commented 3 years ago

Are even the publicly available ones from 10X not annotated?

@LuckyMD What do you mean by annotated?

We annotated 10X lymph node data (which can be loaded with scanpy). We annotated each location (spot) as germinal center Germinal Center zone (GC) or not. Then, from the literature, we know that 6 cell populations in scRNA data (B_Cycling, B_GC_DZ, B_GC_LZ, B_GC_prePB, FDC, T_CD4+_TfH_GC) are expected to be in the GC. You can use this to benchmark the accuracy of considered methods at detecting the presence of these 6 cell populations in locations/spots annotated as GC.

image image image

vitkl commented 3 years ago

Another aspect of synthetic data generation by combining scRNA-seq cells is to avoid baking in scRNA-seq batch effects into the simulation procedure. For this purpose might be best to use one batch or a few most similar batches.