giovp / sclecture21

Project repo for sc analysis course
MIT License
0 stars 0 forks source link

Image analysis - Philipp #3

Open giovp opened 3 years ago

giovp commented 3 years ago

Then we'll discuss next week about extracting patches from this image with our squidpy package.

@PhilippStaedter

PhilippStaedter commented 3 years ago

Hi Giovanni, I went through the scanpy tutorial you posted above, but I still have some questions regarding different steps:

  1. By using sns.distplot(), one gets a histogram e.g. for the total counts. Now, one can zoom in to determine the outlier regions by restricting e.g. the total counts. Why does the resulting zoomed-in histogram not show the same behaviour in its region as the original histogram?

  2. How do you determine where to cut off the outlier regions? In the examples, it seems like the decision is completely arbitrary.

  3. By using the function sc.pl.spatial(), one can zoom in to different regions of the spacially shown tissue. This is done by passing a parameter crop_coord which has to be a tuple. Nevertheless, I get the error message "unhashable type: index". Do you know how to fix this error? I tried to fix it by using different types than tuple but it didn't work.

  4. In the example of marker genes, it is claimed that the gene CR2 represents the spatial structure for cluster 5 the best. I don't really understand why. Is it because ist has the highest expression level in region 5? But if so, gene HMGN2 could also be the best option?

  5. During the marker gene analysis, for the data set I chose, I get no significant represnetation of the spatial structure for any of the top 10 genes. Does that mean one has to extent the acceptance of X genes, X>10, until hopefully one gene represents the spatial structure?

Thank you in advance for your help.

giovp commented 3 years ago
  1. By using sns.distplot(), one gets a histogram e.g. for the total counts. Now, one can zoom in to determine the outlier regions by restricting e.g. the total counts. Why does the resulting zoomed-in histogram not show the same behaviour in its region as the original histogram?

It's a histogram based on density, not sure what do you mean here but "same behaviour"

  1. How do you determine where to cut off the outlier regions? In the examples, it seems like the decision is completely arbitrary.

It is, most of the time it is. Basically the idea is to remove the long tails of the binomial dist. Usually it is done in an iterative fashion (e.g. fitler some, do some analysis, go back and fitler differently). Here though it doesn't matter much as the dataset is quite clean and good quality (so you could skip this step atogether)

  1. By using the function sc.pl.spatial(), one can zoom in to different regions of the spacially shown tissue. This is done by passing a parameter crop_coord which has to be a tuple. Nevertheless, I get the error message "unhashable type: index". Do you know how to fix this error? I tried to fix it by using different types than tuple but it didn't work.

This function just went through an heavy refactoriung, could you please pull the most recent version of scanpy, install it again as shown in the notes I made, and try again>. That should work.

  1. In the example of marker genes, it is claimed that the gene CR2 represents the spatial structure for cluster 5 the best. I don't really understand why. Is it because ist has the highest expression level in region 5? But if so, gene HMGN2 could also be the best option?

yes indeed, again the tutorial is just to showcase possibility. That gene was a nice candidate to show how it varies in the tissue.

  1. During the marker gene analysis, for the data set I chose, I get no significant represnetation of the spatial structure for any of the top 10 genes. Does that mean one has to extent the acceptance of X genes, X>10, until hopefully one gene represents the spatial structure?

do you mean the spatially variable genes analysis? And by no-significant what do you mean? no p-value threhsold or simply the resulted genes do not look like they are very informative? If the latter, then the method was not very robust (andd it could likely be the case since that methods is not particularly good for that purpose), I will show you an alternative approach on tuesday based on Moran's I index

All very good questions, let me know if I clarified some

PhilippStaedter commented 3 years ago

Hi Giovanni,

thank you for your answer, it did clarify a lot. Reagrding your answer to my third question, which notes are you referring to? Just so that I do not install a wrong version of scanpy where different functions do not work.

Thank you for your help.

giovp commented 3 years ago

remove the clone you have locally and re run

git clone https://github.com/theislab/scanpy.git
cd scanpy
pip install -e .
PhilippStaedter commented 3 years ago

Hi Giovanni,

I tried to remove my scanpy problem by using your suggestion to remove scanpy, clone and install it again. Unfortunately, it still does not work. I tried to somehow fix it in another way but the problem still remains.

Do you have another suggestion?

giovp commented 3 years ago

Hi @PhilippStaedter ,

is the tuple your are passing with 4 integers? are you sure you are passing the right coordinates? Would be good to post the code here

PhilippStaedter commented 3 years ago

Hi Giovanni,

the error I get for running the spatial function with the parameter crop_coord is the following:

sc.pl.spatial(adata, img_key='hires', color=f'cluster_{iRes}', size=1.5, groups=['0', '5'], crop_coord=(1200, 1700, 1900, 1000), alpha=0.5)

Traceback (most recent call last):
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-1-7d054f6d9efb>", line 1, in <module>
    sc.pl.spatial(adata, img_key='hires', color=f'cluster_{iRes}', size=1.5, groups=['0', '5'], crop_coord=(1200, 1700, 1900, 1000), alpha=0.5)
  File "/home/paulstapor/scanpy/scanpy/plotting/_tools/scatterplots.py", line 860, in spatial
    axs = embedding(
  File "/home/paulstapor/scanpy/scanpy/plotting/_tools/scatterplots.py", line 239, in embedding
    color_source_vector = _get_color_source_vector(
  File "/home/paulstapor/scanpy/scanpy/plotting/_tools/scatterplots.py", line 1092, in _get_color_source_vector
    values = values.replace(values.categories.difference(groups), np.nan)
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/pandas/core/arrays/categorical.py", line 2444, in replace
    if to_replace in cat.categories:
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3900, in __contains__
    hash(key)
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3907, in __hash__
    raise TypeError(f"unhashable type: {repr(type(self).__name__)}")
TypeError: unhashable type: 'Index'
giovp commented 3 years ago

mmh this seems like an error with the colouring, not the coords per se. Can you update matplotlib and try again? Also, to really nail this done it would be good to have a list of your env versions sc.settings.plot_header() and a reprex (minimum reproducible example). Try to nail down to a very small set of code and send it here and I'll try again! Thank you!

giovp commented 3 years ago

hi @PhilippStaedter , here's the image feature part First off, you need to download the large tissue image (just use the appropriate flag in sc.datasets.visium_sge() you might have done it already)

the large tif tissue image needs to be read in an additional object, called image container

import squidpy as sq
img = sq.im.ImageContainer("PATH TO IMAGE")
adata = sc.read("PATH TO adata") # or you can just use sc.datasets.visiumg_sge() to load adata

then, with the image container, there is a convenient function that computes a bunch of features. You can have a look at the docs, or just use these lines of python

# define different feature calculation combinations
params = {
    # all features, corresponding only to tissue underneath spot
    'features_orig': 
    {'features': 'summary',
    'size': 1,
    'scale': 1.0,
    'mask_circle': True},
    # summary and histogram features with a bit more context, original resolution
    'features_context': 
    {'features': 'summary',
    'size': 2,
    'scale': 1.0},
    # summary and histogram features with more context and at lower resolution
    'features_lowres' :
    {'features': 'summary',
    'size': 4,
    'scale': 0.25}
}

# extract features with the different parameters in a loop
for feature_name, cur_params in tqdm.tqdm(params.items()):
    # features will be saved in `adata.obsm[feature_name]`
    sq.im.calculate_image_features(adata, img, key_added=feature_name, n_jobs=4, **cur_params)

you can then combine them all together

# fill nans
adata.obsm['features_orig'].fillna(value=0, inplace=True)
# combine features in one dataframe
adata.obsm['features'] = pd.concat([adata.obsm[f] for f in params.keys()], axis='columns')
# make sure that we have no duplicated feature names in the combined table
adata.obsm['features'].columns = ad.utils.make_index_unique(adata.obsm['features'].columns)

the docstrings should have pointers to the respective functions of skimage. It's robably a good idea to understand what they do.

One analysis idea to start with is to create cluster based on iamge features (all of them, or single ones)

# helper function returning a clustering
def cluster_features(features, like=None):
    """Calculate leiden clustering of features.

    Specify filter of features using `like`.
    """
    # filter features
    if like is not None:
        features = features.filter(like=like)
    # create temporary adata to calculate the clustering
    adata = ad.AnnData(features)
    # important - feature values are not scaled, so need to scale them before PCA
    # interesting analysis: what scaling works best? use e.g. clusters in gexp as ground truth?
    sc.pp.scale(adata)
    # calculate leiden clustering
    sc.pp.pca(adata, n_comps=min(10, features.shape[1]-1))
    sc.pp.neighbors(adata)
    sc.tl.leiden(adata)

    return adata.obs['leiden']

adata.obs['features_cluster'] = cluster_features(adata.obsm['features'])

what this does is creating a clustering based on similaritis of features. You can then for isntance plot proportions of clusters in each annotated, like a confusion matrix like this image

(I have code for htis put I'll let you have fun with seaborn 😅 )

let me know for anything else, hope it's exhaustive enough

PhilippStaedter commented 3 years ago

Hi Giovanni,

I have a problem regarding the function sq.im.calculate_image_features(adata, img, key_added=feature_name, n_jobs=4, **cur_params). I get the error message with the following traceback:

File "/snap/pycharm-community/224/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_comm.py", line 290, in _on_run
    r = self.sock.recv(1024)
OSError: [Errno 9] Bad file descriptor

From my understanding of this error message using my python knowledge and an internet research, this error occurs if a file, in this case I suppose the img object, is open and then is closed via another process. The problem I have is that as in your code example there is no process which closes the file....

Do you have any idea what does go wrong here? Thank you in advance for your help.

giovp commented 3 years ago

Intereting, never got that error.

Can you make a reproducible example? difficult to debug ith only that info.

Thank you!

PhilippStaedter commented 3 years ago

Do you mean something like this?

import scanpy as sc
impoer squidpy as sq
import tqdm

adata = sc.datasets.visium_sge(sample_id='Parent_Visium_Human_BreastCancer', include_hires_tiff=True)
img = sq.im.ImageContainer('./data/Parent_Visium_Human_BreastCancer/image.tif')
adata.var_names_make_unique()
params = {
    'features_orig':
    {'features': 'summary', 'size': 1, 'scale': 1.0, 'mask_circle': True}
}
for feature_name, cur_params in tqdm.tqdm(params.items()):
    sq.im.calculate_image_features(adata, img, key_added=feature_name, n_jobs=4, **cur_params)

The last line will then return the error

giovp commented 3 years ago

thank you! I will try to reproduce the issue in the next days. For now, just generate features once and save them in anndata. Tha should work

PhilippStaedter commented 3 years ago

Hi Giovanni,

I am sorry to interrupt you again, but I am not entirely sure if I understand the last picture which you titled as "Basel clusters" correctly.

From the last step adata.obs['features_cluster'] = cluster_features(adata.obsm['features']) I get an anndata object containing the clustering of the earlier added image features based on the leiden algorithm. In your plot "Basel clusters", on the x-axis, we see different categories. My first question would be how these category names are created? Since in my anndata object, I get clusters annotated for all genes, but how can I then label them together into different categories? Do I have to know the biology behind different gene types? Also, in the picture, each tuple (category, cluster) has assigned a number e.g. (enterocyst, 3) has assgined the number 0.82. Are these numbers the scaled pca coordinates? Or do they describe the earlier used quantiles?

Thank you again in advance for your help.

giovp commented 3 years ago

right, easier if I re explain it like this you have

adata.obs["annotation1"] = [1,2,2,3,2,2]
adata.obs["annotation2"] = ["a","a","b","c","c","c"]

create a confusion matrix with percentange (number of objects) in each cluster that is also present in the other cluster.

In the heatmap above, I had two annotations and wanted to find the extent of overlap. So for instance objects that were labelled as enterocyst in Basel cluster were being annotated as 3 or 0 in the New cluster annotation.

Deos it makes sense?

PhilippStaedter commented 3 years ago

I think I get the second part about plotting a confusion matrix but I am not sure I understand the first part.

The object adata.obs in my case consists of a 4325 x 4 data frame containing all 4325 genes and the 4 columns ['in_tissue', 'array_row', 'array_column', 'features_cluster']. This would meen that the annotation2 object is one of those 4 column names and returns then a column with integers and no strings.

If you instead mean the object adata.obsm then we get based on the keyword ['features', 'features_lowres', 'features_orig', ...] a 4325 x 9 data frames with the genes x quantiles which also only returns integers.

Maybe I miss the obvious, but could you again say what you mean by adata.obs[annotation2] ?

giovp commented 3 years ago

The object adata.obs in my case consists of a 4325 x 4 data frame containing all 4325 genes and the 4 columns ['in_tissue', 'array_row', 'array_column', 'features_cluster']. This would meen that the annotation2 object is one of those 4 column names and returns then a column with integers and no strings.

no annotation2 is the annotation from the gene expression, so the output of

sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.leiden(adata, key_added="annotation2")

the idea is to compare cluster found in gnee expression space with clusters found in image feature space

PhilippStaedter commented 3 years ago

I am so sorry but I am still confused.

Before your last message, I thought that sc.tl.leiden(adata, key_added="annotation2") is actually annotation1 since it is the only time I actually get clusters for each of my genes. If this is the gene expression space clustering (which makes sense), where do we compute the image feature clustering? What is annotation1 supposed to be?

As described in my previous message, adata.obs contains of 4 columns ['in_tissue', 'array_row', 'array_column', 'features_cluster']. Here, features_cluster == annotation2, the gene expression clustering (I thought this was annotation1). And all the other three columns cannot represent image clusters because from my understanding they represent the existence and 2D positioning of the genes in the tissue.

PhilippStaedter commented 3 years ago

Hi Giovanni, as an addition to my presentation from yesterday, here are my results regarding the percentages of labeld spots for gene expression space clusters (x-axis) in image feature space clusters (y-axis).

heatmap_clusters_overlap

PhilippStaedter commented 3 years ago

Hi Giovanni, for tomorrow, I have all images but the one where the gene expression clusters are scattered all over the image feature space. Could you please explain to me again, when I have both cluster annotation for ge and if space how can I get the required image? Or could you have a look at my function if_clusters_in_ge_space() and tell my where my mistake is?

Edit: I finally found my mistake and have all the images. See you tomorrow

PhilippStaedter commented 3 years ago

Hi Giovanni, I just finished my last exam and wanted to ask what the plan is regarding the next two weeks. You mentioned using neural network analysis. Could you explain again what I need for that? Besides that I assume it is now time to start writing and selecting nice pictures for the presentation.

If you also want to talk about the neural network part over Zoom, I have time.

Thank you in advance

PhilippStaedter commented 3 years ago

Hi Giovanni,

could you send me the paper where they benchmarked different ranking tests, e.g. t-test, wilcoxon test, ... ? I would like to reference it in our report. We talked about it three weeks ago and I am not sure if you have already send me the paper.

Thank you

giovp commented 3 years ago

could you send me the paper where they benchmarked different ranking tests, e.g. t-test, wilcoxon test, ... ? I would like to reference it in our report.

nature.com/articles/nmeth.4612

I just finished my last exam and wanted to ask what the plan is regarding the next two weeks. You mentioned using neural network analysis. Could you explain again what I need for that? Besides that I assume it is now time to start writing and selecting nice pictures for the presentation.

resnet example: https://squidpy.readthedocs.io/en/latest/external_tutorials/tutorial_tf.html

image features example: https://squidpy.readthedocs.io/en/latest/auto_examples/image/compute_features.html

can you have a look at the iamge feature example and see if you can add more image features for your task (e.g. not only quantiles but also standad dev etc). no need to do the resnet example, yes good to focus on good figures project description.

PhilippStaedter commented 3 years ago

Hi Giovanni,

I looked at the link for getting more features, but I got an error message which I until now could not fix. Here a small reproducible example, maybe you can run it on your laptop.

import scanpy as sc
import squidpy as sq
import anndata as ad

adata = sc.datasets.visium_sge(sample_id='Parent_Visium_Human_BreastCancer',  include_hires_tiff=True)
adata.var_names_make_unique()
img = sq.im.ImageContainer('./data/Parent_Visium_Human_BreastCancer/image.tif')

params = {'features': 'summary', 'size': 2, 'scale': 1.0}
sq.im.calculate_image_features(adata, img, key_added='features_doublesize', n_jobs=4, **params)

In the last line in the function sq.im.calculate_image_features() the following error occurs:

Traceback (most recent call last):
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
    r = call_item()
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/paulstapor/squidpy/squidpy/im/_feature.py", line 113, in _calculate_image_features_helper
    for crop in img.generate_spot_crops(adata, obs_names=obs_ids, return_obs=False, as_array=False, **kwargs):
  File "/home/paulstapor/squidpy/squidpy/im/_container.py", line 594, in generate_spot_crops
    crop = self.crop_center(y=spatial[i][1], x=spatial[i][0], radius=radius, **kwargs)
  File "/home/paulstapor/squidpy/squidpy/im/_container.py", line 487, in crop_center
    return self.crop_corner(  # type: ignore[no-any-return]
TypeError: crop_corner() got multiple values for keyword argument 'size'

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-ce501925075a>", line 1, in <module>
    sq.im.calculate_image_features(adata_2, img, key_added='features_context', n_jobs=4, **cur_params)
  File "/home/paulstapor/squidpy/squidpy/im/_feature.py", line 86, in calculate_image_features
    res = parallelize(
  File "/home/paulstapor/squidpy/squidpy/_utils.py", line 166, in wrapper
    res = jl.Parallel(n_jobs=n_jobs, backend=backend)(
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/paulstapor/anaconda3/envs/PENV/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
TypeError: crop_corner() got multiple values for keyword argument 'size'

Yesterday and today, I tried to resolve this issue but I did not find a solution, not in the squidpy documentation and not in the tutorial. Of course I keep trying, but maybe you see directly how I have to adjust the dictionary params, so that there are not multiple values for 'size'. Also, it does work for the default values, e.g. when I do not have to set params in the function.

Thank you

giovp commented 3 years ago

can you try to