scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.9k stars 597 forks source link

setting gene_symbol to select symbol from adata.var fails in sc.pl.umap() #455

Closed sebpott closed 5 years ago

sebpott commented 5 years ago

I would like to color the umap representation using gene expression values. For ease of use I'd like to display the Gene name instead of gene_id which are the adata.var_names in my case. Setting gene_symbols = 'Symbol' doesn't seem to work for me or I am using it the wrong way.

When running sc.pl.umap(adata, gene_symbols = 'Symbol', color = ['Tnnt2'])

I get the follwoing error message:

 ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-116-e09d49f2528c> in <module>
----> 1 sc.pl.umap(adata, gene_symbols = 'Symbol', color = ['Tnnt2'])

/anaconda3/envs/scanpy/lib/python3.6/site-packages/scanpy/plotting/tools/scatterplots.py in umap(adata, **kwargs)
     27     If `show==False` a `matplotlib.Axis` or a list of it.
     28     """
---> 29     return plot_scatter(adata, basis='umap', **kwargs)
     30 
     31 

/anaconda3/envs/scanpy/lib/python3.6/site-packages/scanpy/plotting/tools/scatterplots.py in plot_scatter(adata, color, use_raw, sort_order, edges, edges_width, edges_color, arrows, arrows_kwds, basis, groups, components, projection, color_map, palette, size, frameon, legend_fontsize, legend_fontweight, legend_loc, ncols, hspace, wspace, title, show, save, ax, return_fig, **kwargs)
    275         color_vector, categorical = _get_color_values(adata, value_to_plot,
    276                                                       groups=groups, palette=palette,
--> 277                                                       use_raw=use_raw)
    278 
    279         # check if higher value points should be plot on top

/anaconda3/envs/scanpy/lib/python3.6/site-packages/scanpy/plotting/tools/scatterplots.py in _get_color_values(adata, value_to_plot, groups, palette, use_raw)
    665         raise ValueError("The passed `color` {} is not a valid observation annotation "
    666                          "or variable name. Valid observation annotation keys are: {}"
--> 667                          .format(value_to_plot, adata.obs.columns))
    668 
    669     return color_vector, categorical

ValueError: The passed `color` Tnnt2 is not a valid observation annotation or variable name. Valid observation annotation keys are: Index(['Sample', 'n_counts', 'n_genes', 'percent_mito', 'log_counts',
       'louvain'],
      dtype='object')

adata.var contains the column "Symbol" and "Tnnt2" is present:

adata.var[adata.var['Symbol'] == 'Tnnt2']

Symbol type highly_variable means dispersions dispersions_norm
Tnnt2 protein_coding True 0.923869 4.090601 11.370244

run with: scanpy==1.3.7 anndata==0.6.17 numpy==1.14.6 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.19.1 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1

ivirshup commented 5 years ago

Hey @sebpott. That feature was implemented after the 1.3.7 release, so it should work if you use the development version or wait until the next release.

sebpott commented 5 years ago

thanks @ivirshup, will do!

LuckyMD commented 5 years ago

@ivirshup Just to clarify... would you not have to have 'Tnnt2' either as a column in .obs or in .var_names if you want to use it with the color parameter? Or does gene_symbols also allow you to look for a color parameter in the gene_symbols .var column?

ivirshup commented 5 years ago

Yes (definitely) and yes (I think)

Passing an argument for gene_symbols means that instead of searching .var_names, the column of .var whose name was passed will be searched.

For example, if you had an AnnData object adata with ensembl ids as adata.var_name, and hgnc symbols under the column adata.var[“gene_name”], the following calls should plot similar things (different titles):

sc.pl.umap(adata, color=[“ENSG00000261371”])
sc.pl.umap(adata, color=[“PECAM1”], gene_symbols=“gene_name”)
LuckyMD commented 5 years ago

That's really cool! A nice additional functionality to what was first introduced in pl.scatter.

csijcs commented 5 years ago

Hello, I'm having a bit of trouble with this. I know the issues is closed, but I thought it might be better to continue this discussion rather than start a new one, though I can do that if you prefer. I have an AnnData object adata with ensembl ids as adata.var_name and mouse gene symbols under the column adata.var[“gene_name”]. When I call: sc.pl.umap(adata, color=['ENSMUSG00000074637']) It plots no problem. However, when I call: sc.pl.umap(adata, color=['Sox2'], gene_symbol='gene_name') I get the following error:

Traceback (most recent call last):

  File "<ipython-input-559-05c51c5cc5d6>", line 1, in <module>
    sc.pl.umap(adata, color=['Sox2'], gene_symbol='gene_name')

  File "/anaconda3/lib/python3.6/site-packages/scanpy/plotting/_tools/scatterplots.py", line 29, in umap
    return plot_scatter(adata, basis='umap', **kwargs)

  File "/anaconda3/lib/python3.6/site-packages/scanpy/plotting/_tools/scatterplots.py", line 275, in plot_scatter
    use_raw=use_raw, gene_symbols=gene_symbols)

  File "/anaconda3/lib/python3.6/site-packages/scanpy/plotting/_tools/scatterplots.py", line 670, in _get_color_values
    .format(value_to_plot, adata.obs.columns))

ValueError: The passed `color` Sox2 is not a valid observation annotation or variable name. Valid observation annotation keys are: Index(['timepoint', 'replicate_id', 'n_genes', 'percent_mito', 'n_counts',
       'louvain'],
      dtype='object')

Inspecting adata.var["gene_name"] give:

index
ENSMUSG00000002459            Rgs20
ENSMUSG00000033740             St18
ENSMUSG00000067879    3110035E14Rik
ENSMUSG00000025912            Mybl1
ENSMUSG00000016918            Sulf1
ENSMUSG00000025938          Slco5a1
ENSMUSG00000025930              Msc
ENSMUSG00000025921            Rdh10
ENSMUSG00000025777            Gdap1
ENSMUSG00000025776         Crispld1
ENSMUSG00000025927           Tfap2b
ENSMUSG00000025931            Paqr8
ENSMUSG00000026158           Ogfrl1
...

I'm not sure what I'm doing wrong here. I can do just about anything using the ensembl ids, but I am having a lot of trouble using the gene symbols. I would like to be able to use the gene symbols in the plots for umap, violin, pca, etc. Any help would be much appreciated. Thanks!

fidelram commented 5 years ago

should be gene_symbols in plural.

On Thu, Mar 14, 2019 at 9:46 AM csijcs notifications@github.com wrote:

Hello, I'm having a bit of trouble with this. I know the issues is closed, but I thought it might be better to continue this discussion rather than start a new one, though I can do that if you prefer. I have an AnnData object adata with ensembl ids as adata.var_name and mouse gene symbols under the column adata.var[“gene_name”]. When I call: sc.pl.umap(adata, color=['ENSMUSG00000074637']) It plots no problem. However, when I call: sc.pl.umap(adata, color=['Sox2'], gene_symbol='gene_name') I get the following error:

Traceback (most recent call last):

File "", line 1, in

sc.pl.umap(adata, color=['Sox2'], gene_symbol='gene_name')

File "/anaconda3/lib/python3.6/site-packages/scanpy/plotting/_tools/scatterplots.py", line 29, in umap

return plot_scatter(adata, basis='umap', **kwargs)

File "/anaconda3/lib/python3.6/site-packages/scanpy/plotting/_tools/scatterplots.py", line 275, in plot_scatter

use_raw=use_raw, gene_symbols=gene_symbols)

File "/anaconda3/lib/python3.6/site-packages/scanpy/plotting/_tools/scatterplots.py", line 670, in _get_color_values

.format(value_to_plot, adata.obs.columns))

ValueError: The passed color Sox2 is not a valid observation annotation or variable name. Valid observation annotation keys are: Index(['timepoint', 'replicate_id', 'n_genes', 'percent_mito', 'n_counts',

   'louvain'],

  dtype='object')

Inspecting adata.var["gene_name"] give:

index

ENSMUSG00000002459 Rgs20

ENSMUSG00000033740 St18

ENSMUSG00000067879 3110035E14Rik

ENSMUSG00000025912 Mybl1

ENSMUSG00000016918 Sulf1

ENSMUSG00000025938 Slco5a1

ENSMUSG00000025930 Msc

ENSMUSG00000025921 Rdh10

ENSMUSG00000025777 Gdap1

ENSMUSG00000025776 Crispld1

ENSMUSG00000025927 Tfap2b

ENSMUSG00000025931 Paqr8

ENSMUSG00000026158 Ogfrl1

...

I'm not sure what I'm doing wrong here. I can do just about anything using the ensembl ids, but I am having a lot of trouble using the gene symbols. I would like to be able to use the gene symbols in the plots for umap, violin, pca, etc. Any help would be much appreciated. Thanks!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/theislab/scanpy/issues/455#issuecomment-472756442, or mute the thread https://github.com/notifications/unsubscribe-auth/AEu_1RNhQOjQK5W1y5e32ifr1ltXXdmtks5vWgxygaJpZM4aczVa .

csijcs commented 5 years ago

Ok great, thanks.

Similarly, I still can't call by gene name with sc.pl.violin

sc.pl.violin(adata, ['ENSMUSG00000074637'], groupby='louvain') works, but not sc.pl.violin(adata, ['Sox2'], groupby='louvain')

Another thing that would be helpful is if sc.tl.rank_genes_groups would use gene_symbols instead of the Ensembl ID. Is it possible?

It seems like these re getting lost in the Louvain grouping, but I can't see in the documentation how to retain or recall them. Would appreciate your help. Thanks!

ivirshup commented 5 years ago

@fidelram knows more about the plotting code than me, but here's some recommendations:

csijcs commented 5 years ago

I tried to set var_names from gene_symbols, and I get a warning message: Variable names are not unique. To make them unique, call.var_names_make_unique.

In calling adata.var_names_make_unique() I get the error: TypeError: unsupported operand type(s) for +: 'float' and 'str'

I can ignore this and take it through most of the analysis and am able to make the plots and rank the genes by name, however, I am unable to save. Calling adata.write('./write/adata.h5ad') gives the following error:

  File "pandas/_libs/src/inference.pyx", line 1472, in pandas._libs.lib.map_infer

TypeError: object of type 'float' has no len()

Also, the clustering is slightly different, I'm guessing from not having unique gene names. I've looked through the documentation for sc.pl.rank_genes_groups_* and cannot figure out how to keep the index as the Ensembl gene ID and just use gene_symbols to call the plots (sc.pl.violin, etc.) and use the sc.tl.rank_genes_groups.

LuckyMD commented 5 years ago

I have noticed that clustering is always slightly different on different systems... even when using exactly the same code. This will have to do with the underlying numerical libraries I assume. I don't think that has anything to do with gene names.

falexwolf commented 5 years ago

The slight difference is due to scikit learn's implementation of PCA. If the top PCs are enough for you, you get perfect reproducibility, as in the tutorials clustering and trajectory inference. Higher PCs are subject to the instability of the underlying iterative methods for computing them. You'll always see slight inconsistencies. However, I've never seen this to affect any conclusion drawn from an analysis.

LuckyMD commented 5 years ago

I have gotten fairly different clustering results when using svd_solver='arpack' in all but 1 case actually. The biological interpretation is still roughly the same, but the depth of subclustering you can do does differ. Based on a preliminary test, using arpack for all sc.pp.pca() calls does improve the reproducibility, although clustering results still differ (tested on Fedora 25 and Fedora 28, e.g. cluster sizes change by 100-200 cells). I can show you the differences when you're around next if you like. This is definitely a discussion for a different thread though.

falexwolf commented 5 years ago

Thanks for the heads up! Still, the reason for this is not related to the clustering algorithm, but the PCA. The clustering is greedy and will show large deviations upon slight changes of the graph, but otherwise, it's deterministic. I don't know whether there is a way around it...

LuckyMD commented 5 years ago

Unless you write your own system agnostic numerical libraries or run everything in a docker container... probably not. So just to clarify... you do expect using sc.pp.pca() with svd_solver='arpack' to not be 100% reproducible across systems as well then?

beetlejuice007 commented 2 years ago

adata.var["gene_name"]

Traceback (most recent call last): File "/sc/arion/work/gujarh01/software/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'gene_name'

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "", line 1, in File "/sc/arion/work/gujarh01/software/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py", line 3458, in getitem indexer = self.columns.get_loc(key) File "/sc/arion/work/gujarh01/software/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc raise KeyError(key) from err KeyError: 'gene_name'

Did something change in scanpy ?

maflot commented 2 years ago

I get an identical error as @hemantgujar