Closed sebpott closed 5 years ago
Hey @sebpott. That feature was implemented after the 1.3.7 release, so it should work if you use the development version or wait until the next release.
thanks @ivirshup, will do!
@ivirshup Just to clarify... would you not have to have 'Tnnt2'
either as a column in .obs
or in .var_names
if you want to use it with the color
parameter? Or does gene_symbols
also allow you to look for a color
parameter in the gene_symbols
.var
column?
Yes (definitely) and yes (I think)
Passing an argument for gene_symbols
means that instead of searching .var_names
, the column of .var
whose name was passed will be searched.
For example, if you had an AnnData object adata
with ensembl ids as adata.var_name
, and hgnc symbols under the column adata.var[“gene_name”]
, the following calls should plot similar things (different titles):
sc.pl.umap(adata, color=[“ENSG00000261371”])
sc.pl.umap(adata, color=[“PECAM1”], gene_symbols=“gene_name”)
That's really cool! A nice additional functionality to what was first introduced in pl.scatter
.
Hello, I'm having a bit of trouble with this. I know the issues is closed, but I thought it might be better to continue this discussion rather than start a new one, though I can do that if you prefer. I have an AnnData object adata
with ensembl ids as adata.var_name
and mouse gene symbols under the column adata.var[“gene_name”]
. When I call:
sc.pl.umap(adata, color=['ENSMUSG00000074637'])
It plots no problem. However, when I call:
sc.pl.umap(adata, color=['Sox2'], gene_symbol='gene_name')
I get the following error:
Traceback (most recent call last):
File "<ipython-input-559-05c51c5cc5d6>", line 1, in <module>
sc.pl.umap(adata, color=['Sox2'], gene_symbol='gene_name')
File "/anaconda3/lib/python3.6/site-packages/scanpy/plotting/_tools/scatterplots.py", line 29, in umap
return plot_scatter(adata, basis='umap', **kwargs)
File "/anaconda3/lib/python3.6/site-packages/scanpy/plotting/_tools/scatterplots.py", line 275, in plot_scatter
use_raw=use_raw, gene_symbols=gene_symbols)
File "/anaconda3/lib/python3.6/site-packages/scanpy/plotting/_tools/scatterplots.py", line 670, in _get_color_values
.format(value_to_plot, adata.obs.columns))
ValueError: The passed `color` Sox2 is not a valid observation annotation or variable name. Valid observation annotation keys are: Index(['timepoint', 'replicate_id', 'n_genes', 'percent_mito', 'n_counts',
'louvain'],
dtype='object')
Inspecting adata.var["gene_name"] give:
index
ENSMUSG00000002459 Rgs20
ENSMUSG00000033740 St18
ENSMUSG00000067879 3110035E14Rik
ENSMUSG00000025912 Mybl1
ENSMUSG00000016918 Sulf1
ENSMUSG00000025938 Slco5a1
ENSMUSG00000025930 Msc
ENSMUSG00000025921 Rdh10
ENSMUSG00000025777 Gdap1
ENSMUSG00000025776 Crispld1
ENSMUSG00000025927 Tfap2b
ENSMUSG00000025931 Paqr8
ENSMUSG00000026158 Ogfrl1
...
I'm not sure what I'm doing wrong here. I can do just about anything using the ensembl ids, but I am having a lot of trouble using the gene symbols. I would like to be able to use the gene symbols in the plots for umap, violin, pca, etc. Any help would be much appreciated. Thanks!
should be gene_symbols in plural.
On Thu, Mar 14, 2019 at 9:46 AM csijcs notifications@github.com wrote:
Hello, I'm having a bit of trouble with this. I know the issues is closed, but I thought it might be better to continue this discussion rather than start a new one, though I can do that if you prefer. I have an AnnData object adata with ensembl ids as adata.var_name and mouse gene symbols under the column adata.var[“gene_name”]. When I call: sc.pl.umap(adata, color=['ENSMUSG00000074637']) It plots no problem. However, when I call: sc.pl.umap(adata, color=['Sox2'], gene_symbol='gene_name') I get the following error:
Traceback (most recent call last):
File "
", line 1, in sc.pl.umap(adata, color=['Sox2'], gene_symbol='gene_name')
File "/anaconda3/lib/python3.6/site-packages/scanpy/plotting/_tools/scatterplots.py", line 29, in umap
return plot_scatter(adata, basis='umap', **kwargs)
File "/anaconda3/lib/python3.6/site-packages/scanpy/plotting/_tools/scatterplots.py", line 275, in plot_scatter
use_raw=use_raw, gene_symbols=gene_symbols)
File "/anaconda3/lib/python3.6/site-packages/scanpy/plotting/_tools/scatterplots.py", line 670, in _get_color_values
.format(value_to_plot, adata.obs.columns))
ValueError: The passed
color
Sox2 is not a valid observation annotation or variable name. Valid observation annotation keys are: Index(['timepoint', 'replicate_id', 'n_genes', 'percent_mito', 'n_counts','louvain'], dtype='object')
Inspecting adata.var["gene_name"] give:
index
ENSMUSG00000002459 Rgs20
ENSMUSG00000033740 St18
ENSMUSG00000067879 3110035E14Rik
ENSMUSG00000025912 Mybl1
ENSMUSG00000016918 Sulf1
ENSMUSG00000025938 Slco5a1
ENSMUSG00000025930 Msc
ENSMUSG00000025921 Rdh10
ENSMUSG00000025777 Gdap1
ENSMUSG00000025776 Crispld1
ENSMUSG00000025927 Tfap2b
ENSMUSG00000025931 Paqr8
ENSMUSG00000026158 Ogfrl1
...
I'm not sure what I'm doing wrong here. I can do just about anything using the ensembl ids, but I am having a lot of trouble using the gene symbols. I would like to be able to use the gene symbols in the plots for umap, violin, pca, etc. Any help would be much appreciated. Thanks!
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/theislab/scanpy/issues/455#issuecomment-472756442, or mute the thread https://github.com/notifications/unsubscribe-auth/AEu_1RNhQOjQK5W1y5e32ifr1ltXXdmtks5vWgxygaJpZM4aczVa .
Ok great, thanks.
Similarly, I still can't call by gene name with sc.pl.violin
sc.pl.violin(adata, ['ENSMUSG00000074637'], groupby='louvain')
works, but not
sc.pl.violin(adata, ['Sox2'], groupby='louvain')
Another thing that would be helpful is if sc.tl.rank_genes_groups
would use gene_symbols instead of the Ensembl ID. Is it possible?
It seems like these re getting lost in the Louvain grouping, but I can't see in the documentation how to retain or recall them. Would appreciate your help. Thanks!
@fidelram knows more about the plotting code than me, but here's some recommendations:
rank_genes_groups
take gene_symbols
arguments. See the docs for functions like: sc.pl.rank_genes_groups_*
gene_symbols
argument is recent, and is gradually being added to functions.var_names
to be gene symbols. I prefer keeping var_names
as unique, canonical identifiers, but sometimes do this. I tried to set var_names
from gene_symbols, and I get a warning message:
Variable names are not unique. To make them unique, call
.var_names_make_unique.
In calling adata.var_names_make_unique()
I get the error:
TypeError: unsupported operand type(s) for +: 'float' and 'str'
I can ignore this and take it through most of the analysis and am able to make the plots and rank the genes by name, however, I am unable to save. Calling adata.write('./write/adata.h5ad')
gives the following error:
File "pandas/_libs/src/inference.pyx", line 1472, in pandas._libs.lib.map_infer
TypeError: object of type 'float' has no len()
Also, the clustering is slightly different, I'm guessing from not having unique gene names. I've looked through the documentation for sc.pl.rank_genes_groups_*
and cannot figure out how to keep the index as the Ensembl gene ID and just use gene_symbols to call the plots (sc.pl.violin
, etc.) and use the sc.tl.rank_genes_groups
.
I have noticed that clustering is always slightly different on different systems... even when using exactly the same code. This will have to do with the underlying numerical libraries I assume. I don't think that has anything to do with gene names.
The slight difference is due to scikit learn's implementation of PCA. If the top PCs are enough for you, you get perfect reproducibility, as in the tutorials clustering and trajectory inference. Higher PCs are subject to the instability of the underlying iterative methods for computing them. You'll always see slight inconsistencies. However, I've never seen this to affect any conclusion drawn from an analysis.
I have gotten fairly different clustering results when using svd_solver='arpack'
in all but 1 case actually. The biological interpretation is still roughly the same, but the depth of subclustering you can do does differ. Based on a preliminary test, using arpack for all sc.pp.pca()
calls does improve the reproducibility, although clustering results still differ (tested on Fedora 25 and Fedora 28, e.g. cluster sizes change by 100-200 cells). I can show you the differences when you're around next if you like. This is definitely a discussion for a different thread though.
Thanks for the heads up! Still, the reason for this is not related to the clustering algorithm, but the PCA. The clustering is greedy and will show large deviations upon slight changes of the graph, but otherwise, it's deterministic. I don't know whether there is a way around it...
Unless you write your own system agnostic numerical libraries or run everything in a docker container... probably not. So just to clarify... you do expect using sc.pp.pca()
with svd_solver='arpack'
to not be 100% reproducible across systems as well then?
adata.var["gene_name"]
Traceback (most recent call last): File "/sc/arion/work/gujarh01/software/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'gene_name'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "
Did something change in scanpy ?
I get an identical error as @hemantgujar
I would like to color the umap representation using gene expression values. For ease of use I'd like to display the Gene name instead of gene_id which are the adata.var_names in my case. Setting
gene_symbols = 'Symbol'
doesn't seem to work for me or I am using it the wrong way.When running
sc.pl.umap(adata, gene_symbols = 'Symbol', color = ['Tnnt2'])
I get the follwoing error message:
adata.var contains the column "Symbol" and "Tnnt2" is present:
adata.var[adata.var['Symbol'] == 'Tnnt2']
run with:
scanpy==1.3.7 anndata==0.6.17 numpy==1.14.6 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.19.1 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1