scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.

https://scanpy.readthedocs.io

BSD 3-Clause "New" or "Revised" License

1.93k stars 604 forks source link

Geneset based clustering and CytOF Compatibility #510

Open biskra opened 5 years ago

biskra commented 5 years ago

I want to start off by mentioning how much I love scanpy. I recommend this package to everyone I know who is doing single cell sequencing. I love how PAGA and UMAP can be integrated together. PAGA makes appreciation of data topology so simple. In addition, it's so easy to do velocity analysis with the scvelo integration. Installing scanpy as well as hdf5/loom compatibility is remarkably easier on python than in R, which gives scanpy users an obvious advantage.

I've learned so much using this package and it has allowed me to display my data in a simple and intuitive way. Truly, from the bottom of my heart, thank you, this package is awesome and I deeply appreciate all the work the scanpy team and Theis lab has put into it.

With that said, I find myself wanting to do things with scanpy that aren't currently supported.

(1) Clustering cells based on a restricted set of genes as opposed to principle components: I understand that this is a biased approach, but I find myself wanting to do this to explore datasets and how genes vary together. I believe it would aid in understanding data topology to reduce the dimensionality of the dataset by looking at a subset of key genes that are picked by an expert who has domain knowledge. I think this is best integrated with a separate sc.pp.neighbors-like function that uses a gene list, as opposed to PCs. Integration into downstream plotting can be done with separate functions that have a similar name (sc.tl.umap_genelist or sc.tl.umap_sparse; sc.tl.paga_genelist or sc.tl.paga_sparse).

I envision that this would also be a nice stepping stone towards multi-modal analysis with CITE-seq.

(2) I want to use PAGA and UMAP to display my CytOF data as well as integrate scRNA-seq/CytOF data together. I saw a pre-print on bioRxiv that indicated that you are working on this aspect of PAGA/scanpy. This was very exciting to me and I am glad that this is being developed. Is it possible to access the current version of this software?

I think that major partitions identified with CytOF or scRNA-seq can be linked together providing a coarse-grained mechanism to demonstrate how heterogeneity identified with each technique relates to each other based on a given experimental time point.

(3) Histogram integration in the plotting api for QC metrics would be helpful. While scatter plots and violin plots are effective, I find myself wanting to make cuts (bounds on percent mitochondrial or nGenes) in my data based off histograms.

Again, thank you so much for the amazing software!

falexwolf commented 5 years ago

Thank you for the super-kind words, @biskra.

(1) I agree it's useful! Why don't you subset adata to the genes that are interesting for you and run an embedding and clustering on them? That's definitely valid. You can keep everything else in .raw. I personally would either set the highly_variable_genes annotation to False for genes that I'm not interested in after calling pp.highly_variable_genes. As an effect, the pca will be computed on those and you can propagate this further. If you don't want a pca, you can add a gene-set representation as a representation to adata via

adata.obsm['X_geneset1`] = adata[['gene1', 'gene2', 'gene3', 'gene4']].X
sc.pp.neighbors(adata, use_rep='X_geneset1')

(2) We don't have any multi-modal features in Scanpy yet. There will very likely be some in the near future, but right now, you need to write custom code for it. Sorry.

(3) Would you make a PR about it?

Thanks for the super constructive issue!

biskra commented 5 years ago

Thank you so much for the help, @falexwolf! I will give those solutions a shot. It seems like I can apply a similar solution for CytOF data. I think with your provided solution, I can work my way to multi-modal analysis with custom code.

I will get to work on the PR!

ivirshup commented 5 years ago

@biskra Just my two cents on (3), I mostly use histograms or hex bins for viewing qc metrics. I've been meaning to write this up as a tutorial for a while, but I think you can lean on whole pyviz ecosystem pretty heavily here.

Something static and pretty:

# Joint distribution with marginal histograms:
sns.jointplot(
    x="log1p_total_counts",
    y="log1p_n_genes_by_counts",
    data=adata.obs,
    kind="hex"
)

# For a single variable
sns.distplot(adata.obs["log1p_total_counts"])

# To facet the distplot by some columns of obs, like "batch":
g = sns.FacetGrid(adata.obs, row="batch", height=1.7, aspect=3) 
g.map(sns.distplot, "pct_counts_mito") # Sadly, this doesn't work for joint grids
# I do have a PR for a ridge plot based on this in the works

If you'd like it to be interactive:

import hvplot.pandas

adata.obs.hvplot.hexbin("log1p_total_counts", "log1p_n_genes_by_counts")
adata.obs.hvplot.hist("log1p_total_counts")

# Faceting
adata.obs.hvplot.hist("pct_counts_mito", by="batch", subplots=True).cols(1)

Or, if you want to play with cool most-biggest-data toys:

adata.obs.hvplot.scatter(
    "log1p_total_counts",
    "log1p_n_genes_by_counts",
    datashade=True,
    dynspread=True
)

Olivia117 commented 5 years ago

Thanks a lot for the useful comments & I am a great admirer of scanpy, @falexwolf

I am trying to cluster cells based on specific gene sets (same as @biskra). I am working with loom files & after the data preprocessing and finding out highly variable genes, I have subgrouped the HVG(s) into five different categories by functional annotation/pathway enrichment analysis. Then I tried to subset 'adata' to the gene group I am interested in to carry out the embedding & clustering:

adata = adata[:, adata.var['highly_variable']]

From the highly variable genes, let's say I want to use Gene1, Gene2,... Gene500 for the Louvain clustering instead of the PCA. Then, I tried to do what you suggested before:

I have nothing stored under adata.obsm

adata.var['highly_variable'] = adata[['gene1', 'gene2', 'gene3', 'gene4']].X

I am getting this error despite the fact that these genes are in the HVG list:

KeyError: "None of [Index(['Map7d1', 'Ndufa2', 'Klc2', 'Slc35b2'], dtype='object')] are in the [index]"

If this works, then the community graph can be computed: sc.pp.neighbors(adata, use_rep='highly_variable')

I shall be grateful if you can help.

LuckyMD commented 5 years ago

Hi @Olivia117,

Let's see if I can help. I think there are a few misunderstandings here. It appears that you are mixing the adata.var['highly_variable'] approach with the adata.obsm['X_geneset1'] approach Alex suggested.

Firstly, there is a typo in Alex' code above. It should read:

adata.obsm['X_geneset1'] = adata[:,['gene1', 'gene2', 'gene3', 'gene4']].X
sc.pp.neighbors(adata, use_rep='X_geneset1')

I believe. Your error is due to this typo. The command is interpreting 'Map7d1' as a cell index rather than a gene index.

However, there are also a few other things.

adata.var['highly_variable'] takes a boolean list, so you should assign e.g., [True, True, False, False] if you are interested in only the first two genes out of a total of 4 genes in the dataset. This can be trivially extended to select your Gene1, Gene,... Gene500 that you are interested in. When using this approach you will need to run sc.pp.pca(adata, svd_solver='arpack', use_highly_variable=True) and sc.pp.neighbors(adata) before clustering with louvain or leiden. This approach subsets to your genes of interest, then performs PCA on this gene subset, and builds a KNN graph based on Euclidean distances in this PCA space, which is then used for clustering.
If you don't want to use the route via PCA, you need to assign to adata.obsm as Alex suggests (with my typo correction above). Even if you do not have anything in adata.obsm, it should still work. If you want to put something in adata.obsm, just run sc.pp.pca(adata, svd_solver='arpack') and you will see adata.obsm['X_pca'] appear.

Hope this helps.

Olivia117 commented 5 years ago

Hi @LuckyMD,

Many thanks for your comments.

The route via PCA followed by clustering & embedding (UMAP/tSNE) works perfectly fine for me. I have also got some interesting results from the analysis. Now, I want to try clustering cells with specific gene sets instead of the conventional dimensional reduction. Yes, I tried the following lines before:

adata.obsm['X_geneset1`] = adata[:, ['gene1', 'gene2', 'gene3', 'gene4']].X

It still says, KeyError: 'Indices "[\'Ada\', \'Mustn1\', \'Mlc1\', \'Gfra\', \'Gm765\', \'Csrp2\', \'Socs2\', \'Dnajb9\']" contain invalid observation/variables names/indices.'

All of these genes are present in my dataset. I am still trying to figure out why this is happening :/ Maybe, I will paste the short code snippet later.

P.S: Sorry for getting off the subject. Is there an alternative normalization step included apart from the log-normalization method? For example, TMM in edgeR & SCnorm- that uses quantile regression to calculate the dependence of read counts on sequencing depth for each gene (when count-depth relationship varies among genes).

LuckyMD commented 5 years ago

Are you sure that the genes are in adata.var_names in the gene symbol format that you are using to subset the object? In other words, is 'Ada' in adata.var_names True? I'd just like to check whether you don't have e.g., Ensembl IDs as your variable names by chance.

Regarding normalization... there are other normalization methods. I believe a method was recently added to scanpy to use only a particular fraction of genes to calculate size factors (avoiding genes that make up >5% of the total counts). Otherwise, we have recently compiled a best practices pipeline in the group, which uses Scran's pooling strategy to normalize the data. This is implemented in R, but can easily be used in a python-based workflow via anndata2ri. A case study using the best practices (with scran and anndata2ri) is available here.

Olivia117 commented 5 years ago

Hello @LuckyMD,

Yeah, the genes are all present in adata.var_names HVG_300419

`adata.var_names

Index(['Ada', 'Mustn1', 'Mlc1', 'Gfra1', 'Rspo1', 'Gda', 'Car8', 'Ly86', 'Kcnmb1', 'Lgals3', ... 'Fam155a', 'Rnd2', 'Snx29', 'Map7d1', 'Unc80', 'Ttc17', 'Sncb', 'Esrra', 'Enpp5', 'Kctd1'], dtype='object', length=1573) ` I copy-pasted the symbols individually & it works now! I am able to do the louvain community clustering & visualize with UMAP.

`adata.obsm['X_geneset1'] = adata[:,['Ada', 'Mustn1', 'Mlc1', 'Gfra1', 'Rspo1', 'Gda', 'Car8', 'Ly86', 'Fam155a', 'Rnd2', 'Snx29']].X sc.pp.neighbors(adata, use_rep='X_geneset1') sc.tl.umap(adata) sc.pl.umap(adata) sc.tl.louvain(adata) sc.pl.umap(adata, color= ['louvain'])

Thanks for the suggestions regarding normalization with scran & anndata2ri. I shall check it out :)

falexwolf commented 5 years ago

I believe a method was recently added to scanpy to use only a particular fraction of genes to calculate size factors (avoiding genes that make up >5% of the total counts).

Yes, it's https://scanpy.readthedocs.io/en/latest/api/scanpy.pp.normalize_total.html with param fraction=0.95.

Olivia117 commented 5 years ago

Thanks a lot, @falexwolf.

I have one more question. Suppose, I have clustered cells with a restricted gene set & arrived at 8 Louvain clusters (considering one like the example above).

adata.obsm['X_geneset1] = adata[:, ['Tmem72', 'Adgrg2', 'Scn4b', 'Ust', 'Htr1d', 'Prnp', 'Rom1', 'Gpr158', 'Robo2', 'Cntn2', 'Rab11fip5', 'Lhfpl2', 'Emb', 'Gabrg2', 'Ptk7', 'Prkar1a', 'Ehd3', 'Kitl', 'Slc6a19', 'Spry4', 'Nceh1', 'Lin7a',....,'Lpin1']].X`

Input_subgroup_030519

Now, for marker identification, I have ranked genes in each community using Logistic regression & Wilcoxon. Naturally, there can be genes with a very high score (those were not used to cluster the cells). Input_CMarkers_LogReg

Now, if I find a few genes in a cluster that acts as contradictory markers (like, the markers for vascular cells in a dendritic cluster) will I be able to eliminate contaminating cells from the group that expresses those vascular genes? To simply put it, I want to cluster genes with ['gene1', 'gene2', 'gene3', 'gene4',..'gene400'] but I don't want to include cells that express ['gene10' AND 'gene35' AND 'gene100'..etc]. Will I be able to omit all such cells & rerun the cluster? I can do one gene at a time.

ldata_modified = ldata[ldata[: , 'Rpl13'].X > 0.5, :]

falexwolf commented 5 years ago

You can construct arbitrarily complicated conditions of excluding cells by doing, e.g., subset = np.logical_and(ldata[: , 'A'].X > a, ldata[: , 'A'].X > a).

However, I would question to consider cells in a cluster that express just a couple of genes in an unexpected way as "contaminating". In the worst case, you're looking at a cluster of doublets, but then you probably need to throw away the whole cluster. In the best case, you found some interesting biology.