New features - Githubissues

sophietr commented 6 years ago

hi Alex here my list :) thanks a lot! and I might expand it....

'needed'

[x] concatenate multiple data sets
[ ] scale up sc.pp.regress_out, add some function for batch correction?
[x] additional heatmap annotation for cells/genes (e.g. .smp)
[ ] aga-graph: labels in pie charts are misleading (sometimes switched), maybe better to have just a legend with the colors than labeling every node.
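For readers unfamiliar with what regressing out a covariate means, here is a minimal sketch using ordinary least squares in NumPy. It is an illustration under stated assumptions, not the implementation behind sc.pp.regress_out: the function name `regress_out_ols` and the toy data are hypothetical, and the matrix is assumed to be dense with cells in rows and genes in columns.

```python
import numpy as np

def regress_out_ols(X, covariates):
    """Return residuals of each gene after a linear fit on the covariates.

    Hypothetical helper for illustration only; assumes a dense
    cells x genes matrix and one row of covariates per cell.
    """
    X = np.asarray(X, dtype=float)
    C = np.asarray(covariates, dtype=float)
    if C.ndim == 1:
        C = C[:, None]
    # Design matrix with an intercept column followed by the covariates.
    design = np.column_stack([np.ones(len(C)), C])
    # Solve one least-squares problem for all genes at once.
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    return X - design @ coef

# Toy data: two genes that depend linearly on a single covariate.
n_counts = np.array([1.0, 2.0, 3.0, 4.0])
X_toy = np.column_stack([2 * n_counts + 1, -n_counts + 5])
residuals = regress_out_ols(X_toy, n_counts)
```

Because the toy genes are exact linear functions of the covariate, the residuals are zero up to floating-point error.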

'nice to have'

[x] In scatterplot showing gene expression: plot points ordered by expression, i.e. cells with higher expression on top of cells with lower expression
[x] log transform: let user choose the base in sc.pp.log1p (natural log, log2 or log10)
[x] cell cycle scoring (and scoring for any other gene list)
[x] function and plots for basic qc metrics (n_counts, n_genes, CV, %drop out)
[ ] sc.pl.scatter also for genes (adata.var, e.g. to plot mean expression vs dropout rate)
[x] table for high scoring genes, in addition to sc.pl.rank_genes_groups
[x] additional differential expression test
[x] Heatmap for genes per cluster/sample/condition (not just along a path in pseudotime order) including custom sample and gene annotation
[ ] Maybe also an option for simplified visualization with just mean expression per cluster.
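Several of the items above (basic QC metrics, a selectable log base, drawing high-expression cells on top) can be sketched in a few lines of NumPy. The helper names below are hypothetical and the matrix is assumed to be a dense cells x genes count array; this is an illustration, not how scanpy implements these features.

```python
import numpy as np

def basic_qc_metrics(X):
    """Per-cell and per-gene QC summaries (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return {
        "n_counts": X.sum(axis=1),          # total counts per cell
        "n_genes": (X > 0).sum(axis=1),     # genes detected per cell
        "dropout": (X == 0).mean(axis=0),   # fraction of zero cells per gene
        # Coefficient of variation per gene; NaN where the mean is zero.
        "cv": np.divide(std, mean, out=np.full_like(mean, np.nan), where=mean > 0),
    }

def log1p_base(X, base=None):
    """log(1 + x) in the natural base, or in `base` if one is given."""
    out = np.log1p(np.asarray(X, dtype=float))
    return out if base is None else out / np.log(base)

def plot_order(expression):
    """Indices that draw low-expression cells first and high ones last."""
    return np.argsort(expression, kind="stable")

counts = np.array([[0, 2, 4],
                   [0, 0, 4]])
qc = basic_qc_metrics(counts)
```

Passing `plot_order(expr)` as an index into both the coordinates and the color values before plotting puts the highest-expressing cells on top.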

falexwolf commented 6 years ago

concatenate datasets: OK? link

>>> adata1 = AnnData(np.array([[1, 2, 3], [4, 5, 6]]),
...                  {'smp_names': ['s1', 's2'],
...                   'anno1': ['c1', 'c2']},
...                  {'var_names': ['a', 'b', 'c']})
>>> adata2 = AnnData(np.array([[1, 2, 3], [4, 5, 6]]),
...                  {'smp_names': ['s3', 's4'],
...                   'anno1': ['c3', 'c4']},
...                  {'var_names': ['b', 'c', 'd']})
>>> adata3 = AnnData(np.array([[1, 2, 3], [4, 5, 6]]),
...                  {'smp_names': ['s5', 's6'],
...                   'anno2': ['d3', 'd4']},
...                  {'var_names': ['b', 'c', 'd']})
>>>
>>> adata = adata1.concatenate([adata2, adata3])
>>> adata.X
[[ 2.  3.]
 [ 5.  6.]
 [ 1.  2.]
 [ 4.  5.]
 [ 1.  2.]
 [ 4.  5.]]
>>> adata.smp
   anno1 anno2  batch
s1    c1   NaN      0
s2    c2   NaN      0
s3    c3   NaN      1
s4    c4   NaN      1
s5   NaN    d3      2
s6   NaN    d4      2
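The behaviour shown above (keep only the variables shared by all data sets, take the union of the sample annotations, add a batch column) can be mimicked with plain pandas. This is a sketch of the idea only, assuming dense data held in DataFrames; it does not use or describe AnnData internals, and the variable names are made up for the example.

```python
import pandas as pd

# Expression tables: cells in rows, genes in columns (same toy values as above).
x1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=["s1", "s2"], columns=["a", "b", "c"])
x2 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=["s3", "s4"], columns=["b", "c", "d"])
x3 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=["s5", "s6"], columns=["b", "c", "d"])

# Per-cell annotations; the third data set uses a different column.
o1 = pd.DataFrame({"anno1": ["c1", "c2"]}, index=x1.index)
o2 = pd.DataFrame({"anno1": ["c3", "c4"]}, index=x2.index)
o3 = pd.DataFrame({"anno2": ["d3", "d4"]}, index=x3.index)

frames = [x1, x2, x3]

# Inner join on columns keeps only the genes present in every data set.
X = pd.concat(frames, join="inner")

# Outer join (the default) on annotations fills missing columns with NaN.
obs = pd.concat([o1, o2, o3], sort=False)

# Batch label: position of the source data set in the input list.
obs["batch"] = [i for i, f in enumerate(frames) for _ in range(len(f))]
```

Here only genes `b` and `c` survive, which matches the two-column `adata.X` printed above.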

falexwolf commented 6 years ago

recluster restricted to a subset of clusters: OK?

falexwolf commented 6 years ago

sc.pp.regress_out should be more stable now, but it's not faster yet

flying-sheep commented 6 years ago

hi, i converted the list of requests to a checkmark list.

now we can check off finished items

dpcook commented 6 years ago

Hi all!

I just wanted to jump in with @sophietr and say that implementing a cell cycle classification function like Seurat's CellCycleScoring function would be a nice addition to the preprocessing options. Would be valuable to keep an eye on in downstream exploration and could then be easily regressed out if needed.
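To make the request concrete, here is a simplified sketch of gene-list scoring and phase assignment. It is only loosely modeled on the idea behind Seurat's CellCycleScoring and makes several assumptions: the score is the mean expression of the gene list minus the mean of all remaining genes (no expression-matched control genes), the gene names are toy placeholders, and the function names are hypothetical.

```python
import numpy as np

def score_gene_list(X, var_names, gene_list):
    """Mean expression of `gene_list` minus mean of all other genes, per cell."""
    X = np.asarray(X, dtype=float)
    in_list = np.isin(np.asarray(var_names), gene_list)
    if not in_list.any() or in_list.all():
        raise ValueError("gene_list must match some, but not all, variables")
    return X[:, in_list].mean(axis=1) - X[:, ~in_list].mean(axis=1)

def assign_phase(s_score, g2m_score):
    """Pick the phase with the higher score; call G1 if both are negative."""
    phase = np.where(s_score > g2m_score, "S", "G2M")
    phase[(s_score < 0) & (g2m_score < 0)] = "G1"
    return phase

# Toy data: three cells, five placeholder genes (names are not real markers).
var_names = ["sA", "sB", "gA", "gB", "other"]
X_toy = np.array([[5, 5, 0, 0, 1],
                  [0, 0, 5, 5, 1],
                  [0, 0, 0, 0, 5]])
s_score = score_gene_list(X_toy, var_names, ["sA", "sB"])
g2m_score = score_gene_list(X_toy, var_names, ["gA", "gB"])
phase = assign_phase(s_score, g2m_score)
```

The two scores could then be stored as per-cell annotations and regressed out if needed.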

Also, do you guys have any opinions about the inclusion of imputation/smoothing strategies? I've been messing around with including it in analysis pipelines, but still haven't really settled on when to include them. If there's interest, MAGIC seems like a great option and is currently implemented in Python.

dawe commented 6 years ago

@falexwolf where would you expect a new scoring function? Under tools (sc.tl) or preprocessing (sc.pp)? I may contribute to this cell cycle thing if you haven't already

flying-sheep commented 6 years ago

MAGIC is outperformed by both scImpute and countae, so i guess we’ll stick with one of those if we implement imputation.

falexwolf commented 6 years ago

@dawe a cell cycle scoring function would be great! everything that's a bit more extensive and non-standard should go into sc.tl, everything that's really just simple preprocessing and stats with a few lines can go to sc.pp. usually, there should be a plotting function in sc.pl that presents a canonical visualization of the annotation added in with the tool... writing a test for your function would also be great ;)

falexwolf commented 6 years ago

@dpcook @flying-sheep regarding imputation: my personal view is that nothing is settled there - I can't judge what's a good method and what not and whether one should use it at all. so for now, we will not include any imputation method in scanpy. but it's easy to just apply any package you like to the data matrix in an AnnData object adata.X...
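As an illustration of applying an external method to the data matrix, the sketch below runs a function on a plain array standing in for adata.X. The `knn_smooth` function is a hypothetical stand-in (a naive average over nearest cells), not a method recommended in this thread, and a dense matrix is assumed.

```python
import numpy as np

def knn_smooth(X, k=2):
    """Replace each cell by the mean of its k nearest cells (itself included)."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances between cells.
    dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k closest cells for every cell.
    nearest = np.argsort(dist, axis=1)[:, :k]
    return X[nearest].mean(axis=1)

# Stand-in for adata.X: four cells, two genes.
X_dense = np.array([[0, 0],
                    [0, 2],
                    [10, 10],
                    [10, 12]])
X_smoothed = knn_smooth(X_dense, k=2)

# With an AnnData object one would assign the result back, e.g.:
# adata.X = knn_smooth(adata.X, k=2)
```

Any function that maps a cells x genes matrix to a matrix of the same shape can be swapped in the same way.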

dawe commented 6 years ago

@falexwolf I have the functions in my scanpy branch right now. They seem to be working properly (take a look, if you want to). I'll add the tests as soon as possible (now getting back to "ordinary work")

hammer commented 6 years ago

@flying-sheep can you cite a reference for scImpute and countae outperforming MAGIC? I'd be curious to learn which hyperparameter optimization methods and performance measures were used in the benchmark.

flying-sheep commented 6 years ago

well, @gokceneraslan told me. Gökcen, is the preprint for countae online? It should contain what @hammer asked for, right?

gokceneraslan commented 6 years ago

Yes, it'll be out soon. It's very difficult to compare imputation/denoising methods, but we have some comparisons in the paper.

hammer commented 6 years ago

@flying-sheep @gokceneraslan great! I agree it's hard to compare these algorithms as the performance of an imputation strategy often depends on the downstream use case. I'm looking forward to checking out the countae preprint. I find the scVI benchmark of imputation methods to be useful for now.

Zethson commented 1 year ago

I'm closing this because it's a very old catch-all issue. If somebody in this thread is still lacking functionality in the broader scverse ecosystem or in scanpy directly, I'd encourage them to open new issues to discuss specifics.

scverse / scanpy

New features #45