scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

More hierarchical API #1739

Open · gokceneraslan opened this issue 3 years ago

gokceneraslan commented 3 years ago

Hi all,

Right now we have two layers in the scanpy API. The top layer consists of the major modules like pp, pl, and tl, as well as smaller ones like queries, get, and datasets. In addition, we have some useful functions directly under the scanpy package, like read/read_text/read_mtx. The field is clearly advancing: alternative and better ways to perform fundamental downstream-analysis tasks (e.g. normalization, DE tests, gene selection) are emerging and will continue to emerge, which necessitates an expansion of the scanpy API. However, I argue that flat top-level modules make it difficult to extend scanpy while keeping the API reasonable.

Right now there are two ways to introduce new functionality (assuming it's not something completely unrelated):

1) add a new flavor/method to an existing function (e.g. sc.pp.highly_variable_genes, sc.tl.rank_genes_groups) or

2) add a new function with a shared prefix, e.g. sc.pp.neighbors_tsne (see https://github.com/theislab/scanpy/pull/1561), sc.pp.normalize_pearson_residuals (see https://github.com/berenslab/umi-normalization/issues/1), or sc.pp.normalize_pearson_residuals_pca() (see #1715); a quick contrast of the two call styles is sketched below.
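
For illustration, here is roughly how the two styles look from a user's perspective. This is only a sketch: the Pearson-residuals function name follows the linked PRs and may not exist under that exact path in a released scanpy.

```python
import scanpy as sc

adata = sc.datasets.pbmc3k()  # raw counts, for illustration only

# Option 1: extend an existing function with a new flavor/method argument
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=2000)

# Option 2: add a separately named function with a shared prefix
# (name as proposed in the linked PRs; hypothetical here)
sc.pp.normalize_pearson_residuals(adata)
```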

Since option 1 is more complicated in terms of managing the arguments (especially method-specific ones), I believe we now tend to go for option 2. But given that we already have many functions with common prefixes, and that shifting towards option 2 will keep introducing functions with long underscored names, the top layers will only get flatter and wider. Therefore, I think it's time to consider a third option: adding another layer that makes the API a tiny bit more hierarchical.

Some examples I can think of are:

sc.read.{adata,csv,text,mtx,excel,loom,h5_10x,mtx_10x,...}
sc.pp.neighbors.{umap,gauss,rapids,tsne}
sc.pp.hvg.{seurat,seurat_v3,dispersion}
sc.pp.norm.{tpm,pearson}
sc.pp.filter.{genes,cells,rank_genes,...}
sc.tl.rank_genes.{logreg,wilcoxon,ttest}
sc.tl.cluster.{leiden,louvain}
sc.tl.score.{genes,cell_cycle}
sc.pl.rank_genes.{dotplot,matrixplot,...}
sc.pl.groups.{dot,matrix,violin,...}
sc.pl.embed.{umap,tsne,pca,...}
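
To make the idea more concrete, here is a minimal sketch of how one such submodule could be laid out internally. The file names and signatures are hypothetical, not an agreed design:

```python
# Hypothetical layout for sc.tl.rank_genes:
#
#   scanpy/tools/rank_genes/
#       __init__.py     re-exports one function per method
#       _wilcoxon.py    method-specific arguments stay local to each file
#       _logreg.py
#       _ttest.py

# scanpy/tools/rank_genes/_wilcoxon.py (sketch)
from anndata import AnnData


def wilcoxon(
    adata: AnnData,
    groupby: str,
    *,
    use_raw: bool = True,
    key_added: str = "rank_genes_groups",
) -> None:
    """Rank genes per group with a Wilcoxon rank-sum test.

    Results would be written to adata.uns[key_added], as today.
    """
    ...


# scanpy/tools/rank_genes/__init__.py (sketch)
# from ._logreg import logreg
# from ._ttest import ttest
# from ._wilcoxon import wilcoxon
# __all__ = ["logreg", "ttest", "wilcoxon"]
```

A user would then call sc.tl.rank_genes.wilcoxon(adata, "louvain") instead of sc.tl.rank_genes_groups(adata, "louvain", method="wilcoxon").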

There are a few issues I can think of:

  1. I can imagine some resistance from some developers due to losing a few milliseconds typing extra characters 😄 but if you consider the long-term effects of option 2, I think this might actually save you time 😛

  2. What happens to the functions that do not fit this scheme, like sc.pp.combat, sc.tl.ingest/dpt/paga/etc., or sc.pl.* (maybe plotting functions with a groupby argument could go under sc.pl.groups.*)? I am not entirely sure; one option is to keep them as they are, another is to give them "singular" modules so that everything lives in the third layer.

  3. It will be harder to specify the "default" (i.e. somewhat recommended) method with this scheme. When we add a new flavor/method to an existing function, we can still have a default (e.g. highly_variable_genes(flavor='seurat')), which makes things easier for new users, but here there is no obvious equivalent; one speculative workaround is sketched below.
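
Regarding issue 3, one possible (purely speculative) workaround would be to make each submodule itself callable and forward the call to the recommended method, using the supported trick of swapping a module's __class__ for a ModuleType subclass. A minimal self-contained sketch, with a stub standing in for the per-flavor implementation:

```python
# scanpy/preprocessing/hvg/__init__.py (hypothetical sketch)
import sys
from types import ModuleType


# In the real package this would be a relative import, e.g.
#   from ._seurat import seurat
# A stub stands in here so the sketch is self-contained.
def seurat(adata, **kwargs):
    """Stand-in for the hypothetical sc.pp.hvg.seurat implementation."""
    ...


class _CallableModule(ModuleType):
    """Let `sc.pp.hvg(adata)` act as shorthand for the recommended flavor."""

    def __call__(self, adata, **kwargs):
        # The recommended default lives in exactly one place.
        return seurat(adata, **kwargs)


# Supported since Python 3.5: a module's class can be swapped at runtime,
# which makes the submodule itself callable.
sys.modules[__name__].__class__ = _CallableModule
```

New users could then keep calling sc.pp.hvg(adata) without choosing a method, while the per-method functions stay discoverable under the same namespace.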

What do you think?

LuckyMD commented 3 years ago

This sounds interesting, and definitely makes things cleaner in the long run... but a big issue, I think, would be backward compatibility for everything that relies on Scanpy. Also, I wonder if this makes things a bit more difficult for new users, as they would need to know what steps a single-cell analysis pipeline requires in order to understand the organization.
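
On the backward-compatibility point, a conventional mitigation (not something proposed here by the scanpy team, just a common pattern) would be to keep the old flat names as thin deprecated wrappers for a few releases. A hypothetical sketch, assuming the hierarchical layout from the proposal:

```python
# scanpy/preprocessing/__init__.py (hypothetical sketch)
import warnings

from . import hvg  # the new hierarchical submodule from the proposal


def highly_variable_genes(adata, *, flavor="seurat", **kwargs):
    """Deprecated flat entry point kept for backward compatibility."""
    warnings.warn(
        "sc.pp.highly_variable_genes is deprecated; "
        "use sc.pp.hvg.<flavor> instead.",
        FutureWarning,
        stacklevel=2,
    )
    # Forward to the corresponding per-flavor function, e.g. sc.pp.hvg.seurat
    return getattr(hvg, flavor)(adata, **kwargs)
```

Old pipelines would keep running (with a warning) while documentation and tutorials migrate to the hierarchical names.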