Rethink group IDs in rank_genes_groups

flying-sheep commented 6 years ago

rank_genes_groups “returns” two recarrays, each with the shape #cells×#groups. one of them stores gene IDs, one the genes’ scores.

the problem with this is that recarrays store their column index (names) in the dtype, in a place where only strings are accepted. however users (and indeed both our wilcoxon example and the tests) may choose to use numeric group IDs.

genes with score 0 are unimportant anyway, so maybe we should return sparse data, in the form of a long-form recarray with something like this shape (with <group_by> being the rank_genes_groups parameter of the same name):

obs	var		score
0	ENSGXXXX	5	9.728
…	…	…	…

This way the three IDs can have user-defined types, and the data is easier to process via e.g. pd.DataFrame.fromrecords(adata.obs['gene_ranking'])

The data should probably be sorted by descending z-scores by group, i.e. if it was a DataFrame: return gene_ranking.groupby(group_by).sort_values('score')

falexwolf commented 6 years ago

yes, I know, that's non-ideal... the sparseness issue is circumvented by only returning top-scoring genes... I see that you make suggestions for how the user can get dataframes but I tend to say that he shouldn't have to do some extra work for this. i think we should continue to return a table with groups vs. top-scoring genes. this is also what all others (Seurat, Pagoda, ...) do and what, I guess, feels most intuitive. a sparse object is likely to confuse users.

if we start changing this, we should also talk to @mbuttner, who has written a function for transforming the recarrays to a single dataframe to write them to a csv or xls file and send it out to collaborators... we should also talk to @tcallies, who worked a lot on rank_genes_groups

our current workflow often involves showing collaborators tables of marker genes for different cell groups. these can get quite long as, e.g., transcription factors are not much differentially expressed, hence not top-scoring and appear further down the tabular. the tabular therefore has to be easily inspectable. currently, you can quickly turn a single rearray into a dataframe as shown here. rank_genes_groups returns a recarray for historical reasons: there is a simple hdf5-backing via the recarray. these days, since the hdf5-backing of categorical data types within anndata works well, we could think about returning a dataframe directly. i guess this would be the way to go requiring only minor modifactions in that the hdf5-backing also accepts dataframes in .uns and not only in .obs and .var.

very generally: I think that it would be a decent convention to only allow strings to denote groups/categories. this was also the convetion before using dataframes for the annotation. now we use the category dtype of pandas, which - in contrast to R - allows arbitrary data types for denoting categories. I don't see much advantage of this flexibility but we should probably stick with it. hence, another argument for simply using a dataframe and putting it in the unstructured annotation .uns.

flying-sheep commented 9 months ago

As an intermediate solution, we could

implement https://github.com/scverse/anndata/issues/679
write recarrays as dataframes
make sure tests run successfully on anndata objects where adata.uns["rank_genes_groups_filtered"]["names"] has been converted into a DataFrame

scverse / scanpy

Rethink group IDs in rank_genes_groups #61