Open flying-sheep opened 6 years ago
yes, I know, that's non-ideal... the sparseness issue is circumvented by only returning top-scoring genes... I see that you make suggestions for how the user can get dataframes but I tend to say that he shouldn't have to do some extra work for this. i think we should continue to return a table with groups vs. top-scoring genes. this is also what all others (Seurat, Pagoda, ...) do and what, I guess, feels most intuitive. a sparse object is likely to confuse users.
if we start changing this, we should also talk to @mbuttner, who has written a function for transforming the recarrays to a single dataframe to write them to a csv or xls file and send it out to collaborators... we should also talk to @tcallies, who worked a lot on rank_genes_groups
our current workflow often involves showing collaborators tables of marker genes for different cell groups. these can get quite long as, e.g., transcription factors are not much differentially expressed, hence not top-scoring and appear further down the tabular. the tabular therefore has to be easily inspectable. currently, you can quickly turn a single rearray into a dataframe as shown here. rank_genes_groups
returns a recarray for historical reasons: there is a simple hdf5-backing via the recarray. these days, since the hdf5-backing of categorical data types within anndata works well, we could think about returning a dataframe directly. i guess this would be the way to go requiring only minor modifactions in that the hdf5-backing also accepts dataframes in .uns
and not only in .obs
and .var
.
very generally: I think that it would be a decent convention to only allow strings to denote groups/categories. this was also the convetion before using dataframes for the annotation. now we use the category dtype of pandas, which - in contrast to R - allows arbitrary data types for denoting categories. I don't see much advantage of this flexibility but we should probably stick with it. hence, another argument for simply using a dataframe and putting it in the unstructured annotation .uns
.
As an intermediate solution, we could
adata.uns["rank_genes_groups_filtered"]["names"]
has been converted into a DataFrame
rank_genes_groups
“returns” two recarrays, each with the shape #cells×#groups. one of them stores gene IDs, one the genes’ scores.the problem with this is that recarrays store their column index (names) in the dtype, in a place where only strings are accepted. however users (and indeed both our wilcoxon example and the tests) may choose to use numeric group IDs.
genes with score 0 are unimportant anyway, so maybe we should return sparse data, in the form of a long-form recarray with something like this shape (with
<group_by>
being therank_genes_groups
parameter of the same name):This way the three IDs can have user-defined types, and the data is easier to process via e.g.
pd.DataFrame.fromrecords(adata.obs['gene_ranking'])
The data should probably be sorted by descending z-scores by group, i.e. if it was a DataFrame:
return gene_ranking.groupby(group_by).sort_values('score')