Open ivirshup opened 2 years ago
@Intron7 said he had experience with this and it’s a really good way to do things fast with dask etc.
@Intron7 still waiting for your comment here!
Yes, we already have a good mask for sparse scaling. Boolean arrays are very effective for indicating where computations should be performed, as they eliminate the need for copying and reintegration.
One clear example is the `tl.score_genes` function. Using boolean masks there for the nanmean is a lot more efficient, though less Pythonic.
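As a rough illustration of the masked nanmean idea, here is a minimal NumPy sketch. It is not the code used in `tl.score_genes`; the helper `masked_gene_mean` and the toy data are assumptions made for this example. The mask is passed to the reduction through `where=`, so no float copy of the selected columns is created.

```python
import numpy as np


def masked_gene_mean(X, gene_mask):
    """Per-cell mean over masked genes, skipping NaNs (hypothetical helper)."""
    gene_mask = np.asarray(gene_mask, dtype=bool)
    # An entry counts only if its gene is selected and its value is not NaN.
    valid = gene_mask & ~np.isnan(X)
    # `where=` restricts the reduction to valid entries without subsetting X.
    totals = np.sum(X, axis=1, where=valid)
    counts = valid.sum(axis=1)
    # Note: a row with no valid entries would divide by zero and yield NaN.
    return totals / counts


X = np.array([[1.0, 2.0, np.nan, 4.0],
              [0.0, 6.0, 3.0, 8.0]])
gene_mask = np.array([True, False, True, True])

scores = masked_gene_mean(X, gene_mask)
# Reference result: subset (which copies) and then take the nanmean.
reference = np.nanmean(X[:, gene_mask], axis=1)
```

Both approaches give the same per-cell values; the difference is only in whether an intermediate subset is materialized.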
Where does a mask make sense:
| Function | Should have mask | Has mask | Notes |
|---|---|---|---|
| `pp.calculate_qc_metrics` | N | N | |
| `pp.filter_cells` / `pp.filter_genes` | N | N | returns mask |
| `pp.highly_variable_genes` | N | N | returns mask |
| `pp.log1p` | Maybe | N | because scale has it? |
| `pp.normalize_total` | Maybe | N | because scale has it? |
| `pp.regress_out` | N | N | |
| `pp.scale` | Y | Y | |
| `pp.subsample` | Y | N | maybe as weighting[^1]; also add axis arg |
| `pp.downsample_counts` | N | N | |
| `pp.recipe_*` | N | N | |
| `pp.combat` | Y | N | |
| `pp.scrublet` / `pp.scrublet_*` | Maybe | N | maybe gene space |
| `pp.neighbors` | Maybe | N | creates obsp entry in relation to subset, like `.fit(X).transform(Y)` |
| `pp.pca` | Y | Y | has `mask_var` |
| `tl.tsne` / `tl.umap` / `tl.diffmap` / `tl.draw_graph` | Maybe | N | creates obsm entry; maybe as data source instead of `n_pcs` |
| `tl.embedding_density` | N | N | |
| `tl.leiden` / `tl.louvain` | Y | Y | as `restrict_to`[^2] |
| `tl.dendrogram` | Y | N | probably useful to make dendrograms for multiple subsets? |
| `tl.dpt` | Maybe | N | Ask people? |
| `tl.paga` | N | N | |
| `tl.ingest` | Maybe | N | might be useful to use only part as reference? maybe not. |
| `tl.rank_genes_groups` | N | N | `groups` + `reference` fulfills that purpose |
| `tl.filter_rank_genes_groups` | N | N | creates a mask |
| `tl.marker_gene_overlap` | N | N | |
| `tl.score_genes` | N | N | uses it internally, so probably don’t expose but refactor to use it[^3] |
| `tl.score_genes_cell_cycle` | N | N | |
| `get.aggregate` | Y | Y | |
[^1]: column types: bool: subset; numeric: rel. weight per observation; cat: biased sampling
[^2]: unify `restrict_to` and `mask`?
[^3]: todo: refactor `score_genes`
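The column-type idea in footnote 1 could look roughly like the sketch below. It covers only the boolean and numeric cases; the categorical case is left out because the footnote does not say how the bias would be defined. The function `sampling_probabilities` is hypothetical and not part of scanpy.

```python
import numpy as np


def sampling_probabilities(column):
    """Turn a bool or numeric column into sampling probabilities (hypothetical)."""
    column = np.asarray(column)
    if column.dtype == bool:
        # Boolean column: plain subset, every selected observation equally likely.
        weights = column.astype(float)
    elif np.issubdtype(column.dtype, np.number):
        # Numeric column: interpreted as a relative weight per observation.
        weights = column.astype(float)
        if np.any(weights < 0):
            raise ValueError("weights must be non-negative")
    else:
        raise TypeError(f"unsupported column dtype: {column.dtype}")
    total = weights.sum()
    if total == 0:
        raise ValueError("column selects no observations")
    return weights / total


rng = np.random.default_rng(0)
mask = np.array([True, False, True, True, False])
probs = sampling_probabilities(mask)
# Draw two distinct observations; masked-out ones have probability zero.
picked = rng.choice(mask.size, size=2, replace=False, p=probs)
```

Treating a boolean mask as a special case of weights keeps a single code path for both column types.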
I think we should introduce a standardized “mask” argument to scanpy functions. This would be a boolean array (or reference to a boolean array in `obs`/`var`) which masks out certain data entries. This can be thought of as a generalization of how highly variable genes are handled. As an example:
Would be equivalent to:
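A minimal NumPy-only sketch of this kind of equivalence follows. It does not use scanpy's API: the function `pca_loadings` and its `mask_var` argument are assumptions for illustration. Running with a variable mask gives the same components as subsetting the matrix first, but the result stays aligned with the full set of variables.

```python
import numpy as np


def pca_loadings(X, n_comps, mask_var=None):
    """PCA loadings on masked variables, returned at full size (hypothetical)."""
    X = np.asarray(X, dtype=float)
    if mask_var is None:
        mask_var = np.ones(X.shape[1], dtype=bool)
    mask_var = np.asarray(mask_var, dtype=bool)
    # Only the selected variables enter the decomposition.
    sub = X[:, mask_var]
    sub = sub - sub.mean(axis=0)
    _, _, vt = np.linalg.svd(sub, full_matrices=False)
    # Write the loadings back into a full-size array; masked-out rows stay zero.
    loadings = np.zeros((X.shape[1], n_comps))
    loadings[mask_var] = vt[:n_comps].T
    return loadings


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
highly_variable = np.array([True, False, True, True, False])

with_mask = pca_loadings(X, n_comps=2, mask_var=highly_variable)
on_subset = pca_loadings(X[:, highly_variable], n_comps=2)
```

Here `with_mask[highly_variable]` matches `on_subset`, while the caller never has to build a second object or realign the output.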
One of the big advantages of making this more widespread is that tasks which previously required using `.raw` or creating new anndata objects will be much easier.
Some uses for this change:
Plotting
A big one is plotting. Right now if you want to show gene expression for a subset of cells, you have to manually work with the Matplotlib Axes:
If a user could provide a mask, this could be reduced, and would make plotting more than one value possible:
Other uses
This has come up before in a few contexts:
Implementation
I think this could fit quite well into the `sc.get` getter/validation functions (https://github.com/scverse/scanpy/issues/828#issuecomment-560072919).
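One way such a validation helper might look is sketched below. This is an assumption for illustration, not an existing `sc.get` function: `resolve_mask` is a hypothetical name, and a plain dict of arrays stands in for the `obs` or `var` columns so the example stays self-contained.

```python
import numpy as np


def resolve_mask(mask, columns, n):
    """Normalize a mask argument to a boolean array of length n (hypothetical).

    `mask` may be None (select everything), a key into `columns`,
    or an array-like of booleans.
    """
    if mask is None:
        return np.ones(n, dtype=bool)
    if isinstance(mask, str):
        # A string is treated as a reference to a stored boolean column.
        if mask not in columns:
            raise KeyError(f"no column named {mask!r}")
        mask = columns[mask]
    mask = np.asarray(mask)
    if mask.dtype != bool:
        raise TypeError("mask must be boolean")
    if mask.shape != (n,):
        raise ValueError(f"mask has shape {mask.shape}, expected ({n},)")
    return mask


var_columns = {"highly_variable": np.array([True, False, True])}

by_key = resolve_mask("highly_variable", var_columns, n=3)
by_array = resolve_mask([False, True, True], var_columns, n=3)
everything = resolve_mask(None, var_columns, n=3)
```

Centralizing the checks in one place would let every function that accepts a mask share the same error messages and accepted input types.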