Open hrj21 opened 1 month ago
Hi @hrj21
thank you for the praise!
I am grateful for your detailed description of the feature and for the examples; they make the request much easier to understand. I usually approach the subsetting of var_names as follows, which certainly blows up the memory when working with large objects:
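A minimal sketch of why this kind of subsetting is memory-hungry, assuming a plain NumPy array as a stand-in for the expression matrix and a hypothetical marker panel. With an AnnData object the analogous pattern would be something like `adata[:, markers].copy()`, which also materialises a second object in memory.

```python
import numpy as np

# Hypothetical marker panel and random data standing in for adata.X
var_names = np.array(["CD3", "CD14", "CD11b", "IFNg", "GLUT1"])
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, var_names.size)).astype(np.float32)

# Markers we want to cluster on (illustrative choice)
lineage = ["CD3", "CD14", "CD11b"]
keep = np.isin(var_names, lineage)

# Boolean indexing allocates a new array instead of returning a view,
# so the original and the subset both live in memory at the same time
X_sub = X[:, keep]

print(X_sub.shape)
print(np.shares_memory(X, X_sub))
print(X_sub.nbytes / X.nbytes)
```

For a panel of a few dozen markers and millions of events, this extra copy is what becomes expensive.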
However, I can set some time aside to implement a subsetting functionality that is similar to the use_highly_variable parameter in various scanpy functions.
functions. For context, when we compute a PCA on single-cell RNAseq data, we can use either all 10,000+ genes or we can select a subset of informative genes whose variability exceeds the expected noise of the data. We usually don't need that for flow or mass cytometry data, but I imagine that we can create a similar implementation here. The information of which feature was used will be encoded in the .var
part of the anndata
object. This way, you should be able to track which subset of features you used.
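A rough sketch of how such bookkeeping could look, assuming a pandas DataFrame as a stand-in for the .var table. The column name `use_for_clustering` is hypothetical and only mirrors the way scanpy stores a boolean `highly_variable` column in .var; it is not an existing parameter or column of this package.

```python
import numpy as np
import pandas as pd

# Stand-in for adata.var: one row per marker (hypothetical panel)
var = pd.DataFrame(index=["CD3", "CD14", "CD11b", "IFNg", "GLUT1"])

# Record which markers should drive the clustering (hypothetical column name)
lineage = ["CD3", "CD14", "CD11b"]
var["use_for_clustering"] = var.index.isin(lineage)

# Random data standing in for adata.X (events x markers)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, var.shape[0]))

# Only the flagged columns would be handed to the clustering step
mask = var["use_for_clustering"].to_numpy()
X_clust = X[:, mask]

# Later, the table itself documents which features were used
used = var.index[var["use_for_clustering"]].tolist()
print(used)
```

Keeping the flag in the feature table, rather than creating a subsetted object, means the full data stays intact and the choice of markers travels with the object.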
I do have to mention that I have not checked whether the FlowSOM package has this functionality already.
Best, Maren
Description of feature
Hello!
Thanks again for the excellent package; I've been using it lately with great success (and fun!). My feature request is for the ability to choose a subset of var_names to cluster the observations on. This is available in the FlowSOM R and Python packages, and is a common and useful tool to control clustering. There are a few use cases for this:
The first is when we can partition our antigens into those that define clear lineages of cells (e.g. CD3, CD14, CD11b), and those that describe the functional state of cells (e.g. cytokines, metabolic markers). Restricting clustering to only the lineage markers sometimes gives better resolution between populations, whose activation state can then be studied using the functional markers.
Secondly, we have performed studies where the question was "can marker set A be used to independently identify the same cells as identified by marker set B" (specifically, whether metabolic antigens alone can be used to identify leucocyte populations). In this case, being able to select antigens for a particular clustering model was central to the experiment.
And finally, sometimes we might just have a dud marker that either wasn't expressed or the antibody didn't work, and it simply adds noise.
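To make the three use cases concrete, here is a minimal sketch of a selection helper. The function `marker_mask` and its `include`/`exclude` arguments are hypothetical, as is the marker panel; this is not an existing API of this package or of FlowSOM.

```python
import numpy as np

def marker_mask(var_names, include=None, exclude=None):
    """Return a boolean mask over var_names (hypothetical helper).

    include: markers to keep (all markers if None)
    exclude: markers to drop, e.g. a failed antibody
    """
    var_names = np.asarray(var_names)
    known = set(var_names.tolist())
    requested = list(include or []) + list(exclude or [])
    unknown = [m for m in requested if m not in known]
    if unknown:
        raise KeyError(f"Markers not found in var_names: {unknown}")
    if include is None:
        mask = np.ones(var_names.size, dtype=bool)
    else:
        mask = np.isin(var_names, include)
    if exclude is not None:
        mask &= ~np.isin(var_names, exclude)
    return mask

# Hypothetical panel; CD56 plays the role of the dud marker
panel = ["CD3", "CD14", "CD11b", "IFNg", "GLUT1", "CD56"]

# Use case 1 and 2: cluster on one chosen marker set only
lineage_mask = marker_mask(panel, include=["CD3", "CD14", "CD11b"])

# Use case 3: keep everything except a marker that did not work
no_dud_mask = marker_mask(panel, exclude=["CD56"])
```

Either mask could then select the columns passed to the clustering step, while the functional markers remain available for downstream comparison.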
If there's a convenient way to do this already, please forgive me!
Best wishes, Hefin