scverse / pytometry

Flow & mass cytometry analytics.
https://pytometry.readthedocs.io/en/latest/index.html
Apache License 2.0

Choose markers to use in FlowSOM clustering #78

Open hrj21 opened 1 month ago

hrj21 commented 1 month ago

Description of feature

Hello!

Thanks again for the excellent package; I've been using it lately with great success (and fun!). My feature request is for the ability to choose a subset of var_names to cluster the observations on. This is available in the FlowSOM R and Python packages, and is a common and useful tool to control clustering.

There are a few use cases for this:

The first is when we can partition our antigens into those that define clear lineages of cells (e.g. CD3, CD14, CD11b), and those that describe the functional state of cells (e.g. cytokines, metabolic markers). Restricting clustering to the lineage markers sometimes gives better resolution between populations, whose activation state can then be studied using the functional markers.

Secondly, we have performed studies where the question was "can marker set A be used to independently identify the same cells as identified by marker set B" (specifically, whether metabolic antigens alone can be used to identify leucocyte populations). In this case, being able to select antigens for a particular clustering model was central to the experiment.

And finally, sometimes we might just have a dud marker that either wasn't expressed or the antibody didn't work, and it simply adds noise.

If there's a convenient way to do this already, please forgive me!

Best wishes, Hefin

mbuttner commented 3 weeks ago

Hi @hrj21

Thank you for the praise! I am grateful for the detailed description of the feature request and the examples; they make the request much easier to understand. I usually approach the subsetting of var_names as follows, which certainly blows up memory when working with large objects:
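A minimal sketch of this kind of subset-then-cluster workflow (not the exact snippet from the comment), using plain NumPy so it runs on its own. The marker names, the `toy_cluster` helper and the array sizes are all hypothetical. With an AnnData object, the analogous step would presumably be `adata[:, cluster_on].copy()`, which materialises the selected columns as a second object.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical panel: 1,000 events x 6 markers (names are illustrative only).
var_names = ["CD3", "CD14", "CD11b", "IFNg", "TNFa", "dud_marker"]
X = rng.normal(size=(1000, len(var_names)))

# Markers chosen for clustering (lineage markers in this example).
cluster_on = ["CD3", "CD14", "CD11b"]
col_idx = [var_names.index(name) for name in cluster_on]

# Indexing with a list of columns returns a new array, so the selected
# columns now exist twice in memory. This is the cost mentioned above.
X_sub = X[:, col_idx]


def toy_cluster(data, n_clusters=4, seed=0):
    """Stand-in for a real clustering call (NOT FlowSOM): assign each
    event to the nearest of n_clusters randomly chosen events."""
    local_rng = np.random.default_rng(seed)
    picks = local_rng.choice(data.shape[0], size=n_clusters, replace=False)
    centres = data[picks]
    dists = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


# One label per event; these would then be written back to the full object.
labels = toy_cluster(X_sub)
```

The labels have one entry per event, so they can be attached to the full dataset even though only three of the six markers were used to compute them.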

However, I can set some time aside to implement subsetting functionality similar to the use_highly_variable parameter in various scanpy functions. For context, when we compute a PCA on single-cell RNA-seq data, we can use either all 10,000+ genes or a subset of informative genes whose variability exceeds the expected noise of the data. We usually don't need that for flow or mass cytometry data, but I imagine that we can create a similar implementation here. Which features were used will be recorded in the .var part of the anndata object, so you should be able to track the subset of features behind each clustering. I should mention that I have not checked whether the FlowSOM package has this functionality already.
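To illustrate the bookkeeping idea, here is a rough sketch of recording the chosen markers as a boolean column in a `.var`-like table, in the spirit of scanpy's `highly_variable` column. The column name `use_for_clustering`, the `mark_features` helper and the marker names are assumptions for illustration only, not part of pytometry or scanpy, and the eventual implementation may look quite different.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for adata.var: one row per marker.
var = pd.DataFrame(index=["CD3", "CD14", "CD11b", "IFNg", "TNFa", "dud_marker"])

rng = np.random.default_rng(1)
X = rng.normal(size=(500, var.shape[0]))  # events x markers


def mark_features(var_table, selected, key="use_for_clustering"):
    """Return a copy of var_table with a boolean column flagging the
    markers in `selected`. Raises if a requested marker is unknown."""
    missing = [name for name in selected if name not in var_table.index]
    if missing:
        raise KeyError(f"markers not found in var index: {missing}")
    out = var_table.copy()
    out[key] = out.index.isin(selected)
    return out


var = mark_features(var, ["CD3", "CD14", "CD11b"])

# A clustering function could read the flag instead of taking a subset object.
mask = var["use_for_clustering"].to_numpy()
X_used = X[:, mask]
used_markers = list(var.index[mask])
```

Because the flag lives next to the marker names, the subset used for a given clustering can be looked up later from the table itself. Note that the boolean indexing here still copies the selected columns; the sketch only covers the tracking aspect.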

Best, Maren