theislab / spapros

Python package for Probe set selection for targeted spatial transcriptomics.
MIT License

Check why combinatorial markers are shown for filtered cell types #205

Closed by strasserle 1 year ago

strasserle commented 2 years ago

Describe the bug

On the small example adata, several cell types have fewer than min_test_n cells in the test set, but combinatorial markers for them are still shown in the masked dotplot.

To Reproduce

Steps to reproduce the behavior:

  1. adata = sc.datasets.pbmc3k()
  2. sc.pp.log1p(adata)
  3. selector = ProbesetSelector(adata, celltype_key = "celltype")
     Output: The following celltypes' test set sizes for forest training are below min_test_n (=20): celltype_4 : 5, celltype_5 : 9, celltype_8 : 1
  4. selector.select_probeset()
     Output: UserWarning: Zero cells of celltype celltype_8 in train or test set. No tree is calculated for celltype celltype_8. Zero cells of celltype celltype_8 in train or test set. Celltype celltype_8 is not included as reference celltype.
  5. selector.probeset.index[selector.probeset.selection]

Package versions: scanpy==1.8.1, anndata==0.7.8, umap==0.5.2, numpy==1.21.4, scipy==1.7.2, pandas==1.3.4, scikit-learn==1.0.1, statsmodels==0.13.1, python-igraph==0.9.8, pynndescent==0.5.5, spapros==0.1.0

Expected behavior

Cell types with too few cells (celltype_4, celltype_5, and celltype_8) are expected to be excluded from random forest training, so no combinatorial markers should be found for them.

System:

Additional context

LouisK92 commented 1 year ago

The text was meant as a warning; the cell types are not excluded. This was clarified with an updated print message in commit 6397f03726ef1668351ae68d5281ca4e0d90c477 (lines 463-466):

The genes selected for those cell types potentially don't generalize well. 
Find the genes for each of those cell types in self.genes_of_primary_trees after running self.select_probeset().
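
For anyone who wants to see up front which cell types the warning applies to, here is a minimal sketch that counts test-set cells per cell type and lists those below the threshold. It is not the spapros implementation: the helper `celltypes_below_min_test_n` and the toy labels are hypothetical, and only the default of 20 is taken from the warning quoted in the reproduction steps.

```python
import pandas as pd


def celltypes_below_min_test_n(test_labels, min_test_n=20):
    """Return {celltype: n_cells} for cell types with fewer than min_test_n test cells.

    Hypothetical helper for illustration only; not part of the spapros API.
    """
    counts = pd.Series(test_labels).value_counts()
    small = counts[counts < min_test_n]
    # Sort by cell type name so the output order is stable.
    return {str(ct): int(n) for ct, n in small.sort_index().items()}


# Toy test-set labels (made-up data) mimicking the counts in the warning.
test_labels = (
    ["celltype_1"] * 40
    + ["celltype_4"] * 5
    + ["celltype_5"] * 9
    + ["celltype_8"] * 1
)

small = celltypes_below_min_test_n(test_labels, min_test_n=20)
for celltype, n_cells in small.items():
    print(f"{celltype} : {n_cells}")
```

Note that cell types with zero cells in the test set do not appear in the label list at all, so a check like this would not report them.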