theislab / cellrank

CellRank: dynamics from multi-view single-cell data
https://cellrank.org
BSD 3-Clause "New" or "Revised" License

discretizing error in `compute_macrostates` #980

Closed: jonas2612 closed this issue 1 year ago

jonas2612 commented 1 year ago

I am trying to run `compute_macrostates` and get the following error for a certain `n_states` parameter (in this case `n_states=9`): `` ValueError: Discretizing leads to a cluster with `0` samples, less than the threshold which is `1` samples. Consider recomputing the fuzzy clustering. `` ...

A Jupyter notebook reproducing the error is available on request (on the Helmholtz server).


```
ValueError                                Traceback (most recent call last)
Cell In[37], line 1
----> 1 g.compute_macrostates(n_states=9, cluster_key='state_info', n_cells=5)
      2 scv.set_figure_params('scvelo', transparent=True, fontsize=20, color_map='viridis')
      3 g.plot_macrostates(discrete=True, basis="umap", legend_loc="right", legend_fontweight='normal', legend_fontsize='12', dpi=250)

File ~/miniconda3/envs/neuralBenchmark_moscot/lib/python3.9/site-packages/cellrank/estimators/terminal_states/_gpcca.py:213, in GPCCA.compute_macrostates(self, n_states, n_cells, cluster_key, **kwargs)
    207     logg.warning(
    208         f"Unable to compute macrostates with `n_states={n_states}` because it will "
    209         f"split complex conjugate eigenvalues. Using `n_states={n_states + 1}`"
    210     )
    211     self._gpcca = self._gpcca.optimize(m=n_states + 1)
--> 213 self._set_macrostates(
    214     memberships=self._gpcca.memberships,
    215     n_cells=n_cells,
    216     cluster_key=cluster_key,
    217     params=self._create_params(),
    218     time=start,
    219 )
    220 return self

File ~/miniconda3/envs/neuralBenchmark_moscot/lib/python3.9/site-packages/cellrank/estimators/terminal_states/_gpcca.py:986, in GPCCA._set_macrostates(self, memberships, n_cells, cluster_key, check_row_sums, time, params)
    983     logg.debug("Setting the macrostates using macrostates memberships")
    985     # select the most likely cells from each macrostate
--> 986     assignment, not_enough_cells = self._create_states(
    987         memberships,
    988         n_cells=n_cells,
    989         check_row_sums=check_row_sums,
    990         return_not_enough_cells=True,
    991     )
    993 # remove previous fields
    994 self._write_terminal_states(None, None, None, None, log=False)

File ~/miniconda3/envs/neuralBenchmark_moscot/lib/python3.9/site-packages/cellrank/estimators/terminal_states/_gpcca.py:886, in GPCCA._create_states(self, probs, n_cells, check_row_sums, return_not_enough_cells)
    883 if n_cells <= 0:
    884     raise ValueError(f"Expected `n_cells` to be positive, found `{n_cells}`.")
--> 886 discrete, not_enough_cells = _fuzzy_to_discrete(
    887     a_fuzzy=probs,
    888     n_most_likely=n_cells,
    889     remove_overlap=False,
    890     raise_threshold=0.2,
    891     check_row_sums=check_row_sums,
    892 )
    894 states = _series_from_one_hot_matrix(
    895     membership=discrete,
    896     index=self.adata.obs_names,
    897     names=probs.names if isinstance(probs, Lineage) else None,
    898 )
    900 return (states, not_enough_cells) if return_not_enough_cells else states

File ~/miniconda3/envs/neuralBenchmark_moscot/lib/python3.9/site-packages/cellrank/_utils/_utils.py:1253, in _fuzzy_to_discrete(a_fuzzy, n_most_likely, remove_overlap, raise_threshold, check_row_sums)
   1251     if (n_samples_per_cluster < n_raise).any():
   1252         min_samples = np.min(n_samples_per_cluster)
-> 1253         raise ValueError(
   1254             f"Discretizing leads to a cluster with `{min_samples}` samples, less than the threshold which is "
   1255             f"`{n_raise}` samples. Consider recomputing the fuzzy clustering."
   1256         )
   1257 if (n_samples_per_cluster > n_most_likely).any():
   1258     raise ValueError("Assigned more samples than requested.")

ValueError: Discretizing leads to a cluster with `0` samples, less than the threshold which is `1` samples. Consider recomputing the fuzzy clustering.
```

Versions:

1.5.1+g525b847

Marius1311 commented 1 year ago

mhm, thanks for posting. How many cells do you have in your dataset?

Marius1311 commented 1 year ago

So the problem here is that you have a large overlap between macrostates. When you discretize to find the top N cells that represent each macrostate, one macrostate ends up with no cells, because its top cells have all already been assigned to another macrostate.
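To make the mechanism concrete, here is a minimal NumPy sketch of a top-N discretization on a toy membership matrix. It is not CellRank's `_fuzzy_to_discrete`. The function name `discretize_top_n`, the tie-breaking rule (a cell claimed by several macrostates goes to the one with the highest membership), and the toy matrix are all assumptions made for illustration. The threshold mimics the `raise_threshold=0.2` visible in the traceback, assuming it is scaled by the number of requested cells and rounded up, which would be consistent with the reported threshold of `1` for `n_cells=5`.

```python
import numpy as np


def discretize_top_n(memberships: np.ndarray, n_top: int) -> np.ndarray:
    """Hypothetical helper: pick the n_top most likely cells per macrostate.

    Cells claimed by several macrostates are given to the claiming
    macrostate with the highest membership (an assumed rule).
    """
    n_cells, n_states = memberships.shape

    # Mark, for every macrostate, its n_top highest-membership cells.
    selected = np.zeros((n_cells, n_states), dtype=bool)
    for j in range(n_states):
        top_cells = np.argsort(memberships[:, j])[-n_top:]
        selected[top_cells, j] = True

    # Resolve overlaps: among the macrostates claiming a cell, keep the strongest.
    masked = np.where(selected, memberships, -np.inf)
    winner = np.argmax(masked, axis=1)
    claimed = selected.any(axis=1)

    one_hot = np.zeros_like(selected)
    one_hot[np.flatnonzero(claimed), winner[claimed]] = True
    return one_hot


# Toy data: macrostate 1 peaks in the same cells as macrostate 0, only weaker.
memberships = np.array(
    [
        [0.5, 0.4, 0.1],
        [0.5, 0.4, 0.1],
        [0.1, 0.1, 0.8],
        [0.1, 0.1, 0.8],
    ]
)

n_top = 2
counts = discretize_top_n(memberships, n_top).sum(axis=0)

# Assumed threshold rule: raise_threshold * n_top, rounded up.
n_raise = int(np.ceil(0.2 * n_top))

print("cells per macrostate:", counts)
print("macrostates below threshold:", np.flatnonzero(counts < n_raise))
```

In this toy case macrostate 1 keeps zero cells, because both of its top cells are taken by macrostate 0. That is the kind of situation that would trigger the `ValueError` above.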

Marius1311 commented 1 year ago

This can happen if you have either too many macrostates, too few cells, or just two macrostates completely overlapping by chance. I don't think there's anything we can do about this right now.