ratan-lab / sumo

Subtyping tool for multi-omic data
https://pypi.org/project/python-sumo
MIT License

Minimum number of repetitions required for `sumo run` #20

Open aakrosh opened 4 years ago

aakrosh commented 4 years ago

`sumo run` fails with the following message when a small number of repetitions (`-n 2` in this case) is used.

Traceback (most recent call last):
  File "sumo/env/bin/sumo", line 11, in <module>
    load_entry_point('python-sumo', 'console_scripts', 'sumo')()
  File "sumo/src/sumo/run.py", line 15, in main
    mode.run()
  File "sumo/src/sumo/modes/run/run.py", line 150, in run
    results = [_run_factorization(sparsity=sparsity, k=k, sumo_run=_sumo_run) for sparsity in self.sparsity]
  File "sumo/src/sumo/modes/run/run.py", line 150, in <listcomp>
    results = [_run_factorization(sparsity=sparsity, k=k, sumo_run=_sumo_run) for sparsity in self.sparsity]
  File "sumo/src/sumo/modes/run/run.py", line 336, in _run_factorization
    consensus_labels = extract_ncut(consensus, k=k)
  File "sumo/src/sumo/utils.py", line 195, in extract_ncut
    u, s, vh = np.linalg.svd(np.eye(a.shape[0]) - d @ a @ d)
  File "<__array_function__ internals>", line 6, in svd
  File "sumo/env/lib/python3.6/site-packages/numpy/linalg/linalg.py", line 1626, in svd
    u, s, vh = gufunc(a, signature=signature, extobj=extobj)
ValueError: On entry to DLASCL parameter number 4 had an illegal value

Is there a minimum value that should be specified for a successful run?

sienkie commented 4 years ago

Version 0.2.5 introduced two new parameters ('-subsample' and '-rep') that increase the stability of results by making fuller use of the properties of consensus clustering.

The former parameter sets the fraction of samples that are randomly removed from each factorization. When deciding which samples to remove, we explicitly make sure that every sample is clustered at least once across all runs. The latter parameter sets the number of times a subset of all runs (a random 80% of runs) is used to create a consensus matrix.

The above error appears when some sample was not clustered in any of the runs in a subset. This is very unlikely with the default sumo parameters, as factorization is run 60 times and only 5% of samples are removed from each run.
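To illustrate the failure mode, here is a minimal sketch, not sumo's actual code. It assumes the consensus matrix counts how often two samples share a cluster, and that `d` in `extract_ncut` is the inverse square root of the degree matrix (the thread only shows the expression `np.eye(a.shape[0]) - d @ a @ d`). The helper names and the `-1` "left out" marker are hypothetical.

```python
import numpy as np

def consensus_from_runs(labels_per_run):
    """Fraction of runs in which each pair of samples shares a cluster.

    labels_per_run: sequence of 1-D int arrays, one per run;
    -1 marks a sample that was left out of that run (hypothetical convention).
    """
    labels = np.asarray(labels_per_run)
    n = labels.shape[1]
    together = np.zeros((n, n))
    for run in labels:
        present = run >= 0
        same = (run[:, None] == run[None, :]) & present[:, None] & present[None, :]
        together += same
    return together / len(labels)

def uncovered_samples(labels_per_run):
    """Indices of samples that were not clustered in any of the given runs."""
    labels = np.asarray(labels_per_run)
    return np.flatnonzero((labels >= 0).sum(axis=0) == 0)

# Two runs (as with -n 2); every sample is clustered at least once overall.
runs = [
    np.array([0, 0, 1, 1, -1]),  # sample 4 removed from this run
    np.array([0, -1, 1, 1, 1]),  # sample 1 removed from this run
]
subset = runs[:1]  # a subset of the runs that happens to miss sample 4

a = consensus_from_runs(subset)
degree = a.sum(axis=1)  # sample 4 has degree 0

with np.errstate(divide="ignore", invalid="ignore"):
    d = np.diag(1.0 / np.sqrt(degree))   # assumed form of d; 1/sqrt(0) = inf
    laplacian = np.eye(len(a)) - d @ a @ d  # inf * 0 produces NaN entries

print("uncovered in subset:", uncovered_samples(subset))
print("Laplacian is finite:", bool(np.isfinite(laplacian).all()))
```

Under these assumptions the matrix handed to `np.linalg.svd` contains non-finite entries, which is consistent with LAPACK rejecting its input as in the traceback above.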

For now, I recommend using a higher number of repetitions or setting the '-subsample' parameter to 0, which prevents this issue even if -n is very small. The underlying problem will still have to be addressed in the future.
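As a rough guide to how many repetitions are enough, the sketch below estimates the chance that at least one sample is missing from every run in a subset. It is a back-of-the-envelope approximation only: it assumes each sample is removed from each run independently with probability equal to the '-subsample' fraction, which ignores sumo's guarantee that every sample is clustered at least once overall. The function name and its arguments are hypothetical, and the number of runs in a subset is passed in directly because the thread does not say how 80% of the runs is rounded.

```python
import math

def prob_any_uncovered(n_samples, subsample, runs_in_subset):
    """Approximate probability that some sample appears in none of the runs.

    Assumes independent removal of each sample from each run with
    probability `subsample` (a simplification, see the text above).
    """
    p_missing = subsample ** runs_in_subset  # one sample missing from all runs
    if p_missing >= 1.0:
        return 1.0
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(n_samples * math.log1p(-p_missing))

# Defaults from the thread: 60 runs, so a subset holds 48 of them; 5% removed.
print(prob_any_uncovered(n_samples=100, subsample=0.05, runs_in_subset=48))
# A subset holding a single run, as can happen with -n 2.
print(prob_any_uncovered(n_samples=100, subsample=0.05, runs_in_subset=1))
# With '-subsample' set to 0 no sample is ever removed.
print(prob_any_uncovered(n_samples=100, subsample=0.0, runs_in_subset=1))
```

The sample count of 100 is an arbitrary illustration. Under this approximation the risk is negligible with the defaults and close to certain when a subset contains a single run.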