qiime2 / q2-sample-classifier

QIIME 2 plugin for machine learning prediction of sample data.
BSD 3-Clause "New" or "Revised" License
6 stars 36 forks source link

Possible stochastic failure related to bad linkage matrix #59

Open ebolyen opened 7 years ago

ebolyen commented 7 years ago

Bug Description It appears that it is possible for the visualizer to fail when it finds a "perfectly bad" split of the test and training data.

Expected Behavior @nbokulich says that this is very unlikely to happen, but we should look into reproducing and raising a different error. If a user runs into this, they should be able to just re-run their command.

Screenshots

Test failure/traceback ``` _____________________ EstimatorsTests.test_maturity_index ______________________ self = def test_maturity_index(self): maturity_index(self.temp_dir.name, self.table_ecam_fp, self.md_ecam_fp, category='month', group_by='delivery', n_jobs=1, > control='Vaginal', test_size=0.4) test-env/lib/python3.5/site-packages/q2_sample_classifier/tests/test_classifier.py:250: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ test-env/lib/python3.5/site-packages/q2_sample_classifier/classify.py:145: in maturity_index accuracy, output_dir, maz_stats=maz_stats) test-env/lib/python3.5/site-packages/q2_sample_classifier/utilities.py:412: in _visualize_maturity_index g = _clustermap_from_dataframe(top, metadata, group_by, category) test-env/lib/python3.5/site-packages/q2_sample_classifier/visuals.py:69: in _clustermap_from_dataframe row_cluster=False) test-env/lib/python3.5/site-packages/seaborn/matrix.py:1301: in clustermap **kwargs) test-env/lib/python3.5/site-packages/seaborn/matrix.py:1131: in plot row_linkage=row_linkage, col_linkage=col_linkage) test-env/lib/python3.5/site-packages/seaborn/matrix.py:1032: in plot_dendrograms axis=1, ax=self.ax_col_dendrogram, linkage=col_linkage) test-env/lib/python3.5/site-packages/seaborn/matrix.py:746: in dendrogram label=label, rotate=rotate) test-env/lib/python3.5/site-packages/seaborn/matrix.py:567: in __init__ self.dendrogram = self.calculate_dendrogram() test-env/lib/python3.5/site-packages/seaborn/matrix.py:644: in calculate_dendrogram color_threshold=-np.inf) test-env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py:2296: in dendrogram is_valid_linkage(Z, throw=True, name='Z') _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Z = array([], shape=(0, 4), dtype=float64), warning = False, throw = True name = 'Z' def is_valid_linkage(Z, warning=False, throw=False, name=None): """ Checks the validity of a linkage matrix. A linkage matrix is valid if it is a two dimensional array (type double) with :math:`n` rows and 4 columns. The first two columns must contain indices between 0 and :math:`2n-1`. For a given row ``i``, the following two expressions have to hold: .. math:: 0 \\leq \\mathtt{Z[i,0]} \\leq i+n-1 0 \\leq Z[i,1] \\leq i+n-1 I.e. a cluster cannot join another cluster unless the cluster being joined has been generated. Parameters ---------- Z : array_like Linkage matrix. warning : bool, optional When True, issues a Python warning if the linkage matrix passed is invalid. throw : bool, optional When True, throws a Python exception if the linkage matrix passed is invalid. name : str, optional This string refers to the variable name of the invalid linkage matrix. Returns ------- b : bool True if the inconsistency matrix is valid. """ Z = np.asarray(Z, order='c') valid = True name_str = "%r " % name if name else '' try: if type(Z) != np.ndarray: raise TypeError('Passed linkage argument %sis not a valid array.' % name_str) if Z.dtype != np.double: raise TypeError('Linkage matrix %smust contain doubles.' % name_str) if len(Z.shape) != 2: raise ValueError('Linkage matrix %smust have shape=2 (i.e. be ' 'two-dimensional).' % name_str) if Z.shape[1] != 4: raise ValueError('Linkage matrix %smust have 4 columns.' % name_str) if Z.shape[0] == 0: > raise ValueError('Linkage must be computed on at least two ' 'observations.') E ValueError: Linkage must be computed on at least two observations. test-env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py:1459: ValueError ==================== 1 failed, 12 passed in 150.50 seconds ===================== ```

Comments This is probably going to be difficult to fix.

References It appears

jairideout commented 7 years ago

For now, we will set a seed in the unit tests to prevent stochastic failure, and revert that commit when this bug is fixed.

ebolyen commented 7 years ago

PR #60 is in to set the seed @jairideout mentioned. It is possible we won't be able to revert it (if say the solution is to just raise an explanatory error, but we'll learn more later).

mortonjt commented 2 years ago

One possibility is to have the error message print out a histogram of this perfectly bad split -- there is a chance the user specified the wrong column