scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.79k stars 500 forks

Test suite fails intermittently in test_flat.py #420

Open markopy opened 4 years ago

markopy commented 4 years ago

The tests in test_flat.py seem to randomly fail. Here are some examples:

hdbscan/flat.py:703: UserWarning: HDBSCAN can only compute 24 clusters. Setting n_clusters to 24...
  warn(f"HDBSCAN can only compute {len(lambdas)+1} clusters. "
======================================================================
FAIL: Verify membership vector produces as many clusters as requested
----------------------------------------------------------------------
Traceback (most recent call last):
  File "nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "hdbscan/tests/test_flat.py", line 287, in test_mem_vec_diff_clusters
    assert_equal(memberships.shape[1], n_clusters_predict)
AssertionError: 10 != 9

======================================================================
FAIL: Verify membership vector produces as many clusters as requested
----------------------------------------------------------------------
Traceback (most recent call last):
  File "nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "hdbscan/tests/test_flat.py", line 368, in test_all_points_mem_vec_diff_clusters
    assert_equal(memberships.shape[1], n_clusters_predict)
AssertionError: 10 != 9
hdbscan/flat.py:703: UserWarning: HDBSCAN can only compute 25 clusters. Setting n_clusters to 25...
  warn(f"HDBSCAN can only compute {len(lambdas)+1} clusters. "
======================================================================
FAIL: Verify that approximate_predict_flat produces as many clusters as asked
----------------------------------------------------------------------
Traceback (most recent call last):
  File "nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "hdbscan/tests/test_flat.py", line 227, in test_approx_predict_diff_clusters
    assert_equal(n_clusters_out, n_clusters_predict)
AssertionError: 13 != 12

======================================================================
FAIL: Verify membership vector produces as many clusters as requested
----------------------------------------------------------------------
Traceback (most recent call last):
  File "nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "hdbscan/tests/test_flat.py", line 287, in test_mem_vec_diff_clusters
    assert_equal(memberships.shape[1], n_clusters_predict)
AssertionError: 10 != 9

======================================================================
FAIL: Verify membership vector produces as many clusters as requested
----------------------------------------------------------------------
Traceback (most recent call last):
  File "nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "hdbscan/tests/test_flat.py", line 368, in test_all_points_mem_vec_diff_clusters
    assert_equal(memberships.shape[1], n_clusters_predict)
AssertionError: 10 != 9

@sabarish-akridata

gansanay commented 3 years ago

Hi @markopy,

I am encountering the same kind of assertion errors in this file and trying to solve this.

Can you tell me which versions of Python and HDBSCAN dependencies (numpy, scipy, cython) you used?

Regards, Guillaume

markopy commented 3 years ago

@gansanay not 100% sure what my environment was for the initial report but I just ran it again:

======================================================================
FAIL: Verify that approximate_predict_flat produces as many clusters as asked
----------------------------------------------------------------------
Traceback (most recent call last):
  File "nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "hdbscan/tests/test_flat.py", line 227, in test_approx_predict_diff_clusters
    assert_equal(n_clusters_out, n_clusters_predict)
AssertionError: 11 != 12
----------------------------------------------------------------------
$ python -V
Python 3.7.7

$ pip freeze
Cython==0.29.7
hdbscan==0.8.26
joblib==0.17.0
nose==1.3.7
numpy==1.16.3
scikit-learn==0.20.3
scipy==1.2.1
six==1.15.0

The above versions are rather old but the issue persists after updating to:

$ python -V
Python 3.7.7

$ pip freeze
Cython==0.29.21
hdbscan==0.8.26
joblib==1.0.0
nose==1.3.7
numpy==1.19.5
scikit-learn==0.23.2
scipy==1.6.0
six==1.15.0
threadpoolctl==2.1.0

sabarish-akridata commented 3 years ago

Hi @markopy, thanks for pointing that out. I pushed these changes, and this was a mistake I made in writing the tests. Sometimes the flat clustering from HDBSCAN cannot produce as many clusters as requested, because the hierarchy doesn't have enough branches/leaves. I forgot about this when writing those tests and did not run them enough times to catch it. This is a problem with the tests, not with the code being tested. I'll fix these tests at some point.
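
A rough sketch of how a test could account for that limit, under assumptions: `achievable_n_clusters` is a hypothetical helper, not part of hdbscan, and the only thing taken from the logs above is the warning text, which suggests the ceiling is `len(lambdas) + 1`. It only covers the case where fewer clusters are available than were requested.

```python
def achievable_n_clusters(requested, n_lambdas):
    """Return the cluster count a test could expect (hypothetical helper).

    Assumption: the hierarchy can yield at most n_lambdas + 1 flat clusters,
    mirroring the UserWarning "HDBSCAN can only compute {len(lambdas)+1} clusters".
    """
    max_clusters = n_lambdas + 1
    # Clamp the request to what the hierarchy can actually provide.
    return min(requested, max_clusters)


# A request the hierarchy can satisfy is left unchanged.
print(achievable_n_clusters(9, 23))
# A request beyond the ceiling is clamped to n_lambdas + 1.
print(achievable_n_clusters(30, 23))
```

A test would then compare the observed cluster count against this clamped value rather than against the raw requested number.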

gansanay commented 3 years ago

Hi @sabarish-akridata,

In the meantime, could you point out which tests should be skipped?

🙏
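
Going only by the tracebacks posted in this thread, the names of the failing tests can be pulled straight from the nose output. A minimal sketch, assuming the log format shown above; `failing_tests` is a hypothetical helper, not part of hdbscan or nose, and other tests may fail on runs not captured here.

```python
import re

# Matches traceback lines such as:
#   File "hdbscan/tests/test_flat.py", line 287, in test_mem_vec_diff_clusters
FAILING_TEST = re.compile(r'File "[^"]*test_flat\.py", line \d+, in (test_\w+)')


def failing_tests(log_text):
    """Return the sorted, de-duplicated test names found in a nose log."""
    return sorted(set(FAILING_TEST.findall(log_text)))


# Traceback lines copied from the logs in this thread.
log = '''
  File "hdbscan/tests/test_flat.py", line 287, in test_mem_vec_diff_clusters
  File "hdbscan/tests/test_flat.py", line 368, in test_all_points_mem_vec_diff_clusters
  File "hdbscan/tests/test_flat.py", line 227, in test_approx_predict_diff_clusters
  File "hdbscan/tests/test_flat.py", line 287, in test_mem_vec_diff_clusters
'''

for name in failing_tests(log):
    print(name)
```

For the logs quoted in this issue, that gives `test_all_points_mem_vec_diff_clusters`, `test_approx_predict_diff_clusters`, and `test_mem_vec_diff_clusters`.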