Open kc-howe opened 8 months ago
Thanks for the issue @kc-howe! We have identified a few ill-behaviors and issues with spectral clustering from RAFT that affect UMAP in particular. We will be working on solving them, but we don't have an ETA yet, but is in our roadmap as we work on RAFT, cuML and cuVS in the next few releases.
For anyone who comes across on this issue, please see discussion in https://github.com/rapidsai/cuml/issues/5707#issuecomment-2115156120
Hey everyone,
Sorry for being late to this discussion. I think one of the problems here might be the assumption that cuML's UMAP will always yield the exact same results as the CPU-based reference implementation for the exact same parameter settings. We did some parallelism magic on the GPU side to speed up the algorithm and as a result, it's possible that some of the parameters (such as the number of iterations for the solver) might need to be tweaked a bit.
In addition, the underlying GPU-accelerated spectral embedding initialization primitive has gotten fairly old by this point and hasn't been updated in quite some time so it's been accumulating little bugs as CUDA versions increase and the code itself becomes more stale. I suggest trying to use the random initialization, along with adjusting the number of neighbors and the number of iterations to see if that improves the quality of your embeddings.
We have an engineer ramping up to fix the spectral clustering initialization and they will also be working to improve the end to end quality of the results. Again, I apology sincerely for the delay in replying to this thread.
Describe the bug
UMAP spectral initialization yields unexpected initial layout results. Consequently, global structure of the input data often is not preserved, even when a very high
n_epochs
parameter is used. If UMAP is used as a pre-processing step for clustering, this behavior can impact results significantly, depending on the geometry of the input data.Steps/Code to reproduce bug
The code below provides a minimal example which exhibits the issue on scikit-learn's
make_circles
dataset. Since the results of the cuML UMAP implementation may vary despite setting arandom_state
seed, one may have to repeatedly run the code below in order to get a result which demonstrates the problematic behavior.The resulting plot should look something like this:
Expected behavior
Spectral initialization should yield similar global structure as the CPU version of UMAP in the initial layout and consequently after layout optimization. For the most part at least, spatially distinct connected components of the input data should stay separated after embedding; UMAP with spectral initialization should not make clustering such cases more difficult (see rings example below).
The results of CPU and GPU UMAP should not be identical, of course, as there are understandably some implementation differences, particularly in regards to spectral embedding. However, general behavior with respect to global structure preservation under spectral initialization should be the same.
The following figure demonstrates the extreme difference in spectral initialization behavior between CPU and GPU UMAP on scikit-learn's
make_blobs
. Note that it seems impossible to setn_epochs
to precisely 0 in the cuML implementation without invoking default values, so a minimal value of 1 is used below. (Apologies for the small text, CPU results are in the top row, GPU results are in the bottom row.)For a dataset like
make_blobs
, the initialization is very different, but the CPU and GPU results can generally be made the same by running at a high number of epochs. Here is a similar figure demonstrating the difference on a dataset comprised of three non-concentric rings, which cannot be improved by setting highn_epochs
(see further below):I do not suspect that the difference in spectral initialization results can always be resolved by raising the
n_epochs
parameter, as has been suggested as a potential resolution to similar issues reported by other users (e.g. #5474). Since the low-dimensional layout optimization only acts on KNN-localized edge weights, I don't expect that any number of epochs would promise the recovery of global structure. To be sure, we can check that atn_epochs
values of 500 and 2000 we observe the same difference in behavior between CPU and GPU UMAP for the three rings dataset above.Environment details (please complete the following information):
Method of cuDF & cuML install: conda (environment details below)
Additional context
I initially felt this could be related to the known issue with the Laplacian eigenmaps solver mentioned in the comments of #5474, however the differences in results compared to the CPU solvers seem somewhat extreme.
I am also aware that CPU UMAP handles spectral layout of networks with multiple components somewhat differently than single-component networks. However, datasets which yield single-component graphs may exhibit the same behavior as above, e.g. when embedding a single ring.
I am happy to provide any additional code, examples, or environment details upon request.
Additionally, thank you all for the incredible work you do on this repository, and in particular for bringing UMAP to the GPU. You guys are phenomenal, and your efforts here are so deeply appreciated!