Closed hageldave closed 6 months ago
UMAP returns the coordinates on the largest connected components. Your random data doesn't have intristinc structure.
I don't think that is a correct bahavior. You may prove me wrong with supporting references to papers / manuals / API documentations. When I look at the corresponding Python implementation https://umap-learn.readthedocs.io/en/latest/api.html#umap.umap_.UMAP.fit_transform I see that there is a clear correspondence between the input and output size, and this is not subject to how good or bad my dataset is suited to be embedded with umap.
In the case here, I don't even know which samples were succesfully mapped.
Edit: just to be sure I tried the same thing in python.
So I've looked into smile's UMAP code, and I think I found something suspicious.
Here on the following line the k-nearest-neighbor-graph (KNNG) is computed (not approximated as the comment suggests, which is fine but may take more time) https://github.com/haifengl/smile/blob/9ee0aa73202ad8631ea6a7dc6a58b887d95be6c1/core/src/main/java/smile/manifold/UMAP.java#L243
Then, on the next line, this KNNG is truncated by extracting only its largest connected component.
https://github.com/haifengl/smile/blob/9ee0aa73202ad8631ea6a7dc6a58b887d95be6c1/core/src/main/java/smile/manifold/UMAP.java#L244
I think this is incorrect and there is no need to diminish the KNNG before handing it to the computeFuzzySimplicialSet(AdjacencyList, int, int)
method.
What do you think? (I have not tried this yet due to lack of time)
Just did a quick and dirty test, and got it to run and output the correct number of projected points. Here are the changes:
https://github.com/haifengl/smile/blob/9ee0aa73202ad8631ea6a7dc6a58b887d95be6c1/core/src/main/java/smile/manifold/UMAP.java#L246
the nng object is not needed. change to:
graph = computeFuzzySimplicialSet(graph, k, 64);
https://github.com/haifengl/smile/blob/9ee0aa73202ad8631ea6a7dc6a58b887d95be6c1/core/src/main/java/smile/manifold/UMAP.java#L262
nng.index is no longer in use, replace by indices of whole graph:
return new UMAP2(IntStream.range(0, data.length).toArray(), coordinates, graph);
https://github.com/haifengl/smile/blob/9ee0aa73202ad8631ea6a7dc6a58b887d95be6c1/core/src/main/java/smile/manifold/UMAP.java#L461
since using the complete graph now (not only a subset of it) this matrix is usually quite large. EVD is too slow, instead use SVD. This is possible since the graph Laplacian is positive semi definite and thus
L = AA^T [cholesky] = USV^T [singular] = US^(1/2)S^(1/2)V^T [U=V] = QΛQ^T [eigen]
So we replace it by:
Matrix.SVD svd = ARPACK.svd(L, Math.min(d, n));
https://github.com/haifengl/smile/blob/9ee0aa73202ad8631ea6a7dc6a58b887d95be6c1/core/src/main/java/smile/manifold/UMAP.java#L464
get the right eigenvectors from SVD now:
Matrix V = svd.V;
Any objections?
Not sure if it is correct to use a disjointed graph in math.
Describe the bug When running UMAP on a data array with 600 rows and 3 columns of uniform samples, UMAP returns an array with 29 rows and 2 columns or 31 rows (nondeterministic number of rows)
Expected behavior I would expect that it returns the same number of rows (600) so that there is a 1-to-1 mapping from input samples to output projection.
Actual behavior Returns 29 or 28 or 31 rows (probably depending on the randomness of the samples)
Code snippet
Input data input is generated by code snippet
Additional context
<groupId>com.github.haifengl</groupId><artifactId>smile-core</artifactId><version>2.5.0</version>