haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io
Other
5.99k stars 1.12k forks source link

UMAP does not output the same amount of instances as input #695

Closed hageldave closed 6 months ago

hageldave commented 2 years ago

Describe the bug When running UMAP on a data array with 600 rows and 3 columns of uniform samples, UMAP returns an array with 29 rows and 2 columns or 31 rows (nondeterministic number of rows)

Expected behavior I would expect that it returns the same number of rows (600) so that there is a 1-to-1 mapping from input samples to output projection.

Actual behavior Returns 29 or 28 or 31 rows (probably depending on the randomness of the samples)

Code snippet

    public static void main(String[] args) {
        double[][] data = new double[600][3];
        for(int i=0; i<data.length; i++) {
            for(int j=0; j<3; j++) {
                data[i][j] = Math.random();
            }
        }
        UMAP umap = UMAP.of(data, 2);
        if(umap.coordinates.length != data.length)
            throw new IllegalStateException("non bijective mapping");
    }

Input data input is generated by code snippet

Additional context

haifengl commented 2 years ago

UMAP returns the coordinates on the largest connected components. Your random data doesn't have intristinc structure.

hageldave commented 2 years ago

I don't think that is a correct bahavior. You may prove me wrong with supporting references to papers / manuals / API documentations. When I look at the corresponding Python implementation https://umap-learn.readthedocs.io/en/latest/api.html#umap.umap_.UMAP.fit_transform I see that there is a clear correspondence between the input and output size, and this is not subject to how good or bad my dataset is suited to be embedded with umap.

In the case here, I don't even know which samples were succesfully mapped.

Edit: just to be sure I tried the same thing in python. grafik

hageldave commented 2 years ago

So I've looked into smile's UMAP code, and I think I found something suspicious. Here on the following line the k-nearest-neighbor-graph (KNNG) is computed (not approximated as the comment suggests, which is fine but may take more time) https://github.com/haifengl/smile/blob/9ee0aa73202ad8631ea6a7dc6a58b887d95be6c1/core/src/main/java/smile/manifold/UMAP.java#L243 Then, on the next line, this KNNG is truncated by extracting only its largest connected component. https://github.com/haifengl/smile/blob/9ee0aa73202ad8631ea6a7dc6a58b887d95be6c1/core/src/main/java/smile/manifold/UMAP.java#L244 I think this is incorrect and there is no need to diminish the KNNG before handing it to the computeFuzzySimplicialSet(AdjacencyList, int, int) method.

What do you think? (I have not tried this yet due to lack of time)

hageldave commented 2 years ago

Just did a quick and dirty test, and got it to run and output the correct number of projected points. Here are the changes:


https://github.com/haifengl/smile/blob/9ee0aa73202ad8631ea6a7dc6a58b887d95be6c1/core/src/main/java/smile/manifold/UMAP.java#L246 the nng object is not needed. change to: graph = computeFuzzySimplicialSet(graph, k, 64);


https://github.com/haifengl/smile/blob/9ee0aa73202ad8631ea6a7dc6a58b887d95be6c1/core/src/main/java/smile/manifold/UMAP.java#L262 nng.index is no longer in use, replace by indices of whole graph: return new UMAP2(IntStream.range(0, data.length).toArray(), coordinates, graph);


https://github.com/haifengl/smile/blob/9ee0aa73202ad8631ea6a7dc6a58b887d95be6c1/core/src/main/java/smile/manifold/UMAP.java#L461 since using the complete graph now (not only a subset of it) this matrix is usually quite large. EVD is too slow, instead use SVD. This is possible since the graph Laplacian is positive semi definite and thus L = AA^T [cholesky] = USV^T [singular] = US^(1/2)S^(1/2)V^T [U=V] = QΛQ^T [eigen] So we replace it by: Matrix.SVD svd = ARPACK.svd(L, Math.min(d, n));


https://github.com/haifengl/smile/blob/9ee0aa73202ad8631ea6a7dc6a58b887d95be6c1/core/src/main/java/smile/manifold/UMAP.java#L464 get the right eigenvectors from SVD now: Matrix V = svd.V;


Any objections?

haifengl commented 2 years ago

Not sure if it is correct to use a disjointed graph in math.