Different results since v2

libscran / umappp

C++ port of the UMAP algorithm

https://libscran.github.io/umappp/

BSD 2-Clause "Simplified" License


Different results since v2 #24

Open hlesemann opened 4 days ago

hlesemann commented 4 days ago

Since upgrading to version 2, I seem to be getting rather strange results with the same data.

[image]

This is what it looked like before, and it is what I would expect from UMAP. In contrast, this is what I get now with the same data. Any idea what could cause a structure like this?

LTLA commented 4 days ago

Hm. Nothing obvious comes to mind. The only intended change was an increase in min_dist to match the defaults from uwot.

If I had to speculate, I would say that this looks a bit like the UMAP immediately after spectral initialization, i.e., no epochs. Did you call run() on the Status object?

hlesemann commented 4 days ago

Yes, sure. I experimented with the options and was able to get better results, but they always looked somewhat rectangular.

LTLA commented 4 days ago

I have no idea - I don't place any bounds on the coordinates.

Looking through the code, the only other intended change was how the a/b values were calculated, but that should just involve a change in the algorithm to improve convergence.

If you can send me a MRE, I can have a look at it. Just the nearest neighbor results (i.e., indices and distances) should be sufficient.

Edit: Another helpful thing would be to check whether the initialization or the iterations are affected, i.e., do you get comparable results with num_epochs = 0 or are they already different at that point?

jlmelville commented 3 days ago

If you can plot the actual values on the coordinate axes, that can help diagnose issues: an initialization where the interpoint distances are large could lead to problems where most edges get effectively zero gradient and never update.
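If plotting is awkward, printing the range of each axis gives the same information. Below is a minimal sketch of such a check, not anything umappp provides: `axis_ranges` is a hypothetical helper, and it assumes the embedding is stored point by point as [x0, y0, x1, y1, ...]; adjust the indexing if your layout differs.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not part of umappp: returns the (min, max) of each
// dimension of an embedding. Assumes consecutive points, i.e., the layout
// [x0, y0, x1, y1, ...] when ndim = 2.
std::vector<std::pair<double, double>> axis_ranges(const std::vector<double>& embedding, std::size_t ndim) {
    std::vector<std::pair<double, double>> ranges;
    if (ndim == 0 || embedding.size() < ndim) {
        return ranges; // nothing to summarize
    }

    const std::size_t nobs = embedding.size() / ndim;
    for (std::size_t d = 0; d < ndim; ++d) {
        // Start from the first point and scan the remaining ones.
        double lo = embedding[d];
        double hi = embedding[d];
        for (std::size_t i = 1; i < nobs; ++i) {
            const double v = embedding[i * ndim + d];
            lo = std::min(lo, v);
            hi = std::max(hi, v);
        }
        ranges.emplace_back(lo, hi);
    }
    return ranges;
}
```

Calling `axis_ranges(coords, 2)` once on the initial coordinates and once on the final ones would show whether the initialization is already spread over a very large range.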

Points coalescing at the edges of a plot can also be a sign of a learning rate that is too high, although I am surprised that the UMAP gradient does this, as I have usually only experienced it with different embedding methods (usually idiotic ones of my own invention).

hlesemann commented 3 days ago

Yes, this is what I got with num_epochs set to zero:

[image]

Can you give an example of what you mean by sending you the nearest neighbors? I don't know if this also helps, but here's an extract of the dataset: mre.json

LTLA commented 1 day ago

I'm assuming that, in your mre.json, each inner array contains the coordinates for an observation... but it seems that different observations have different numbers of coordinates. In R:

out <- jsonlite::fromJSON("mre.json", simplifyVector=FALSE)
table(lengths(out))
##   13   20
##  689 4263

In other words, most observations have 20 coordinates but others have 13. I'm not sure how you're doing the neighbor detection across different numbers of coordinates, but umappp won't do anything for missing values.
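If the data are being assembled on the C++ side, the same sanity check can be done before the neighbor search. This is only a sketch under the assumption that the observations sit in a vector of vectors; `tally_row_lengths` and `is_rectangular` are hypothetical helpers, not umappp functions.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper, not part of umappp: counts how many observations have
// each number of coordinates, mirroring table(lengths(out)) in R.
std::map<std::size_t, std::size_t> tally_row_lengths(const std::vector<std::vector<double>>& rows) {
    std::map<std::size_t, std::size_t> counts;
    for (const auto& row : rows) {
        ++counts[row.size()];
    }
    return counts;
}

// True only if every observation has the same number of coordinates
// (vacuously true for empty input).
bool is_rectangular(const std::vector<std::vector<double>>& rows) {
    return tally_row_lengths(rows).size() <= 1;
}
```

If `is_rectangular` returns false, the input should be fixed (or truncated to a common set of coordinates) before computing neighbors.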

Nonetheless, I tried to proceed by just taking the first 13 coordinates for each observation and running it through. (For convenience, I'm using the umappp R package in tests/R/package, which just wraps the C++ code.)

mat <- do.call(rbind, lapply(out, function(x) head(unlist(x), 13)))
nn <- BiocNeighbors::findKNN(mat, k=15)
coords <- umappp::runUmap(nn$index, nn$distance)
plot(coords[,1], coords[,2])

[image: scatter plot of coords[,1] against coords[,2]]

Looks pretty UMAP-ish to me.