jlmelville / uwot

An R package implementing the UMAP dimensionality reduction method.
https://jlmelville.github.io/uwot/
GNU General Public License v3.0

unrealistic point predict #128

Closed mytarmail closed 2 weeks ago

mytarmail commented 3 weeks ago

Hi! I'm not a professional in this, so my question may be strange, but why do the predictions look unrealistic, and can they be made more consistent with the data?

library(uwot)

set.seed(53)
X <- iris[sample(150), ]   # shuffle the rows of iris
tr <- 1:50                 # 50 rows for training
ts <- 51:150               # 100 held-out rows to transform

um <- tumap(X[tr, -5], ret_model = TRUE)      # fit on the training rows only
pr <- umap_transform(X[ts, -5], model = um)   # project the held-out rows

plot(um$embedding, lwd = 3, col = X$Species[tr])  # training embedding (bold points)
points(pr, col = X$Species[ts])                   # transformed test points

[Screenshot: transformed test points form a ring around the training clusters]

jlmelville commented 3 weeks ago

That's a good question that I honestly don't have an answer for. I'd really have to play around with how the balance of negative and positive forces plays out when embedding new points. My guess is that negative sampling implicitly relies on the dataset being fairly large, so that a randomly chosen point is likely to be a non-neighbor. Because the training set here is so small, a non-trivial number of quite close neighbors get picked as negative samples, and the resulting repulsion pushes each test point away, even if the attractive forces would be quite happy for the point to live "inside" the cluster. I'm just speculating, though.
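
One way to probe that speculation (an untested sketch, in the same speculative spirit): umap_transform takes an n_epochs argument, and with n_epochs = 0 the new points should stay at their initial coordinates, which by default are weighted averages of their nearest training neighbors, i.e. attraction-like placement with no repulsion ever applied. You can also turn the repulsion down at fit time via negative_sample_rate (default 5). If the ring softens or disappears under either setting, the negative-sampling repulsion is the likely culprit.

library(uwot)

set.seed(53)
X <- iris[sample(150), ]
tr <- 1:50
ts <- 51:150

# Weaken repulsion at fit time: fewer negative samples per positive edge.
um <- tumap(X[tr, -5], negative_sample_rate = 2, ret_model = TRUE)

# n_epochs = 0 should skip optimization, leaving the test points at their
# weighted-neighbor initialization (no repulsion applied); this assumes
# umap_transform honors n_epochs = 0, which I haven't verified.
pr0 <- umap_transform(X[ts, -5], model = um, n_epochs = 0)
pr  <- umap_transform(X[ts, -5], model = um)

par(mfrow = c(1, 2))
plot(um$embedding, lwd = 3, col = X$Species[tr], main = "n_epochs = 0")
points(pr0, col = X$Species[ts])
plot(um$embedding, lwd = 3, col = X$Species[tr], main = "default epochs")
points(pr, col = X$Species[ts])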

For what it's worth, this doesn't seem to be a bug in uwot. Python UMAP also shows this ring-like behavior:

import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn import datasets
from umap import UMAP

iris = datasets.load_iris()

iris_shuffled_data, iris_shuffled_target = shuffle(
    iris.data, iris.target, random_state=42
)

X_train = iris_shuffled_data[:50]
y_train = iris_shuffled_target[:50]

X_test = iris_shuffled_data[50:]
y_test = iris_shuffled_target[50:]

reducer = UMAP(random_state=42)  # UMAP was imported directly, not as umap.UMAP
embedding_train = reducer.fit_transform(X_train)  # fit on the 50 training points
embedding_test = reducer.transform(X_test)        # project the 100 held-out points

fig, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(12, 10))
ax[0].scatter(
    embedding_train[:, 0],
    embedding_train[:, 1],
    c=y_train,
)
ax[1].scatter(
    embedding_test[:, 0],
    embedding_test[:, 1],
    c=y_test,
)
plt.setp(ax[0], xticks=[], yticks=[])
plt.setp(ax[1], xticks=[], yticks=[])
ax[0].set_title("Training Set", fontsize=12)
ax[1].set_title("Test Set", fontsize=12)
plt.show()

[Image: side-by-side "Training Set" and "Test Set" panels; the transformed test points again form a ring around the training clusters]

mytarmail commented 3 weeks ago

Hello! Yes, this is definitely not a bug in your package. So there is no way to remove this effect? I have a small dataset of 30-45 observations, and I will need to forecast another 100 new observations that are not available to me yet. I also don't know how large the negative impact of this effect on the forecast will be.

jlmelville commented 3 weeks ago

Here are some random ideas: