jlmelville / uwot

An R package implementing the UMAP dimensionality reduction method.
https://jlmelville.github.io/uwot/
GNU General Public License v3.0

unrealistic point predict #128

Closed mytarmail closed 2 weeks ago

mytarmail commented 3 weeks ago

Hi! I'm not a professional in this, so my question may be strange, but why do the predictions look unrealistic, and can they be made more consistent with the data?

library(uwot)

set.seed(53)
X <- iris[sample(150), ]   # shuffle the rows of iris
tr <- 1:50                 # 50 rows for training
ts <- 51:150               # 100 held-out rows to transform

um <- tumap(X[tr, -5], ret_model = TRUE)      # fit on the training rows only
pr <- umap_transform(X[ts, -5], model = um)   # project the held-out rows

plot(um$embedding, lwd = 3, col = X$Species[tr])  # training embedding (bold points)
points(pr, col = X$Species[ts])                   # transformed test points

[Screenshot: transformed test points form a ring around the training clusters]

jlmelville commented 3 weeks ago

That's a good question that I honestly don't have an answer for. I'd really have to play around with how the balance of negative and positive forces plays out when embedding new points. My guess is that negative sampling implicitly relies on the dataset being fairly large, so that a randomly chosen point is likely to be a non-neighbor. Because the training set here is so small, a non-trivial number of quite close neighbors get picked as negative samples, and the resulting repulsion pushes each test point away, even if the attractive forces would be quite happy for the point to live "inside" the cluster. I'm just speculating, though.
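
One way to probe that speculation (an untested sketch, in the same speculative spirit): umap_transform takes an n_epochs argument, and with n_epochs = 0 the new points should stay at their initial coordinates, which by default are weighted averages of their nearest training neighbors, i.e. attraction-like placement with no repulsion ever applied. You can also turn the repulsion down at fit time via negative_sample_rate (default 5). If the ring softens or disappears under either setting, the negative-sampling repulsion is the likely culprit.

library(uwot)

set.seed(53)
X <- iris[sample(150), ]
tr <- 1:50
ts <- 51:150

# Weaken repulsion at fit time: fewer negative samples per positive edge.
um <- tumap(X[tr, -5], negative_sample_rate = 2, ret_model = TRUE)

# n_epochs = 0 should skip optimization, leaving the test points at their
# weighted-neighbor initialization (no repulsion applied); this assumes
# umap_transform honors n_epochs = 0, which I haven't verified.
pr0 <- umap_transform(X[ts, -5], model = um, n_epochs = 0)
pr  <- umap_transform(X[ts, -5], model = um)

par(mfrow = c(1, 2))
plot(um$embedding, lwd = 3, col = X$Species[tr], main = "n_epochs = 0")
points(pr0, col = X$Species[ts])
plot(um$embedding, lwd = 3, col = X$Species[tr], main = "default epochs")
points(pr, col = X$Species[ts])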

For what it's worth, this doesn't seem to be a bug in uwot. Python UMAP also shows this ring-like behavior:

import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn import datasets
from umap import UMAP

iris = datasets.load_iris()

iris_shuffled_data, iris_shuffled_target = shuffle(
    iris.data, iris.target, random_state=42
)

X_train = iris_shuffled_data[:50]
y_train = iris_shuffled_target[:50]

X_test = iris_shuffled_data[50:]
y_test = iris_shuffled_target[50:]

reducer = UMAP(random_state=42)  # UMAP was imported directly, not as umap.UMAP
embedding_train = reducer.fit_transform(X_train)  # fit on the 50 training points
embedding_test = reducer.transform(X_test)        # project the 100 held-out points

fig, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(12, 10))
ax[0].scatter(
    embedding_train[:, 0],
    embedding_train[:, 1],
    c=y_train,
)
ax[1].scatter(
    embedding_test[:, 0],
    embedding_test[:, 1],
    c=y_test,
)
plt.setp(ax[0], xticks=[], yticks=[])
plt.setp(ax[1], xticks=[], yticks=[])
ax[0].set_title("Training Set", fontsize=12)
ax[1].set_title("Test Set", fontsize=12)
plt.show()

[Image: side-by-side "Training Set" and "Test Set" panels; the transformed test points again form a ring around the training clusters]

mytarmail commented 3 weeks ago

Hello! Yes, this is definitely not a bug in your package. So there is no way to remove this effect? I have a small dataset of 30-45 observations, and I will need to forecast another 100 new observations that are not available to me yet. I also don't know how large the negative impact of this effect on the forecast will be.

jlmelville commented 3 weeks ago

Here are some random ideas: