That's a good question I honestly don't have an answer for. I'd really have to play around with how the balance of negative and positive forces plays out when embedding new points. My guess: negative sampling relies on the dataset being fairly large, so that a randomly chosen point is likely to be a non-neighbor. Because the training set here is so small, a non-trivial number of quite close neighbors get sampled as negatives, and the resulting repulsion pushes the test set point away, even if the attractive forces would be quite happy for the point to live "inside" the cluster. I'm just speculating, though.
For what it's worth, this doesn't seem to be a bug in uwot. Python UMAP also shows this ring-like behavior:
```python
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn import datasets
from umap import UMAP

iris = datasets.load_iris()
iris_shuffled_data, iris_shuffled_target = shuffle(
    iris.data, iris.target, random_state=42
)

# Fit on a small 50-point training set; treat the remaining 100 as "new" data
X_train = iris_shuffled_data[:50]
y_train = iris_shuffled_target[:50]
X_test = iris_shuffled_data[50:]
y_test = iris_shuffled_target[50:]

reducer = UMAP(random_state=42)
embedding_train = reducer.fit_transform(X_train)
embedding_test = reducer.transform(X_test)

fig, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(12, 10))
ax[0].scatter(embedding_train[:, 0], embedding_train[:, 1], c=y_train)
ax[1].scatter(embedding_test[:, 0], embedding_test[:, 1], c=y_test)
plt.setp(ax[0], xticks=[], yticks=[])
plt.setp(ax[1], xticks=[], yticks=[])
ax[0].set_title("Training Set", fontsize=12)
ax[1].set_title("Test Set", fontsize=12)
plt.show()
```
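For reference, here is a minimal sketch of the equivalent uwot workflow in R (not from the original report; the variable names are illustrative). `ret_model = TRUE` is what makes the fitted model reusable by `umap_transform`:

```r
library(uwot)

set.seed(42)
idx <- sample(nrow(iris))            # shuffle the rows
X <- as.matrix(iris[idx, 1:4])
X_train <- X[1:50, ]                 # small training set
X_test  <- X[51:150, ]               # "new" observations

# ret_model = TRUE returns a reusable model alongside the embedding
model <- umap(X_train, ret_model = TRUE)
embedding_train <- model$embedding
embedding_test  <- umap_transform(X_test, model)

plot(embedding_train, pch = 19, main = "Training Set")
plot(embedding_test,  pch = 19, main = "Test Set")
```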
Hello! Yes, this is definitely not a bug in your package. So there's no way to remove this effect? I have a small dataset of 30-45 observations, and I will need to predict another 100 new observations that are not available to me yet. I also don't know how much of a negative impact this effect will have on the predictions.
Here are some random ideas:

- You can look at the nearest neighbors that are calculated as part of umap_transform by adding the ret_extra = c("nn") argument (although I seem to have failed to update the documentation for the function to reflect that, oops); there's a sketch of this after the list.
- Those neighbors seem fine in the iris example, so although it looks odd, for clustering it's probably still doing the job.
- You can increase n_neighbors a bit, which ameliorates the effect slightly.
- You can embed the training and new observations together in a single run (i.e. rbind them into one new data frame/matrix).
- You can combine the observations, and likewise the embedded training coordinates with the transformed test coordinates (rbind them into one new matrix each), then run tumap with X = combined_observations, init = combined_embedded_coordinates, learning_rate = 0.1 (second sketch below). You will have to experiment with the exact value of the learning_rate, but you want it small enough that the original coordinates are not perturbed too much. This does help with the transformed coordinates in my experience, but it will also move the original "training" embedded coordinates. If that's not a deal-breaker, it will at least visually make the new points look better.
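Here is a minimal sketch of the ret_extra = c("nn") suggestion, reusing `model` and `X_test` from the R snippet above. Since the return shape is undocumented (as noted), the code just inspects it:

```r
# Request the nearest neighbors of the test points among the training
# points alongside the transformed coordinates.
res <- umap_transform(X_test, model, ret_extra = c("nn"))

# The return structure isn't documented yet (see above), so inspect it;
# if the neighbor indices/distances look sensible, the ring-like layout
# may matter less for downstream clustering.
str(res)
```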
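And a sketch of the last idea, again reusing objects from the R snippet above; `combined_observations` and `combined_embedded_coordinates` are the illustrative names from the suggestion:

```r
# Stack the raw data and the embedded coordinates in the same row order
combined_observations <- rbind(X_train, X_test)
combined_embedded_coordinates <- rbind(embedding_train, embedding_test)

# Re-optimize all points together, starting from the existing coordinates.
# A small learning_rate limits how far the original training coordinates
# drift during this refinement; experiment with the value.
refined <- tumap(
  combined_observations,
  init = combined_embedded_coordinates,
  learning_rate = 0.1
)
```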
Hi! I'm not a professional in this, so my question may be strange, but why do the predictions look so unrealistic, and can they be made more consistent with the data?