drob-xx opened this issue 1 year ago
@viclafargue @dantegd @cjnolet what do you think about making `hash_input=True` the default? Seems pretty reasonable on the surface, but I'm interested in your thoughts. Victor shared some additional context in a different issue.

EDIT: Looks like we already reached some level of agreement in that issue. This feels like a good first issue for a new contributor, but I'm going to tag it in the other issue.
I'm fine with that. It does introduce additional overhead, which is why we made the default False to begin with. Maybe we could add a quick note to the argument's docs stating that it's True by default but comes with an overhead, so that a user who never expects to do `fit(A).transform(A)` and get the exact same results can disable it.
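For concreteness, a minimal sketch of the behavior in question (synthetic data; the equality check reflects the intent described above, not a guaranteed API contract):

```python
import numpy as np
from cuml.manifold.umap import UMAP

A = np.random.rand(1000, 50).astype(np.float32)

# With hash_input=True, transform(A) on the exact array used for fitting
# detects the match (via a hash of the input) and returns the stored
# embedding_, so the two are identical.
model = UMAP(random_state=42, hash_input=True).fit(A)
assert np.allclose(model.embedding_, model.transform(A))

# With the current default, hash_input=False, transform(A) re-runs the
# transform and generally does not reproduce embedding_ exactly.
```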
I think it would be great to develop the docs a bit more to explain the `hash_input` attribute. I am running the UMAP implementation in cuML with the MNIST dataset from `tensorflow.keras`, and I get significantly different results depending on whether `hash_input` is True or not. I tried generating a UMAP using the `embedding_` attribute, the output of `transform()` on the training data, and also the output of `transform()` on test data.

My finding is that the value of `hash_input` only dramatically affects the output when I run `transform()` on the training data.
I am really wondering why this weird blob is being produced when `hash_input=False` and I call `transform(training_data)`. I don't see how anyone would prefer that over the corresponding output with `hash_input=True`. The fact that the limits of the axes are much bigger helps me understand why the points seem to converge into the blob, but then I don't get why the limits grow so much (from around -10/10 to -40/40). Not only that, the positions of the points with respect to one another are clearly less distinguishable (example: the cluster of points for digit 1).

I am also really wondering why this blob does not occur with `transform(test_data)`, which is a relief, because it suggests the fitted model will still be able to embed other datasets.
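To put numbers behind "dramatically affects", this is the kind of check I would suggest (a sketch, using `cuUMAP` and `data` as set up in the script below):

```python
import numpy as np

# Compare embedding_ against transform() on the training data for both
# settings of hash_input; only hash_input=True should make them match.
for hash_input in (True, False):
    model = cuUMAP(n_components=2, n_neighbors=60, min_dist=0.0,
                   random_state=42, hash_input=hash_input).fit(data)
    diff = np.abs(np.asarray(model.embedding_) -
                  np.asarray(model.transform(data))).max()
    print(f"hash_input={hash_input}: "
          f"max |embedding_ - transform(train)| = {diff:.4f}")
```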
To replicate the figures, run the script below (requires `tensorflow.keras`, `numpy`, `matplotlib`, and `cuml`).

PS: `n_components=2, n_neighbors=60, min_dist=0.0, random_state=42` are taken from an existing program which uses the CPU `umap.UMAP` implementation, and I would like to keep them the same so that I can compare both implementations (unless there is a good reason to change a parameter).
```python
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.datasets import mnist  # 60k train / 10k test datapoints

from cuml.manifold.umap import UMAP as cuUMAP


def load_tf_mnist():
    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()
    # Flatten the 28x28 images into 784-dimensional vectors
    images = np.reshape(train_images, (len(train_images), -1))
    test_images = np.reshape(test_images, (len(test_images), -1))
    return (images, train_labels), (test_images, test_labels)


def plot_umap(embedding, labels):
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
    plt.gca().set_aspect('equal', 'datalim')
    plt.colorbar(boundaries=np.arange(11) - 0.5).set_ticks(np.arange(10))
    plt.title('UMAP projection of the MNIST dataset', fontsize=24)


def main():
    (data, labels), (test_data, test_labels) = load_tf_mnist()

    # hash_input = True
    learned_embeddings = cuUMAP(n_components=2, n_neighbors=60, min_dist=0.0,
                                random_state=42, hash_input=True).fit(data)
    plot_umap(learned_embeddings.transform(data), labels)
    plt.savefig("gpu_umap_True_from_transform.png")
    plt.clf()
    plot_umap(learned_embeddings.embedding_, labels)
    plt.savefig("gpu_umap_True_from_embedding_.png")
    plt.clf()
    plot_umap(learned_embeddings.transform(test_data), test_labels)
    plt.savefig("gpu_umap_True_test_data.png")
    plt.clf()

    # hash_input = False
    learned_embeddings = cuUMAP(n_components=2, n_neighbors=60, min_dist=0.0,
                                random_state=42, hash_input=False).fit(data)
    plot_umap(learned_embeddings.transform(data), labels)
    plt.savefig("gpu_umap_False_from_transform.png")
    plt.clf()
    plot_umap(learned_embeddings.embedding_, labels)
    plt.savefig("gpu_umap_False_from_embedding_.png")
    plt.clf()
    plot_umap(learned_embeddings.transform(test_data), test_labels)
    plt.savefig("gpu_umap_False_test_data.png")
    plt.clf()


if __name__ == "__main__":
    main()
```
Any help would be very much appreciated, thanks!
For the sake of completeness, these are the UMAPs when I transform the training and test data combined.

Updated script: test_hashinput.zip
Now it's even more confusing, because with `hash_input=True` one can still get the garbled output. I don't get why it's fine with the train and test sets separately, but not when they are combined.
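For reference, this is what I mean by combining the sets (using the names from the earlier script). My guess is that the combined array no longer hashes to the same value as the training data, so even with `hash_input=True` the call takes the regular transform path:

```python
import numpy as np

combined_data = np.vstack([data, test_data])
combined_labels = np.concatenate([labels, test_labels])

# The combined array differs from the array passed to fit(), so the
# input-hash check cannot match and transform() is computed from scratch.
plot_umap(learned_embeddings.transform(combined_data), combined_labels)
plt.savefig("gpu_umap_True_combined.png")
```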
After using umap-learn for some time, I've written code that relies on `embedding_` being equal to the reduction from `transform()` on the training data. I just found out that without setting `hash_input=True` this will not be the case with cuML's UMAP. I was a bit surprised. I have since re-read the documentation, and while this difference is noted, it seems to me something of an unfortunate "gotcha". Perhaps I'm missing something, but it seems like the more conservative approach would be to default to the behavior of umap-learn and provide additional tuning parameters for those who want them. At a minimum it might be nice to have a warning here.
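Until the default changes, a user-side guard is one way to catch this (a sketch, not part of cuML's API; `fit_umap_checked` is a hypothetical helper):

```python
import warnings
import numpy as np

def fit_umap_checked(model, X):
    """Fit, then warn if transform(X) would not reproduce embedding_."""
    model.fit(X)
    if not np.allclose(np.asarray(model.embedding_),
                       np.asarray(model.transform(X))):
        warnings.warn("transform(train) does not match embedding_; "
                      "construct the UMAP with hash_input=True if you "
                      "rely on that equivalence")
    return model
```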