lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Multiple real valued labels #145

Open GCBallesteros opened 6 years ago

GCBallesteros commented 6 years ago

Hi,

I'm working on a regression problem with multiple real valued targets. An exception is thrown by UMAP (attached below). I assume that it happens because I'm passing a multidimensional array as labels. Am I doing something wrong or is this mode not supported by the algorithm/implementation?

Thanks for everything!

Edit: After digging into the parameters for umap I found target_metric which I set to 'l2', but I still get an error when my target has shape (n_samples, n_targets)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<timed exec> in <module>()

/usr/local/lib/python3.5/dist-packages/umap/umap_.py in fit_transform(self, X, y)
   1521             Embedding of the training data in low-dimensional space.
   1522         """
-> 1523         self.fit(X, y)
   1524         return self.embedding_
   1525 

/usr/local/lib/python3.5/dist-packages/umap/umap_.py in fit(self, X, y)
   1440                     far_dist = 1.0e12
   1441                 self.graph_ = categorical_simplicial_set_intersection(
-> 1442                     self.graph_, y, far_dist=far_dist
   1443                 )
   1444             else:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

lmcinnes commented 6 years ago

You are correct that multi-dimensional arrays are not supported for labels in the current implementation. Hopefully a future version will cope with this, although ultimately it would violate the sklearn API, so it may need to be handled in a different way.

In the meantime I can offer you a workaround. Issue #58 has some discussion of merging datasets that have different metrics. For your particular case, a comment in that thread provides an outline of what you want to do: just substitute the metric you want to use for the data and the metric you want to use for the labels ('l2' in this case) for the 'bray-curtis' and 'jaccard' that were used in the example. You will likely want to play with the mix_ratio to get a good balance between the data and the labels.
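
For readers wondering about the error message itself: the ValueError in the traceback is the generic one NumPy raises whenever an array with more than one element is used where a single True/False is expected. The sketch below only illustrates that behaviour with made-up arrays (y_1d and y_2d are hypothetical); it is not UMAP's actual code path, which this thread does not show.

```python
import numpy as np

# Hypothetical targets: one label per sample vs. three targets per sample
y_1d = np.array([0.0, 1.0, 2.0])
y_2d = np.array([[0.0, 1.0, 2.0],
                 [0.0, 1.0, 5.0]])

# With 1-D labels, comparing two entries yields a single boolean
same_1d = bool(y_1d[0] == y_1d[1])

# With 2-D labels, comparing two rows yields an array of booleans,
# which has no single truth value, so the `if` raises ValueError
try:
    if y_2d[0] == y_2d[1]:
        pass
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(same_1d)
print(error_message)
```

Any code that compares labels pairwise and branches on the result would presumably hit this once each label is a row rather than a scalar.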

GCBallesteros commented 6 years ago

Thanks for the suggestion and the quick reply. I will try it as soon as I can and come back with the result. Cheers.

GCBallesteros commented 6 years ago

I've been trying to do a fit for each of my targets and then intersect all of the results, without much success. I just get a big blob that actually looks worse than the results I get when I do a fit without passing the targets.

I was thinking of going into the code and modifying umap_.fit to change the call to sklearn.metrics.pairwise_distances so that distances are computed as l2norm(t1 - t2), where t1 and t2 are rows of my targets matrix, using sklearn.metrics.pairwise.euclidean_distances instead. Would this make sense?

fit1 = umap.UMAP(metric='l2').fit(X_umap, y=Y_umap[:, 0])
fit2 = umap.UMAP(metric='l2').fit(X_umap, y=Y_umap[:, 1])
fit3 = umap.UMAP(metric='l2').fit(X_umap, y=Y_umap[:, 2])
# Intersect all graphs
intersection = umap.umap_.general_simplicial_set_intersection(fit1.graph_, fit2.graph_, weight=0.5)
intersection = umap.umap_.general_simplicial_set_intersection(intersection, fit3.graph_, weight=1/3.)
intersection = umap.umap_.reset_local_connectivity(intersection)

embedding = umap.umap_.simplicial_set_embedding(fit1._raw_data, intersection, fit1.n_components, 
                                                fit1.initial_alpha, fit1._a, fit1._b, 
                                                fit1.repulsion_strength, fit1.negative_sample_rate, 
                                                200, 'random', np.random, fit1.metric, 
                                                fit1._metric_kwds, False)
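
The distance described in the question, l2norm(t1 - t2) over rows of the targets matrix, can be sketched in plain NumPy as follows. The helper name pairwise_l2 and the small Y array are made up for illustration; sklearn.metrics.pairwise.euclidean_distances computes the same quantity. Whether patching umap_.fit is the right place to use it is not settled in this thread.

```python
import numpy as np

def pairwise_l2(targets):
    """Return D with D[i, j] = ||targets[i] - targets[j]||_2 (hypothetical helper)."""
    # Broadcasting builds an (n, n, n_targets) array of row differences
    diffs = targets[:, None, :] - targets[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

# Made-up targets with shape (n_samples, n_targets) = (3, 2)
Y = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

D = pairwise_l2(Y)
print(D)
```

The result is a symmetric (n_samples, n_samples) matrix with zeros on the diagonal. The broadcasting approach uses memory proportional to n_samples squared times n_targets, so it is only meant for small inputs.
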
lmcinnes commented 6 years ago

Ah, I see. I think you want something more like:

import numpy as np
import umap

# Fit one model on the features and a separate one on the multi-column targets
fit1 = umap.UMAP(metric='l2').fit(X_umap)
fit2 = umap.UMAP(metric='l2').fit(Y_umap)
# Intersect the two graphs, then restore local connectivity
intersection = umap.umap_.general_simplicial_set_intersection(fit1.graph_, fit2.graph_, weight=0.25)
intersection = umap.umap_.reset_local_connectivity(intersection)
# Lay out the combined graph, reusing fit1's parameters
embedding = umap.umap_.simplicial_set_embedding(fit1._raw_data, intersection, fit1.n_components,
                                                fit1.initial_alpha, fit1._a, fit1._b,
                                                fit1.repulsion_strength, fit1.negative_sample_rate,
                                                200, 'random', np.random, fit1.metric,
                                                fit1._metric_kwds, False)

where the weight is a little arbitrary (you may have to play with it a little). That may well be essentially what you were describing doing above.

GCBallesteros commented 6 years ago

That worked beautifully! Thanks!

One question remains: how can I test new, unseen data points? I tried fit1.transform(test_features) because it was the only obvious thing to do, but that didn't work. Any ideas?

Thanks again for the awesome code!

lmcinnes commented 6 years ago

I think, unfortunately, that transforming new points through this custom pipeline is going to be non-trivial. It can be done, but I will have to work out exactly what incantations one would need to do so.

hlzl commented 3 years ago

The snippet lmcinnes posted above (fit1 on X_umap, fit2 on Y_umap, general_simplicial_set_intersection, then simplicial_set_embedding) no longer works as is; it throws AttributeError: 'UMAP' object has no attribute 'initial_alpha'.

Is there a way to get initial_alpha from somewhere, or should I set it arbitrarily? I couldn't find anything about the parameter in the documentation.

Additionally, simplicial_set_embedding() now seems to require the parameters densmap, densmap_kwds and output_dens even if densMAP is not used?

ZZKnight commented 3 years ago

Is the multi-label supervised/semi-supervised learning option available now?

lmcinnes commented 3 years ago

I think the best bet right now is to intersect with the label data via the model composition (i.e. build a model on data, a different model on labels and use the * operator on the models) -- see https://umap-learn.readthedocs.io/en/latest/composing_models.html

ZZKnight commented 3 years ago

Thanks for your suggestions.