lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Feature Request: Include loss_ as an attribute for the fitting #174

Open jolespin opened 5 years ago

jolespin commented 5 years ago

First off, thank you for implementing this method in Python! I'm very stoked to start using it for my bioinformatics datasets. I have been trying to quantify which parameters are best for my datasets and am having some trouble. Could the loss shown in your Enthought talk from SciPy 2018 (slide screenshot attached) be included as an attribute we can access later, so I can figure out which of my hyperparameter settings should be used?
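For reference (my understanding, not stated in this thread): the loss curve in that talk appears to be the fuzzy set cross-entropy from the UMAP paper, computed between the high- and low-dimensional fuzzy simplicial sets, roughly

```latex
C = \sum_{(i,j)} \left[ v_{ij} \log\frac{v_{ij}}{w_{ij}}
  + (1 - v_{ij}) \log\frac{1 - v_{ij}}{1 - w_{ij}} \right]
```

where the v_ij are the high-dimensional membership strengths and the w_ij the low-dimensional ones. Exposing the final value of C as a loss_ attribute is essentially what I'm asking for.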

I have a precomputed 137 x 137 distance matrix and am using the following hyperparameter configs:

import umap

for n_neighbors in [3, 4, 5, 6, 7]:
    for min_dist in [0.01, 0.1, 0.2, 0.3, 0.5]:
        for spread in [0.01, 0.1, 0.2, 0.3, 0.5]:
            for learning_rate in [1e-3, 1e-2, 1e-1, 1]:
                # X is the precomputed 137 x 137 distance matrix
                embedding = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                                      spread=spread, learning_rate=learning_rate,
                                      metric="precomputed").fit_transform(X)

Some of these configs produce structure that fits my hypothesis, and I want to know which one specifically I should choose, so I thought some sort of loss_ metric would be really useful in this scenario.

Also, if you have a few moments: can you describe why I sometimes see this topology, where the embedding looks like a regression line? (screenshots attached)

lmcinnes commented 5 years ago

Computing the full loss is potentially computationally nightmarish for large datasets, which is why it is not currently done. As for your current issue with weird structures appearing -- I believe that is due to a bug that was very recently fixed (see issue #170). If you upgrade to the latest version on pip, or install directly from the current master branch on GitHub, it should fix that problem.

jolespin commented 5 years ago

I installed via pip install git+https://github.com/lmcinnes/umap a few days ago. I found some other parameter combinations that give very interpretable results, so I will stick with those :) (screenshot attached)

Do you have any suggestions on ways to compare hyperparameter settings? I've manually gone through about 500 parameter configs and definitely noticed a pattern. The only thing I could think of was to cluster the embedding and compute a silhouette score, but is there anything I could grab from the actual model that would quantify that params_A are better than params_B?
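To make the silhouette idea concrete, this is the kind of scoring helper I had in mind (just a sketch: score_embedding and the toy arrays are mine, with fabricated 2-D point clouds standing in for UMAP embeddings produced under two different hyperparameter settings):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

def score_embedding(emb, n_clusters=3):
    """Cluster the embedding and return its silhouette score."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)
    return silhouette_score(emb, labels)

# Stand-ins for two configs: well-separated blobs vs. a diffuse cloud.
tight = np.concatenate([rng.normal(loc=c, scale=0.1, size=(50, 2)) for c in (0, 5, 10)])
loose = rng.normal(size=(150, 2)) * 5

scores = {"params_A": score_embedding(tight), "params_B": score_embedding(loose)}
best = max(scores, key=scores.get)
```

The obvious caveat is that this rewards whichever setting produces the most compact clusters, which may not be the same as the most faithful embedding.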

lmcinnes commented 5 years ago

You shouldn't need a learning rate that low -- I suspect that was simply working around the bug that was fixed very recently. You may want to try again with fresh code and see if you can get away with a higher learning rate.

As to comparing hyperparameters ... the main ones to change are n_neighbors and min_dist. Ultimately there are no right values, nor is one any better than the others; they are simply different views of your data. You can think of it as being loosely analogous to looking at 3D data from different viewing angles -- no one angle is more true than any other, but some angles may highlight different properties of the data than others. The parameters for UMAP are not quite so simple, but it comes down to a similar thing -- they offer you different lenses on the data, and ultimately the lens that helps you see relevant things is the useful one (as opposed to being the true one).

jolespin commented 5 years ago

Thanks, this is really helpful. What range of learning_rates would you recommend if one were to adjust these?

lmcinnes commented 5 years ago

I would start with a learning rate of 1.0 (the default). You can scale it down a bit if you need to, but realistically I wouldn't imagine you should need to go much below 0.5.