lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License
7.42k stars 806 forks source link

Rolling window umap #137

Closed MNoichl closed 6 years ago

MNoichl commented 6 years ago

TLDR: Is there a way to implement rolling time windows in umap, where new samples are embedded and old ones (and their influence on the embedding) are removed?

Thank you so much for umap, it has been very, very useful for me.

I am mainly using it to process bibliograpic data of scientific papers. Example: https://homepage.univie.ac.at/noichlm94/full/structphil/index.html

I made visualizations with embeddings in two dimensions, and am currently working on temporal embeddings, as shown below. (Every point represents a paper, their distance is computed from co-citations.)

download 7

The second embedding was made by reducing to one dimension which was then spread out over several years. This is not really satisfactory but, we can already see how clusters fade out after some time.

I have now experimented with doing one embedding (_fit) of the whole data, and then embedding every year seperately (_transform).

This has greatly improved the results, but is not totally satifying, as we see every year from the perspective of the whole history. Example from a different project (please ignore the heading):

download 9

I think it would be better (and epistemically more sound) if I could embed every year in the year(s) directly before it. For that I would have to embedd, lets say, the first five years, transform the sixth according to them, and then somehow remove the first, and so on.

Is there any way of doing this, while keeping the whole embedding constant?

lmcinnes commented 6 years ago

First of all, thanks for the link to your work -- it looks awesome!

As to your question, I am not sure I have a solution that completely fits your needs, but it may be sufficient.

You could embed the first 5 years, then transform the 6th. Then take the embedding of the points for all but the first year and save it off. The do an embedding for the next window (years 2-6) but use the ability to provide an initialisation in the form of pre-specified points (you can pass init=<numpy array>) to initialise that embedding with the positions for those points that you saved off. You can then transform in the 7th year, save off the appropriate embedding data and repeat the process, sliding along. That's rather compute intensive, so you might want to consider larger windows and longer slides if it is getting to be too much, but it will mostly do the job.

So the downside is that it won't hold the embedding fixed throughout -- it will evolve over time. On the other hand by setting the initialisation to the previous embedding you should have fairly consistent embedding from window to window so it will drift rather than change dramatically each time.

I'm not sure if it meets your needs, but hopefully it might work.

Please let me know how your project ends -- it looks intriguing!

MNoichl commented 6 years ago

Thank you so much for the encouragement and the comment. I came up with this code that I think implements what you describe. Weirdly my Jupyter-Kernel (and nteract, and running as .py) dies somewhere in the loop around 70% of the time for window_size=1, and 95% of the time for window_size<1.

Code:

import umap
from scipy import stats

n_neighbors = 10
window_size = 2
l = list(np.where(drc["year"] - drc["year"].min() <= window_size)[0]) #drc is where I keep the bibliographic data
L = M[l] #M is the sparse Matrix where I keep the vectors

trans = umap.UMAP(n_components=1, n_neighbors=n_neighbors, min_dist=0.01, metric='cosine').fit(L.toarray())

coordinates = []
for year in range(int(drc["year"].min()+window_size),int(drc["year"].max()-window_size)):
    l = list(np.where(drc["year"] == year)[0])
    L = M[l]

    z = []
    for w in range(0,window_size):
        k = list(np.where(drc["year"] == year+w-1)[0])
        z = z + k
    new_list = M[z]        

    if len(l) > n_neighbors:
        #Transform data on trans
            old_embedding = trans.embedding_
            #old_embedding = old_embedding[~np.isnan(old_embedding)]
            embedding = stats.zscore(trans.transform(L.toarray()))

            emb = pd.DataFrame(embedding)
            emb.columns = ['xI']

            emb["year"] = drc.iloc[l,:]["timestamp"].tolist()
            coordinates.append(emb)
            coordinates_full_df = pd.concat(coordinates, ignore_index=True)
            coordinates_full_df.to_csv('yearly_output.csv')
        #Renew trans:
            while True: #We have to do this, as umap sometimes fails and only returns NaNs
                trans = umap.UMAP(n_components=1,
                          n_neighbors=n_neighbors,#small => local, large => global: 5-50
                          min_dist=0.01, #small => local, large => global: 0.001-0.5
                          metric='cosine', 
                          init = old_embedding).fit(new_list.toarray())
                if np.isnan(trans.embedding_).any() ==True:
                    continue
                else:
                    break

coordinates_full_df = pd.concat(coordinates, ignore_index=True)

plt.figure()
plt.scatter(coordinates_full_df["xI"], coordinates_full_df["year"], s= 5)
plt.show()

Without the z-transformation this gives me something like this: download 13 Or this: download 14 When I z-transform the embedding for every year, I end up with this: download 15

So at the moment it is not as satisfying as the previous version, but it seems to work in principle. If you have any ideas about the code I would love to hear them. Otherwise thank you again for your help, and I will update if I make further progress.

lmcinnes commented 6 years ago

The z-transform seems to be helping, but I can't help but feel something is going astray here -- I would not in any way have expected the marching blocks effect seen in your first plot there. I don't see anything obviously wrong in your code, so now I'm a little suspicious in general of the approach I suggested. I'll try to think about it some more and see if I can come up with anything better.

MNoichl commented 6 years ago

So, I never got the rolling-window-thing to work. I have tried various different combinations of standardization techniques to keep the window from shifting, but I never got a nice and coherent structure out of it. But the problem was not that central to the project, which I just finished. I have produced the graphic below as a final version.

Thank you for your time, if you want to utilize it as a use-case for umap, please do.

full_struct

lmcinnes commented 6 years ago

Thanks so much for all your work, and your patience! I'm sorry we didn't manage to get the rolling windows to work -- it's a use case I will try to keep in mind and hopefully fix/make available at some point in the future. The final result looks great, and I will be sure to reference it as a use case for UMAP. Thanks again!

On Fri, Sep 28, 2018 at 10:03 AM MNoichl notifications@github.com wrote:

So, I never got the rolling-window-thing to work. I have tried various different combinations of standardization techniques to keep the window from shifting, but I never got a nice and coherent structure out of it. But the problem was not that central to the project, which I just finished. I have produced the graphic below as a final version. https://homepage.univie.ac.at/noichlm94/posts/structure-of-recent-philosophy-iii/

Thank you for your time, if you want to utilize it as a use-case for umap, please do.

[image: full_struct] https://user-images.githubusercontent.com/34612138/46211407-c32c2a00-c333-11e8-91b1-0e2b3a491b91.png

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lmcinnes/umap/issues/137#issuecomment-425445678, or mute the thread https://github.com/notifications/unsubscribe-auth/ALaKBW7ONlnWWcrVrUJwTZPwsnbdUgdTks5ufiwcgaJpZM4WbJUq .