Closed: MNoichl closed this issue 6 years ago
First of all, thanks for the link to your work -- it looks awesome!
As to your question, I am not sure I have a solution that completely fits your needs, but it may be sufficient.
You could embed the first 5 years, then transform the 6th. Then take the embedding of the points for all but the first year and save it off. Then do an embedding for the next window (years 2-6), using the ability to provide an initialisation in the form of pre-specified points (you can pass init=<numpy array>) to initialise that embedding with the positions you saved off. You can then transform the 7th year, save off the appropriate embedding data, and repeat the process, sliding along. That's rather compute-intensive, so you might want to consider larger windows and longer slides if it gets to be too much, but it will mostly do the job.
So the downside is that it won't hold the embedding fixed throughout -- it will evolve over time. On the other hand, by setting the initialisation to the previous embedding you should get a fairly consistent embedding from window to window, so it will drift rather than change dramatically each time.
I'm not sure if it meets your needs, but hopefully it might work.
Please let me know how your project ends -- it looks intriguing!
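In rough code, the sliding-window scheme described above might look something like the sketch below. `sliding_window_embed` and `make_embedder` are made-up names, and the embedder is kept generic (any object exposing `fit`, an `embedding_` attribute, and `transform` works) so the bookkeeping can be shown without fixing UMAP's parameters; with umap-learn you would pass something like `lambda init: umap.UMAP(n_components=1, init=init)`.

```python
import numpy as np

def sliding_window_embed(X, years, window, make_embedder):
    """Embed each year against the `window` years directly before it.

    make_embedder(init) must return an object with .fit(data) (which
    sets .embedding_) and .transform(data). With umap-learn this would
    be roughly: lambda init: umap.UMAP(n_components=1, init=init)
    (hypothetical wiring -- any fit/transform embedder fits the sketch).
    """
    years = np.asarray(years)
    start, stop = int(years.min()), int(years.max())

    # Fit the first window from scratch ('spectral' is UMAP's default init)
    fit_idx = np.where((years >= start) & (years < start + window))[0]
    emb = make_embedder("spectral")
    emb.fit(X[fit_idx])
    coords = dict(zip(fit_idx, emb.embedding_))

    per_year = {}
    for year in range(start + window, stop + 1):
        # Transform the new year into the current window's space
        new_idx = np.where(years == year)[0]
        new_coords = emb.transform(X[new_idx])
        per_year[year] = new_coords
        coords.update(zip(new_idx, new_coords))

        # Slide the window forward one year and re-fit, initialised with
        # the coordinates already computed for every point in the window
        fit_idx = np.where((years > year - window) & (years <= year))[0]
        init = np.array([coords[i] for i in fit_idx])
        emb = make_embedder(init)
        emb.fit(X[fit_idx])
        coords.update(zip(fit_idx, emb.embedding_))
    return per_year
```

Because the init array is built from the same index set the fit uses, its shape always matches the data being fitted, which UMAP requires when init is an ndarray.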
Thank you so much for the encouragement and the comment. I came up with this code, which I think implements what you describe. Weirdly, my Jupyter kernel (and nteract, and running it as a .py file) dies somewhere in the loop around 70% of the time for window_size=1, and 95% of the time for window_size<1.
Code:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import umap
from scipy import stats

n_neighbors = 10
window_size = 2

# drc is where I keep the bibliographic data
l = list(np.where(drc["year"] - drc["year"].min() <= window_size)[0])
L = M[l]  # M is the sparse matrix where I keep the vectors

# Fit the initial window
trans = umap.UMAP(n_components=1, n_neighbors=n_neighbors,
                  min_dist=0.01, metric='cosine').fit(L.toarray())

coordinates = []
for year in range(int(drc["year"].min() + window_size),
                  int(drc["year"].max() - window_size)):
    l = list(np.where(drc["year"] == year)[0])
    L = M[l]
    # Collect the indices of the current window
    z = []
    for w in range(0, window_size):
        k = list(np.where(drc["year"] == year + w - 1)[0])
        z = z + k
    new_list = M[z]
    if len(l) > n_neighbors:
        # Transform the new year's data on trans
        old_embedding = trans.embedding_
        # old_embedding = old_embedding[~np.isnan(old_embedding)]
        embedding = stats.zscore(trans.transform(L.toarray()))
        emb = pd.DataFrame(embedding)
        emb.columns = ['xI']
        emb["year"] = drc.iloc[l, :]["timestamp"].tolist()
        coordinates.append(emb)
        coordinates_full_df = pd.concat(coordinates, ignore_index=True)
        coordinates_full_df.to_csv('yearly_output.csv')
        # Renew trans. NB: when init is an array, umap expects one row per
        # sample in the data being fitted; if len(z) differs from the size
        # of the previous fit, the shapes will not match here, which may be
        # what is killing the kernel.
        while True:  # umap sometimes fails and returns only NaNs, so retry
            trans = umap.UMAP(n_components=1,
                              n_neighbors=n_neighbors,  # small => local, large => global: 5-50
                              min_dist=0.01,            # small => local, large => global: 0.001-0.5
                              metric='cosine',
                              init=old_embedding).fit(new_list.toarray())
            if not np.isnan(trans.embedding_).any():
                break

coordinates_full_df = pd.concat(coordinates, ignore_index=True)
plt.figure()
plt.scatter(coordinates_full_df["xI"], coordinates_full_df["year"], s=5)
plt.show()
```
Without the z-transformation this gives me something like this: [plot] Or this: [plot] When I z-transform the embedding for every year, I end up with this: [plot]
So at the moment it is not as satisfying as the previous version, but it seems to work in principle. If you have any ideas about the code, I would love to hear them. Otherwise, thank you again for your help; I will update if I make further progress.
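The per-year z-transform used above amounts to standardising each year's coordinates independently, so every year ends up centred at zero with unit variance before it is plotted. A toy version of just that step (made-up numbers, same `stats.zscore` call as in the loop):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up 1-D coordinates whose mean and spread drift from year to year
df = pd.DataFrame({
    "xI":   [0.0, 1.0, 2.0, 10.0, 12.0, 14.0],
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
})

# Z-score each year's coordinates separately: within every year the
# result has mean 0 and (population) standard deviation 1
df["xI_z"] = df.groupby("year")["xI"].transform(stats.zscore)
```

This removes each year's offset and scale, which is why the plots become comparable across years but also why any real between-year spread is flattened away.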
The z-transform seems to be helping, but I can't help but feel something is going astray here -- I would not in any way have expected the marching blocks effect seen in your first plot there. I don't see anything obviously wrong in your code, so now I'm a little suspicious in general of the approach I suggested. I'll try to think about it some more and see if I can come up with anything better.
So, I never got the rolling-window approach to work. I tried various combinations of standardization techniques to keep the window from shifting, but never got a nice, coherent structure out of it. But the problem was not that central to the project, which I have just finished. I have produced the graphic below as a final version.
Thank you for your time; if you want to use it as a use case for UMAP, please do.
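For anyone picking this thread up later: one standardisation worth trying (not something tested in this thread, just a sketch) is to align each window's 1-D coordinates to the previous window by a least-squares scale and shift fitted on the points the two windows share. In one dimension an orthogonal-Procrustes alignment reduces to exactly this, with a possibly negative slope absorbing UMAP's occasional axis flips between fits; `align_1d` is a made-up helper name.

```python
import numpy as np

def align_1d(new, ref):
    """Fit a, b so that a*new + b approximates ref (least squares).

    new and ref are the 1-D coordinates of the points shared by two
    consecutive windows; a negative slope a absorbs an axis flip.
    """
    new = np.asarray(new, dtype=float).ravel()
    ref = np.asarray(ref, dtype=float).ravel()
    a, b = np.polyfit(new, ref, 1)  # degree-1 fit: slope and intercept
    return a, b

# Toy usage: the "new" window came out flipped and rescaled
ref = np.array([0.0, 1.0, 2.0, 3.0])
new = -2.0 * ref + 5.0
a, b = align_1d(new, ref)
aligned = a * new + b  # apply the same map to the whole new window
```

Unlike z-scoring each year independently, this keeps each window in the coordinate frame of the one before it, so genuine drift survives while arbitrary flips and rescalings are removed.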
Thanks so much for all your work, and your patience! I'm sorry we didn't manage to get the rolling windows to work -- it's a use case I will try to keep in mind and hopefully fix/make available at some point in the future. The final result looks great, and I will be sure to reference it as a use case for UMAP. Thanks again!
On Fri, Sep 28, 2018 at 10:03 AM MNoichl notifications@github.com wrote:
So, I never got the rolling-window-thing to work. I have tried various different combinations of standardization techniques to keep the window from shifting, but I never got a nice and coherent structure out of it. But the problem was not that central to the project, which I just finished. I have produced the graphic below as a final version. https://homepage.univie.ac.at/noichlm94/posts/structure-of-recent-philosophy-iii/
Thank you for your time, if you want to utilize it as a use-case for umap, please do.
[image: full_struct] https://user-images.githubusercontent.com/34612138/46211407-c32c2a00-c333-11e8-91b1-0e2b3a491b91.png
View it on GitHub: https://github.com/lmcinnes/umap/issues/137#issuecomment-425445678
TLDR: Is there a way to implement rolling time windows in umap, where new samples are embedded and old ones (and their influence on the embedding) are removed?
Thank you so much for umap, it has been very, very useful for me.
I am mainly using it to process bibliographic data of scientific papers. Example: https://homepage.univie.ac.at/noichlm94/full/structphil/index.html
I made visualizations with embeddings in two dimensions, and am currently working on temporal embeddings, as shown below. (Every point represents a paper, their distance is computed from co-citations.)
The second embedding was made by reducing to one dimension, which was then spread out over several years. This is not really satisfactory, but we can already see how clusters fade out after some time.
I have now experimented with doing one embedding (.fit) of the whole data and then embedding every year separately (.transform).
This has greatly improved the results, but is not totally satisfying, as we see every year from the perspective of the whole history. Example from a different project (please ignore the heading):
I think it would be better (and epistemically more sound) if I could embed every year in the year(s) directly before it. For that I would have to embed, let's say, the first five years, transform the sixth according to them, then somehow remove the first, and so on.
Is there any way of doing this, while keeping the whole embedding constant?