Trajectory visualization

vincerubinetti commented 2 years ago

~This is still a WIP, but @danich1 take a look at the hover behavior. Right now, for collision, the labels are treated as circles, which is why you'll still see some overlap, but I'm working on doing it via rectangles, so it should look better soon.~

~I tried enlarging the nearby labels when the mouse passes over (and fading everything else), but it was still a bit hard to read because of the overlap. I also tried just doing that for the single label that is hovered, and that makes it more readable, but then you can only sort of see one label at a time.~

~So all that is to say, I think this is probably the best solution I could come up with.~

install lodash to make life easier
install d3 color and interpolate to offload some code
change font, change styling a little bit
refactor api code to make simpler with lodash util funcs
tweak frequency chart d3 code to be a little more efficient
extract wrapLines func to generic utility
add new "trajectory chart" in place of original umap idea, that shows a snake "path" through the years, with the top few words for each year
replace linear rgb color interpolation with nicer looking HCL color space interpolation
get rid of neighbors util funcs and move this functionality right into api code, and simplify with lodash
incorporate measuring actual rendered text width into wrapLines util func
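
For reference, the width-aware wrapping mentioned in the last item could look roughly like the sketch below. The `wrapLines` signature, the `MeasureFn` type, and the fixed 7-units-per-character measurer are assumptions made for illustration; the actual utility in this PR may be shaped differently, and in the browser the measurer would be swapped for one that measures rendered text.

```typescript
// Hypothetical signature; the real wrapLines utility may differ.
type MeasureFn = (text: string) => number;

// Greedy word wrap: keep appending words to the current line until the
// measured width would exceed maxWidth, then start a new line.
function wrapLines(text: string, maxWidth: number, measure: MeasureFn): string[] {
  const words = text.split(/\s+/).filter((word) => word.length > 0);
  const lines: string[] = [];
  let current = "";
  for (const word of words) {
    const candidate = current ? `${current} ${word}` : word;
    if (current && measure(candidate) > maxWidth) {
      lines.push(current);
      current = word;
    } else {
      // A single word wider than maxWidth still gets its own line.
      current = candidate;
    }
  }
  if (current) lines.push(current);
  return lines;
}

// Stand-in measurer (assumption): 7 units per character.
const approxMeasure: MeasureFn = (text) => text.length * 7;

const wrappedLines = wrapLines("coronavirus disease outbreak response", 140, approxMeasure);
```

Passing the measurer in as a parameter keeps the wrapping logic independent of how width is obtained, so the same function works with a rough estimate or a real measurement.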

netlify[bot] commented 2 years ago

Deploy Preview for word-lapse ready!

Name	Link
Latest commit	86a1f57b3358be4afd7b3f9d1d1819b75a709be0
Latest deploy log	https://app.netlify.com/sites/word-lapse/deploys/62793e0f99b24100095b091f
Deploy Preview	https://deploy-preview-49--word-lapse.netlify.app/

danich1 commented 2 years ago

Ah, I see what you mean. I +1 the slider idea for when we get to that point, as trying to show 20 years' worth all at once will be very messy. I don't know the maximum number of years that can be shown, but it's something to keep in mind as you progress on this visualization.

vincerubinetti commented 2 years ago

Ah, I forgot that there could be that many years. For this PR I've just been using the sample JSON you gave me, not the real API endpoint yet. I'll switch that back in now.

I can certainly do the slider idea, but in my mind that idea was kind of just for fun, so you could see an animation. I don't know if that would help the readability. It would cut the number of clusters in the above screenshot down, but there would still be overlap in the words.

This is a tough situation as there's just so much info to show at once. I'll keep working at it.

vincerubinetti commented 2 years ago

Just updated it to use rectangles (approximate, based on string character length) for collision instead of circles. It looks a little better, but there is a new problem: these tokens get very long and wide, so when they expand, they explode way off to the left and right of the plot. Also, the d3-bboxCollide library I'm using has a lot of problems and isn't maintained. I could write my own bbox collision (I know how I would do it), but it's quite a bit of extra work.
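
As a rough illustration of the approach described here, the sketch below estimates a label's rectangle from its character count and tests two rectangles for overlap. The `LabelRect` type, the function names, and the `CHAR_WIDTH` and `LINE_HEIGHT` ratios are assumptions chosen for the example, not values taken from this PR or from d3-bboxCollide.

```typescript
interface LabelRect {
  x: number; // left edge
  y: number; // top edge
  width: number;
  height: number;
}

// Assumed ratios, for illustration only.
const CHAR_WIDTH = 0.6; // average glyph width as a fraction of font size
const LINE_HEIGHT = 1.2; // line height as a fraction of font size

// Approximate a label's box from its character count, centered on (cx, cy).
function approxLabelRect(label: string, cx: number, cy: number, fontSize: number): LabelRect {
  const width = label.length * fontSize * CHAR_WIDTH;
  const height = fontSize * LINE_HEIGHT;
  return { x: cx - width / 2, y: cy - height / 2, width, height };
}

// Axis-aligned test: two boxes overlap only if they overlap on both axes.
function rectsOverlap(a: LabelRect, b: LabelRect): boolean {
  return (
    a.x < b.x + b.width &&
    b.x < a.x + a.width &&
    a.y < b.y + b.height &&
    b.y < a.y + a.height
  );
}

const labelA = approxLabelRect("pandemic", 0, 0, 10);
const labelBeside = approxLabelRect("epidemic", 30, 0, 10);
const labelBelow = approxLabelRect("epidemic", 0, 20, 10);

const overlapsBeside = rectsOverlap(labelA, labelBeside); // side by side, too close
const overlapsBelow = rectsOverlap(labelA, labelBelow); // stacked, far enough apart
```

Because the estimated width grows with character count while the height stays fixed, long tokens need far more horizontal than vertical separation, which is consistent with labels being pushed off to the left and right.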

Here are some other ideas:

Just make all the text really small, to the point where no neighbor text overlaps, and then allow the user to zoom in with mouse wheel or pinch zoom (just on the SVG, not zooming in on the whole browser). Doing this, I could also un-overlap them all with the physics, permanently, just once, so no expanding on hover. (I can't do this now because if I do, they all expand so far away from their original point that you can't see the high-level cluster structures anymore).
Is there any way we can determine just a few of the neighbors to show at full size, and make the other ones much smaller and fainter? Maybe we can make species_ (tagged?) neighbors smaller as they're less likely to be meaningful to the user? Maybe we can only show the top ~10 or so for each year, sorted by closest to the year dot (searched word), then expand those on hover?

danich1 commented 2 years ago

> Is there any way we can determine just a few of the neighbors to show at full size, and make the other ones much smaller and fainter? Maybe we can make species_ (tagged?) neighbors smaller as they're less likely to be meaningful to the user? Maybe we can only show the top ~10 or so for each year, sorted by closest to the year dot (searched word), then expand those on hover?

It's getting there. Well, this information isn't included, but each word has a similarity score, where a higher number means more similar. We could drop down to the 10 highest-scoring words and then adjust the font size so the visuals look a lot better. How does that sound as a solution?

vincerubinetti commented 2 years ago

It might be better if the backend could return the score along with the token/year/x/y/etc so I can filter it on the frontend, to allow for more flexibility. 10 might be too much or too little. Or in the future I might be able to determine the number dynamically somehow based on the crowdedness of the graph.
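
A minimal sketch of that frontend filtering, assuming each neighbor arrives with a numeric `score` alongside `token`, `year`, `x`, and `y`. The `NeighborRow` shape, the function name, and the sample values are hypothetical and are not the real API response.

```typescript
// Hypothetical response row; field names beyond those mentioned above are assumptions.
interface NeighborRow {
  token: string;
  year: number;
  score: number; // higher means more similar to the searched word
  x: number;
  y: number;
}

// Group neighbors by year and keep the `limit` highest-scoring rows in each group.
function topNeighborsPerYear(rows: NeighborRow[], limit: number): Record<number, NeighborRow[]> {
  const byYear: Record<number, NeighborRow[]> = {};
  for (const row of rows) {
    const group = byYear[row.year] ?? [];
    group.push(row);
    byYear[row.year] = group;
  }
  const result: Record<number, NeighborRow[]> = {};
  for (const yearKey of Object.keys(byYear)) {
    const year = Number(yearKey);
    const group = byYear[year] ?? [];
    // Copy before sorting so the input array is left untouched.
    result[year] = group.slice().sort((a, b) => b.score - a.score).slice(0, limit);
  }
  return result;
}

// Made-up sample data for illustration.
const sampleRows: NeighborRow[] = [
  { token: "outbreak", year: 2020, score: 0.71, x: 1, y: 2 },
  { token: "epidemic", year: 2020, score: 0.84, x: 2, y: 1 },
  { token: "lockdown", year: 2020, score: 0.55, x: 3, y: 3 },
  { token: "influenza", year: 2010, score: 0.62, x: 0, y: 0 },
];

const topTwo = topNeighborsPerYear(sampleRows, 2);
const tokens2020 = (topTwo[2020] ?? []).map((row) => row.token);
```

Keeping `limit` as a parameter leaves room to tune it, or later derive it from how crowded the graph is.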

danich1 commented 2 years ago

> It might be better if the backend could return the score along with the token/year/x/y/etc so I can filter it on the frontend, to allow for more flexibility. 10 might be too much or too little. Or in the future I might be able to determine the number dynamically somehow based on the crowdedness of the graph.

Tagging @falquaddoomi to see if you'd like to make this change.

cgreene commented 2 years ago

Ok - this is super wild as a visualization, but without being placed within the broader scale of words it looks like things are bouncing all over. Is it possible to have the umap coordinates used be the total bounding box of the umap space of all words/years?

vincerubinetti commented 2 years ago

I'm not sure I understand what you mean. Are you talking about the words expanding and going outside the bounds of the SVG and also overlapping other clusters of words, like this:

I could certainly update the bounding box of the whole SVG when things expand, but that might be kind of jarring. The over-expansion was actually less of a problem when I was treating the labels as circles instead of rectangles, but then there was more overlap in the words.

> Is it possible to have the umap coordinates used be the total bounding box of the umap space of all words/years?

This is already what is being done, basically. When you're not hovering, everything is in its proper coordinates, and the dimensions of the space match the min/max x/y of all the years and words. The expand-on-hover effect is just to push them apart so you can read them better.
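
To make "the dimensions of the space match the min/max x/y" concrete, here is a small sketch that derives the extent from a set of points and maps a point into pixel space. The type names, function names, and the 600 by 400 plot size are assumptions for illustration, not the chart code in this PR.

```typescript
interface Point2D {
  x: number;
  y: number;
}

interface PlotExtent {
  minX: number;
  maxX: number;
  minY: number;
  maxY: number;
}

// Extent of all received coordinates (assumes a non-empty array).
function computeExtent(points: Point2D[]): PlotExtent {
  const xs = points.map((point) => point.x);
  const ys = points.map((point) => point.y);
  return {
    minX: Math.min(...xs),
    maxX: Math.max(...xs),
    minY: Math.min(...ys),
    maxY: Math.max(...ys),
  };
}

// Linearly map a data point into a width-by-height pixel area.
function toPixels(point: Point2D, extent: PlotExtent, width: number, height: number): Point2D {
  const spanX = extent.maxX - extent.minX || 1; // avoid dividing by zero
  const spanY = extent.maxY - extent.minY || 1;
  return {
    x: ((point.x - extent.minX) / spanX) * width,
    // SVG y grows downward, so flip the vertical axis.
    y: height - ((point.y - extent.minY) / spanY) * height,
  };
}

// Made-up coordinates for illustration.
const samplePoints: Point2D[] = [
  { x: -2, y: 1 },
  { x: 4, y: 3 },
  { x: 1, y: -1 },
];

const localExtent = computeExtent(samplePoints);
const topRight = toPixels({ x: 4, y: 3 }, localExtent, 600, 400);
const bottomMiddle = toPixels({ x: 1, y: -1 }, localExtent, 600, 400);
```

With an extent computed this way, the outermost points always land on the edges of the plot, regardless of how large or small the underlying coordinate range is.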

cgreene commented 2 years ago

We're currently restricted to a zoomed-in view (the space over which this word is observed) of a much larger umap space (the space over which any word is observed). Without some of that broader context, the word is changing dramatically, while it might only be making small changes in the broader space. What we're missing is the thing in preprint similarity search where other preprints fall to provide context for this one.

cgreene commented 2 years ago

Oh - wait - I might have misread. If this is true:

> When you're not hovering, everything is in its proper coordinates, and the dimensions of the space match the min/max x/y of all the years and words

Then something is wrong with the underlying data.

cgreene commented 2 years ago

Ok - are the dimensions of the space coming from the immediate neighbors of this word (essentially its close in neighborhood), or is it from all words?

danich1 commented 2 years ago

> Ok - are the dimensions of the space coming from the immediate neighbors of this word (essentially its close in neighborhood), or is it from all words?

So the coordinates are generated from all neighbors to the query word. For example, pandemic gives 25 neighbors in each year and a UMAP model is trained on only those words (including pandemic). If I'm reading this correctly, using all words at once would make these erratic changes appear a whole lot smaller.

vincerubinetti commented 2 years ago

> Oh - wait - I might have misread. If this is true:
>
> > When you're not hovering, everything is in its proper coordinates, and the dimensions of the space match the min/max x/y of all the years and words
>
> Then something is wrong with the underlying data.

Here when I said "all years and words" I just meant all of the x/y coordinates I'm receiving from the backend, which I guess is just the local neighborhood in this case.

Casey, it sounds like you mean having the size of the space (i.e. the range) be determined from all words in the model? Wouldn't that just make the screenshots in my posts above appear as small blips in a sea of white space, unless you mean to also include some or all of the complete set of words in the model?

> using all words at once would make these erratic changes appear a whole lot smaller.

What are the "erratic changes"? If we're talking about the exploding that happens on hover, let's ignore that for this conversation; that has nothing to do with the data and is just a visual effect, so to speak. Or do you mean how, in the "pandemic" screenshot above, the trajectory arrow path is kind of "tangled"?

I'm very confused by all of this. Perhaps a quick Zoom meeting is in order?

cgreene commented 2 years ago

I mean the range being determined by all words in the model. I think they would appear as small blips in a sea of white space, but it would be a realistic representation of the amount that things change from year to year. This is not about the animation. Right now, it gives the perception that there's no consistency in what a word means from year to year.
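
The effect of the range choice can be put in numbers with a small sketch: the same year-to-year move is expressed as a fraction of the plotted range, once against a local extent and once against a much wider one. Both extents and the coordinates are made-up values for illustration; the real extent of the full model is not given in this thread.

```typescript
interface Point2D {
  x: number;
  y: number;
}

interface PlotExtent {
  minX: number;
  maxX: number;
  minY: number;
  maxY: number;
}

// Size of a move relative to the plotted range: each axis is normalized by
// the extent's span, then the two components are combined into one length.
function relativeShift(from: Point2D, to: Point2D, extent: PlotExtent): number {
  const dx = (to.x - from.x) / (extent.maxX - extent.minX);
  const dy = (to.y - from.y) / (extent.maxY - extent.minY);
  return Math.sqrt(dx * dx + dy * dy);
}

// Hypothetical positions of one word in two consecutive years.
const earlierYear: Point2D = { x: 0, y: 0 };
const laterYear: Point2D = { x: 3, y: 4 };

// Hypothetical extents: the word's own neighborhood vs. all words in the model.
const neighborhoodExtent: PlotExtent = { minX: 0, maxX: 3, minY: 0, maxY: 4 };
const allWordsExtent: PlotExtent = { minX: -30, maxX: 30, minY: -40, maxY: 40 };

const shiftInNeighborhood = relativeShift(earlierYear, laterYear, neighborhoodExtent);
const shiftInAllWords = relativeShift(earlierYear, laterYear, allWordsExtent);
```

Under these assumed numbers the move spans the whole plot when scaled to the neighborhood but only a small fraction of it when scaled to the wider range, which is the perceptual difference being described.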

vincerubinetti commented 2 years ago

Take a look at the new "trajectory" viz that we talked about:

This will need the backend to return the neighbor results for each year sorted by strength/score, because I can only show the top 5-10 (and even with just that, the figure is still quite busy).

I also refactored and polished some other stuff up in the most recent commit.

falquaddoomi commented 2 years ago

Hey @vincerubinetti, is there a particular example where the words aren't returned in order of decreasing score? I ask because it seems from my spot-checking that the method we're using to query for neighbors to the target word, KeyedVectors.most_similar(), already seems to return the results in order of decreasing similarity to the target. That said, there's nothing in the documentation that says it does return it in that order, and I can easily sort the neighbors per year by similarity manually, so if there's a query for which it's not returning the results in that order I'll implement it, otherwise it seems like it's already done.

(That, or I'm misunderstanding the ask entirely, so please correct me if that's the case.)

vincerubinetti commented 2 years ago

Ah I didn't realize it was already doing that. 👍 Maybe you could add a note in the Swagger docs or something.

If that's the case, maybe @danich1 would like it if the score was returned with the word and I can show it in the tooltip?

danich1 commented 2 years ago

> If that's the case, maybe @danich1 would like it if the score was returned with the word and I can show it in the tooltip?

Yeah, it'd be nice to incorporate that information.

> I ask because it seems from my spot-checking that the method we're using to query for neighbors to the target word, KeyedVectors.most_similar(), already seems to return the results in order of decreasing similarity to the target.

I can +1 the already-sorted return values. There isn't anything in the documentation that guarantees it's sorted, but every time I use the function it returns results in sorted order.

vincerubinetti commented 2 years ago

Ignore my previous comment, I didn't realize the score was already being returned. When did that get put in? I'll have that info show in the tooltip.

falquaddoomi commented 2 years ago

@vincerubinetti, @danich1: currently the tooltip shows 'Tagged/Not tagged' for entries that have a tag or don't, but do you think there might be utility in showing what the tag is in the tooltip if it's present? For me, knowing the ontology term is useful, but I don't know who the audience is for the site.

danich1 commented 2 years ago

> currently the tooltip shows 'Tagged/Not tagged' for entries that have a tag or don't, but do you think there might be utility in showing what the tag is in the tooltip if it's present?

This was mentioned in #39. The idea is to provide which tag the term is referring to and to have it link to a webpage that displays more information about the tag.

vincerubinetti commented 2 years ago

I was planning to add that in the next PR.

greenelab / word-lapse