publiclab / plots2

a collaborative knowledge-exchange platform in Rails; we welcome first-time contributors! :balloon:
https://publiclab.org
GNU General Public License v3.0

tag visualization of all tags #1502

Open ebarry opened 7 years ago

ebarry commented 7 years ago

This is a request for someone with access to editing special pages to add this visualization of tags from the beginning of time to November 2016 to the top of publiclab.org/tags

https://www.dropbox.com/s/s78g3ufhsav5xzo/plots_tag_graph_256_filtered.png?dl=0 plots_tag_graph_256_filtered

CC: @gretchengehrke @skilfullycurled

jywarren commented 7 years ago

Hi, Liz - i'm a bit reluctant to put a static graphic like this in our permanent codebase, but maybe a suggestion could be that we display a "feature" (like our banners) on the top of that page, and then admins could display whatever they want there. Would that work?

jywarren commented 7 years ago

It would go above or below this line: https://github.com/publiclab/plots2/blob/master/app/views/tag/index.html.erb#L4

And look like:

      <% cache('feature_tag-page-header') do %>
        <%= feature('tag-page-header') %>
      <% end %>

ebarry commented 7 years ago

Well, i don't so much want to decorate that page as i want to add "insight at a glance". A different point, but maybe relevant as to why i'd suggest adding a graphic visualization: this tag page still doesn't have any sorting capabilities to see "recent" or "popular", much less to see either of those by geography.

skilfullycurled commented 7 years ago

There are actually python gephi bindings which we could use to generate it dynamically. I'm actually working on a javascript network visualization right now, so let me see how that works out. If it goes well, then I can translate what I did into a python script which can generate the data structure to then be visualized in javascript.

jywarren commented 7 years ago

Hi, all - i think a generated graph would be great, and is something we could put in the permanent code.

@ebarry i'm not saying this is decoration and not content, i'm more saying this would go out of date quickly, and also our goal is to store /no/ content in our codebase -- only infrastructure. So this is just a way to implement it -- does my proposed solution sound OK?

re "this tag page still doesn't have any sorting capabilities to see 'recent' or 'popular' much less to see either of those by geography": I'd be happy to work with you to come up with some feature requests to get contributors building to solve this if it's a priority for you. Could be some easy first-timers-only issues if you can help get them in the queue!

ebarry commented 6 years ago

Let's go back to basics on this issue :) What is the goal of visualizing tags?

For me, visualizing tags is a way to visually depict associated tags, e.g. tags that appear together on the same content. For a great example, see the color-coded clusters in @skilfullycurled 's visualization above. Clustered tags are important because they visually connect the website's presentation of community activity closer to what the Public Lab community culturally refers to as "research areas", or perhaps "topics" --> this is my actual goal with this entire issue.

Here's some background information: on our tags page (https://publiclab.org/tags) we write "We use tags to group research by topic" and encourage people to browse tags (currently only sorted by recent activity). This is an important way that we name, link to, and/or promote people to find and engage with topics. The Dashboard itself emphasizes recent activity. The Dashboard now features a "recently used tags" bar -- which is an important but partial step to the goal of seeing "research areas" or "topics".

To move forward: I am not interested in navigating by a graphic tag visualization (so 2007!); however, the clusters of activity provide an important additional way of connecting/navigating to topics. The goal is for the tags page to show which are the most interconnected tags, to communicate the breadth of connected topics in a research area, to navigate/connect to a research area, and to subscribe appropriately. To achieve that, we do not necessarily need color-coded swooping arrows. Let's think about how to achieve these goals.

We might also consider mirroring publiclab.org/tags at publiclab.org/topics to make the language more accessible.

jywarren commented 6 years ago

Cool, thanks Liz!

To try for one stab at a narrower feature towards this goal, what if tag pages (floating new name: topic pages...!?!) had a list of "Related topics", something like:

Related topics: water runoff wetlands turbidity

Where "related" means that (acknowledging that there are different ways to measure this, and that we want some "computationally efficient" way) these are the tags which most commonly appear on pages that already have the primary tag. So for the topic onions, we tally every page tagged with onions and take the top, say, five.

Small follow-up if the above sounds good -- would it be all right to do this solely for the most recent 20-30 pages? Even if this is just a starting point, that would make this easier to implement without worrying about it causing overall website slowness. There could be more complex ways around this, but this is the easiest way to get started.
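For anyone picking this up, the "related topics" tally described above can be sketched in plain Ruby. This is only an illustration under stated assumptions: `Page`, `related_tags`, and the sample data are hypothetical stand-ins, not plots2 models, and the page list is assumed to be ordered newest first so that taking the first 30 matches gives the "most recent 20-30 pages" shortcut.

```ruby
# Hypothetical in-memory stand-in for pages and their tags (not a plots2 model).
Page = Struct.new(:id, :tags)

# Returns the tags that most often appear alongside `primary`, looking only at
# the `recent_limit` most recent pages that carry `primary`.
# Assumes `pages` is already ordered newest first.
def related_tags(pages, primary, top: 5, recent_limit: 30)
  recent = pages.select { |page| page.tags.include?(primary) }.first(recent_limit)
  counts = Hash.new(0)
  recent.each do |page|
    page.tags.uniq.each { |tag| counts[tag] += 1 unless tag == primary }
  end
  # Highest count first; ties are broken alphabetically so output is stable.
  counts.sort_by { |tag, count| [-count, tag] }.first(top).map(&:first)
end

# Made-up sample data, newest page first.
SAMPLE_PAGES = [
  Page.new(4, %w[water runoff turbidity]),
  Page.new(3, %w[water wetlands runoff]),
  Page.new(2, %w[air sensors]),
  Page.new(1, %w[water runoff])
].freeze

puts related_tags(SAMPLE_PAGES, 'water').inspect
```

Capping the scan with `recent_limit` keeps the work bounded no matter how popular a tag is, which is the point of the "most recent pages only" starting point.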

jywarren commented 6 years ago

I cross-posted at https://publiclab.org/questions/tommystyles/10-20-2017/need-your-feedback-on-tag-pages -- what do you think about moving discussion over there until there are specific discrete coding steps (mini projects for code contributors) we can make?

ebarry commented 6 years ago

ok great! let's go over to that discussion and come back once we have doable steps.

sagarpreet-chadha commented 6 years ago

@jywarren, @ebarry , is there any API (or maybe documentation) to know the 'edges' in the above graph ? I mean how are nodes connected ? Thanks 😄 !

skilfullycurled commented 6 years ago

Hey @sagarpreet-chadha!

The visualization is just an image so there's no API (yet! 😉); however, I can provide you with the list of edges from that particular graph. The most "raw" file formats would be csv and json. Both formats should work with a graph either "programmatically" (iGraph, networkx, d3.js) or with a GUI (Gephi, Cytoscape).

Apparently you can't upload files on github. I tried to upload them to the Public Lab research note but it's not working. @jywarren is there a way to upload files to a research note? If not, @sagarpreet-chadha, can you make a post in the plots-dev googlegroup (you can sign up here if you're not already)? Let's wait to see what @jywarren says because it would be great to have them directly in the research note.

Here's what you can look forward to though:

plots_tag_communities_edges_w_props_9_16.csv: list of unique edges with calculated properties, in particular the weight of the edge. The weight translates to the number of times the tags occurred together.

plots_tag_communities_nodes_w_props_9_16.csv: list of nodes with calculated properties. Most relevant to the image on the website is the "modularity class", which tells you to which community each node belongs.

plots_tag_communities_9_16.json: I don't find json as useful but I know some people prefer it. I think the json file also includes properties for the visualization that's on the website (i.e. RGB color of each node).

skilfullycurled commented 6 years ago

Update: removed plots_tag_communities_edgelist_9_16.csv from list of files above. This file is of limited use because the duplicate edges had already been merged into unique edges with weights. Without the properties, this edge list will only allow you to build a graph with edge weights of 1. I'll look for the original file with the duplicates.

sagarpreet-chadha commented 6 years ago

Thank you @skilfullycurled for your reply !

I was actually trying to build the visualization graph using a JavaScript library (d3.js or vis.js) so that it could be easily added to the publiclab.org website. These libraries require the data in this form for nodes:

nodes: [ { id: 1, shape: 'circle', label: 'Infrared' } ]

And for edges:

edges: [ { from: 1, to: 2 }, { from: 1, to: 3 } ]

Well json would be great otherwise i can create it , or maybe create a Javascript object directly (in this way no need of parsing the JSON file) .
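As a rough sketch of how the server side could emit that shape, the Ruby below turns a list of tag names and weighted tag pairs into the `nodes` / `edges` structure and serializes it with the standard `json` library. The helper name `graph_payload` and the sample tags are hypothetical. Carrying the weight in a `value` key is also an assumption about what the front-end library would read.

```ruby
require 'json'

# Builds { nodes: [...], edges: [...] } from tag names and weighted pairs.
# `weighted_pairs` maps [tag_a, tag_b] => weight. Ids are assigned in input
# order starting at 1. The `value` key for the weight is an assumption.
def graph_payload(tag_names, weighted_pairs)
  ids = tag_names.each_with_index.map { |name, index| [name, index + 1] }.to_h
  nodes = ids.map { |name, id| { id: id, shape: 'circle', label: name } }
  edges = weighted_pairs.map do |(from, to), weight|
    { from: ids.fetch(from), to: ids.fetch(to), value: weight }
  end
  { nodes: nodes, edges: edges }
end

# Made-up sample input.
SAMPLE_TAG_NAMES = %w[infrared balloon-mapping].freeze
SAMPLE_PAIRS = { %w[infrared balloon-mapping] => 2 }.freeze

puts JSON.pretty_generate(graph_payload(SAMPLE_TAG_NAMES, SAMPLE_PAIRS))
```

Using `fetch` makes an edge that points at an unknown tag fail loudly instead of silently producing a `nil` id.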

I have created a dummy graph (we can play with the nodes and the edges here 😄 ):

[screenshot of the dummy graph]

What do you think ? @ebarry , @jywarren , @skilfullycurled

skilfullycurled commented 6 years ago

Ah. That would be awesome! Okay. To further this conversation, we'll need to leave "API-land" and move into how the visualization in Gephi works and the best way to translate those features into javascript.

Can I trouble you to start this as a question? Something like, "How can I translate the tag visualization created in Gephi into a javascript version?"

Also, shoot me an email at benj.sugar@organizers.publiclab.org so I can share the files. I'll remove my email once you do.

jywarren commented 6 years ago

Actually i think we may not need to leave API-land -- the existing API is pretty robust these days. I'm curious @skilfullycurled how you generated those edges -- could they be generated fresh from a list of all tags and the nodes they've been used on? That is a reasonable query for us to generate, if cached.

We could add it to the API at https://github.com/publiclab/plots2/tree/master/app/api/srch and document it at https://github.com/publiclab/plots2/blob/master/doc/API.md

If it is enough data, the query could be something like:

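r = []
# For each tag, collect the ids of the nodes with status 1 that carry it.
Tag.select(:name, :tid).each do |t|
  nids = t.nodes.select(:nid, :status).where(status: 1).collect(&:nid)
  r << [t.name, nids] if nids.length > 0 # skip tags with no matching nodes
end
r # later, r.to_json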

I just ran that on production and it took about 15 seconds. If we cache that daily, I think it's manageable, and we might be able to improve it further.

jywarren commented 6 years ago

Also you can share files at http://gist.github.com -- could that work?

jywarren commented 6 years ago

So, using the JSON generated from my query, here's an excerpt:

["whitebalance", [12476, 13575]], ["wi", [12143, 13067]], ["wi-fi", [11123]], ["width-of-dvd-grating", [12838, 12875, 12895, 12899, 12902, 12926, 12990, 12991, 12995, 12999, 13006, 13014, 13019, 13037, 13046, 13057, 13062, 13069, 13077, 13088, 13089, 13094, 13103, 13117, 13125, 13131, 13133, 13136, 13152, 13154, 13157, 13159, 13169, 13178, 13181, 13183, 13188, 13226, 13248, 13283, 13302, 13305, 13308, 13315, 13316, 13340, 13349, 13355, 13366, 13401, 13402, 13409, 13414, 13423, 13429, 13432, 13434, 13437, 13439, 13440, 13443]], ["wiki", [9048, 10956]], ["wiki-gardening", [10956]], ["wild", [11707, 11711]], ["wildfires", [14803]], ["wildlife", [670]], ["wilkinson-bay", [220, 265, 280, 281, 282, 283, 284, 677]], ["wilkinsonbay", [606]], ["williamsburg", [10343, 10428, 10444]], ["willow", [9979]], ["wind", [9032, 10660, 12610, 13880, 14487, 14527, 14530, 14531, 14713, 14756]], ["wind-direction", [14527]], ["wind-sensor", [14713]], ["wind-speed-meter", [1962, 5837, 9032, 12103, 13064, 13165, 13231, 13880, 14527]], ["winder", [7717]], ["winders", [1900]], ["window", [147, 1759]], ["windows", [11434, 11677, 13037]], ["windows-7", [13037]], ["windows-7-ultimate", [13037]], ["windows-excel", [13037]], ["windspeed", [745]], ["windvane", [14527]], ["windy", [146]], ["wine", [706, 10955]], ["winter", [5161]], ["wintercamp", [5103]], ["wired", [10315]], ["wireframes", [10623]], ["wireless", [3908, 9940, 11123, 12175]], ["wisconsin", [10504, 10552, 10611, 10619, 11331, 11783, 12142, 12143, 12192, 12221, 12337, 12537, 12539, 12562, 12597, 12610, 12919, 13067, 13216, 13217, 13219, 13222, 13223, 13224, 13406, 13578, 13920, 13921, 13922, 14018, 14044, 14087, 14146, 14648]], ["with", [11772, 13742, 14728]], ["with:abdul", [13407, 13412, 13413, 13428, 13493]], ["with:adam-griffith", [11049]], ["with:amal", [12161]], ["with:amandaf", [11556]], ["with:amberwise", [12338, 13280]], ["with:ann", [12850]], ["with:basurama", [11699, 11705]], ["with:becki", [13571]], ["with:bronwen", [10952, 
12480, 13493, 14587]], ["with:bsugar", [13449]], ["with:btbonval", [11789]], ["with:cfastie", [11688, 13493, 13980]], ["with:chrisjob", [10464]], ["with:cindy_excites", [11566, 11567, 14537]], ["with:damarquis", [12338]], ["with:danbeavers", [11417, 11567]],

jywarren commented 6 years ago

FWIW there may be some even more efficient query, like this one. It's pretty decent, although it doesn't fully return what's above:

Tag.select('term_data.tid, term_data.name, community_tags.nid, community_tags.tid')
   .includes(:node_tag)
   .references(:node_tag)

Although this wouldn't tell us if the node was published (vs. spam) unless we also mixed node.status in there. But that's possible!

sagarpreet-chadha commented 6 years ago

Hi, i have just a few questions here:

  1. If 2 tags belong to the same node, do they have an edge between them?
  2. Are the different colors for different types of node, like questions, notes, research-notes, etc.?

Thank you 😄 !

sagarpreet-chadha commented 6 years ago

And i also agree with not leaving the API -land :)

skilfullycurled commented 6 years ago

Arg! Okay. Let's not pile on, please. No one wants to stay in API-land more than I do (well, perhaps with the exception of @ebarry ). In my understanding the building of API-land had all but been delayed indefinitely due to concerns over website sluggishness (see extension of conversation here). But now @jywarren is saying it isn't as big a deal anymore, so good times on that end.

Since using Github can be a barrier to accessible information (not everyone has access or knows how to use it), I think (er...thought) having conversations that aren't about "getting things done" in the codebase were better relegated to the website where everyone can learn from them. These aren't community norms I set (see @jywarren's own comment above) but I do think they are good ones.

jywarren commented 6 years ago

Oops, sorry @skilfullycurled I hadn't remembered your last comment on that thread -- https://publiclab.org/questions/tommystyles/10-20-2017/need-your-feedback-on-tag-pages#answer-556-comment-17709 -- where you suggested:

  1. only running on the top 250 tags
  2. caching weekly

I'll ping back over there, but I think that with all the work on the API, code cleanup and outreach, we could do a daily or weekly cached version of such a query, and be OK with 10-15 seconds total compute time per week. The rest would be run locally in the browser. Repeating this over there.

skilfullycurled commented 6 years ago

@jywarren I'll need to get back to you on some of your questions. I'll post my jupyter notebook later. In the meantime, see here for a brief explanation of how the graph is created from the tag pairs. For exact code, see here.

@sagarpreet-chadha (and anyone else who's interested) you can see how a d3.js graph was created from the tag data by checking out the repo for tagoverflow which was the inspiration for this project.

Regarding the community detection, if you look in the tagoverflow repository you'll find that the author implemented their own algorithm. Since that time, others have been implemented, such as jLouvain and netClustering, a CNM implementation (d3 example). With a limit of 256 tags, the community detection is probably fine in the browser.

jywarren commented 6 years ago

So as not to overwhelm the publiclab.org discussion with lots of data, here's a link to the format of data TagOverflow uses:

https://api.stackexchange.com/2.1/tags/python/related/?site=stackoverflow&key=of3hmyFapahonChi8EED6g((&pagesize=16

It makes like 15 calls to fetch what tags relate to a given tag (in the above example, "python")

jywarren commented 6 years ago

So the difference between that and the data I generated above is that my query lists the node ids, but hasn't used them to establish "relatedness". But of course @skilfullycurled's Jupyter notebook does this! Cool, thanks for sharing!
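To make the missing "relatedness" step concrete, here is a minimal Ruby sketch that turns `[tag_name, node_ids]` pairs (the shape of the query output above) into weighted edges, where the weight is the number of nodes two tags share. The method name `cooccurrence_edges` is hypothetical, and the sample rows are abbreviated from the excerpt in this thread. This is one straightforward way to do it, not necessarily how the notebook does.

```ruby
# Inverts [tag_name, node_ids] pairs into node_id => tags, then counts how
# many nodes each unordered pair of tags has in common.
def cooccurrence_edges(tag_nids)
  tags_by_node = Hash.new { |hash, nid| hash[nid] = [] }
  tag_nids.each do |name, nids|
    nids.uniq.each { |nid| tags_by_node[nid] << name }
  end

  weights = Hash.new(0)
  tags_by_node.each_value do |tags|
    # Sorting makes [a, b] and [b, a] land on the same key.
    tags.uniq.sort.combination(2) { |pair| weights[pair] += 1 }
  end
  weights
end

# Rows abbreviated from the JSON excerpt earlier in the thread.
SAMPLE_TAG_NIDS = [
  ['wind', [9032, 10660, 13880, 14527]],
  ['wind-speed-meter', [1962, 9032, 13880, 14527]],
  ['windvane', [14527]],
  ['wine', [706, 10955]]
].freeze

cooccurrence_edges(SAMPLE_TAG_NIDS).each do |(tag_a, tag_b), weight|
  puts "#{tag_a} -- #{tag_b}: #{weight}"
end
```

Going through the node-to-tags inversion means each node's tag list is paired up once, instead of comparing every tag against every other tag.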

skilfullycurled commented 6 years ago

@sagarpreet-chadha, I posted a question that asked and answered your questions above:

https://publiclab.org/questions/bsugar/01-25-2018/how-was-the-tag-graph-visualization-made

I'm not trying to be "passive aggressive" about my request, but I think people could benefit from this aspect of the conversation being public. So I guess that makes it "aggressive aggressive". ; )

All kidding aside, happy to answer any questions!

skilfullycurled commented 6 years ago

Hey everyone!

@sagarpreet-chadha, I put all of the files you'll need here:

https://spideroak.com/browse/share/skilfullyshared/plots-tag-graph

The folder comes with a readme file which explains the contents.

Please let me know when you have downloaded them so I can close the shareroom. Eventually, I'll post them to my github account for other people to have access to on the wiki.

Happy to answer any further questions you might have!

sagarpreet-chadha commented 6 years ago

Thank You @skilfullycurled ! I have downloaded the files :-)

skilfullycurled commented 6 years ago

No problem @sagarpreet-chadha!

PS: I left you a follow up thought back in the wiki question.

jywarren commented 5 years ago

Great update on ruby based tag relatedness calculations here: https://publiclab.org/questions/bsugar/01-25-2018/how-was-the-tag-graph-visualization-made

more soon!

jywarren commented 5 years ago

Some progress in https://github.com/publiclab/plots2/pull/4657, where I implemented an extremely basic, but live instance of Cytoscape.js (http://js.cytoscape.org/), running off of a weekly cached collection of

It took over 50 seconds to run for ALL tags on the site (which could be cached weekly) but that also generated 8200+ tags and 31k edges... which is a lot to graph. Here's the full set; i think it includes plenty of spam tags: https://gist.github.com/jywarren/4b1f9a032092a8187dd802a375fcb700

You can specify the # of tags you want to query like this: https://stable.publiclab.org/tag/graph.json?limit=10 (once fully published, https://publiclab.org/tag/graph.json?limit=10)

It's currently limited to 5 "edges" per tagname, representing the 5 tags that occur most often alongside the original tag.

This is now live on the stable test server (although this branch rebuilds pretty often so the URL isn't always online... ironically) here:

https://stable.publiclab.org/stats/graph?limit=75

The larger counts like limit=100 or 250 seem to be showing some kind of error and I have to chase that down a bit. But this is a pretty good start.

There are LOTS of configurations that can be added to refine this -- node size, link strength, much much more -- check out the gallery at http://js.cytoscape.org for some possibilities. And making "families" may be possible too, though I'd need a bit more input for that.

jywarren commented 5 years ago

image

jywarren commented 5 years ago

Ooh, https://stable.publiclab.org/stats/graph?limit=300 seems to work too

jywarren commented 5 years ago

Community detection here! https://github.com/upphiminn/jLouvain/blob/master/README.md

sagarpreet-chadha commented 5 years ago

@jywarren , Super cool !!!

jywarren commented 5 years ago

Also there are a range of clustering algorithms - these can be tested in the JavaScript console:

http://js.cytoscape.org/#collection/clustering

I'm not familiar with these but they all seem to use attributes of the nodes or edges to create clusters of similar elements. So, what should we give as attributes upon which to base similarity?

You can try these in the console using the examples in the docs, things like:

var clusters = cy.elements().hca({
  mode: 'threshold',
  threshold: 5,
  attributes: [
    function( node ){ return node.data('count'); }
  ]
});
clusters; // <= then inspect what this returns to see the clusters

jywarren commented 5 years ago

OK, using jlouvain I was able to add community detection: https://github.com/upphiminn/jLouvain

I don't have enough test data to see how this'll work but if #4679 passes, i'll merge it and we should be able to see it running with community detection at:

https://stable.publiclab.org/stats/graph?limit=101

(once it builds)

skilfullycurled commented 5 years ago

Hey everyone! Looking awesome. Sorry I haven't been able to reply, catching up on some things and will return to this later today.

In the meantime, another ingredient which I don't think I mentioned in any of my other posts is the layout. The one closest to what I used is probably the force layout. Technically it may have been something called force layout 2:

Force layout is sort of an annealing attraction/repulsion that reaches a steady state based on the parameters you set (i.e. the number of iterations, strength of attraction/repulsion). Here's a d3 demo.

As for the community detection and the edge weights, you have a few options, but if you want to recreate the tag graph this issue refers to, then you need co-occurrence, which, as fortune would have it, cytoscape has a function to make easier.

oe_ratio =  (all_questions_count * tag_count_AB) / (tag_count_A * tag_count_B)

Where tag_count_AB = edges.parallelEdges()

As it was, I first narrowed down the set of tags to some reasonable number (say, top 512), but then I narrowed down the tags I used for the visualization by only including the top n tags (maybe 64?) with an observed to expected ratio above 1.

You can read more from Tag Overflow. This method is one way to take care of the issue where an edge or node may be important but of low usage. For example, at a store 100 people might have an 85% probability of buying coffee and cream, but five of those people always purchase coffee, cream, and eggs. So I definitely want to keep 5 cartons of eggs in stock.
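A small Ruby sketch of the observed-to-expected filter may help here. It applies the `oe_ratio` formula quoted above to every tag pair and keeps the strongest pairs with a ratio above 1. The names `overrepresented_pairs`, `total_count` (standing in for `all_questions_count`), and the sample counts are all made up for illustration. Ranking pairs rather than individual tags is just one possible reading of the filtering step.

```ruby
# Observed-to-expected ratio for one tag pair, per the formula quoted above.
# Returns 0.0 when either tag is unused, to avoid dividing by zero.
def oe_ratio(total_count, count_a, count_b, count_ab)
  return 0.0 if count_a.zero? || count_b.zero?

  (total_count * count_ab).to_f / (count_a * count_b)
end

# Keeps up to `top` pairs whose ratio exceeds `min_ratio`, strongest first.
# `tag_counts` maps tag => usage count; `pair_counts` maps [a, b] => joint count.
def overrepresented_pairs(total_count, tag_counts, pair_counts, top: 64, min_ratio: 1.0)
  scored = pair_counts.map do |(a, b), count_ab|
    [[a, b], oe_ratio(total_count, tag_counts.fetch(a), tag_counts.fetch(b), count_ab)]
  end
  scored.select { |_pair, ratio| ratio > min_ratio }
        .sort_by { |pair, ratio| [-ratio, pair] }
        .first(top)
end

# Made-up counts: 100 pages in total.
SAMPLE_TAG_COUNTS = { 'water' => 40, 'turbidity' => 10, 'balloon' => 50 }.freeze
SAMPLE_PAIR_COUNTS = {
  %w[water turbidity] => 8,
  %w[water balloon] => 10
}.freeze

overrepresented_pairs(100, SAMPLE_TAG_COUNTS, SAMPLE_PAIR_COUNTS).each do |pair, ratio|
  puts "#{pair.join(' + ')}: #{ratio}"
end
```

In this sample, "water + balloon" co-occurs more often in absolute terms, but it is dropped because two such common tags would be expected to meet that often by chance. "water + turbidity" is kept because it appears together twice as often as chance would predict.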

An easy alternative is just to make the edge weight between two nodes the tag_count_AB and only take edges/nodes above a given threshold. Personally, I rarely get good results with this due to the reason above.

Regarding the other methods, you may be interested in pp. 3 (section 2.2) to 7 (section 3.1) of this paper (no math for these parts), which attempts to classify the different types of community detection methods. This has helped me to choose ones that provide the most salient results given how I've structured the graph and what I want to know from it. For example, communities of common social connections vs. communities based on how frequently messages are sent between two people.

jywarren commented 5 years ago

Working now on stable server!

https://stable.publiclab.org/stats/graph?limit=99

jywarren commented 5 years ago

[screenshot of the tag graph]

Here with the 99 top tags!

jywarren commented 5 years ago

It should be running on the live site by later tonight, but i wanted to note that "overuse" of tags by some users has skewed the graph in a way that we've recognized before. I believe one of the users has been moderated from the site, and I wondered if folks thought it appropriate to either delete those tags from the site or to at least omit them from the graph. Deleting them would be easier but we can also craft something to just obscure them. Preference, @ebarry @skilfullycurled ?

Still, this looks nice even though the settings on edge elasticity still need some tweaking, and maybe a different layout type would work better...

image

skilfullycurled commented 5 years ago

Yup! We have definitely encountered this problem. Unfortunately, the only thing to do was to remove that particular user as an outlier. Someone using that many tags may not be an outlier in and of itself but if they are creating tags that are so specific to themselves and using them over and over again, then it's not really capturing the data.

skilfullycurled commented 5 years ago

I think I even logged a github issue with a feature request that popped up a warning that would in essence say, "Whoaaaaaaaa, easy there fella! Looks like you've got yourself a lot of tags there, eh?".

skilfullycurled commented 5 years ago

Oh, PS. Looking awesome by the way!!

ebarry commented 5 years ago

AAAAAAHHHHHHHHMAYZINGGGGGGGGGG!!!!!!!!!!!! Yes to manually "remov[ing] that particular user as an outlier"

skilfullycurled commented 5 years ago

I just keep returning to this thread because of how awesome it is and thinking of things (hopefully tiny). Another thing you might consider filtering are the power tags (those are the ones with the colons, right?). I think as soon as the tag overuse issue is rectified, then we'll know more about layout.

Note to self: here's a link to a commit with the pages that are important to the implementation.

jywarren commented 5 years ago

Hi all, glad for the enthusiasm! I got sick but am recovering now and will work a bit on this on the flight home on Tuesday.

I did want to ask - my specific question is whether we should:

  1. actually delete tags from this moderated user, or
  2. try to preserve them but filter them out.

Filtering would be considerably more work both to code and for the database calls, but is possible.

skilfullycurled commented 5 years ago

In cases such as this one where an account has been made "inactive" due to moderation, then I think it's fine to just delete the tags from the database outright. Especially if you have a backup. Not because you might want to restore it, just because I have anxiety about losing data forever. It's not healthy, but cheap space is an unfortunate enabler. My feelings would be more complicated if this was an account that was made "inactive" by choice but we can discuss that another time (or now).

ebarry commented 5 years ago

Yeah this is a big topic to contemplate. After reviewing if there are tags that only this user has used (example: aries city-point), i found that actually there are very few tags completely isolated to this user (even purelab was originally used by Shan He about DIY water filtering, and research-notes was originally used on posts discussing the design of research notes on the website).

Since this user is moderated, can our tag visualization exclude all content from moderated users -- and by extension the tags used on that person's content -- without excluding that tag in general as it may be used on other people's content?

skilfullycurled commented 5 years ago

@ebarry, I should clarify (in case it wasn't clear).

When I said:

"delete the tags from the database outright"

I meant what you closed with:

"...[that] our tag visualization [will] exclude all content from moderated users -- and by extension the tags used on that person's content -- without excluding that tag in general [since] it may be used on other people's content..."

If the moderated user and Shan He both used the tag "purelab", "purelab" wouldn't be deleted, just any instance of the tag from the moderated user, or ITMUs, if you will.

The remaining question (if I'm understanding @jywarren) is whether to delete these ITMUs from the database entirely, or to keep them in the database but filter them out when all of the tags are requested for the visualization. Deleting them makes life much easier for those implementing the visualization, but there may be arguments for preserving them.
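For comparison, the "keep but filter" option could look roughly like the Ruby sketch below. Everything in it is hypothetical: the `Tagging` struct, the `MODERATED` status value, the node ids, and the author handles are made up for illustration and are not taken from the plots2 schema. The point is only that filtering drops individual tag instances by author, so a tag that others also use survives.

```ruby
# Hypothetical record of one tag applied to one node by one author.
Tagging = Struct.new(:tag, :nid, :author, :author_status)

# Assumed status value marking a moderated author (not from the codebase).
MODERATED = 0

# Drops every tag instance that came from a moderated author.
def visible_taggings(taggings)
  taggings.reject { |tagging| tagging.author_status == MODERATED }
end

# Counts remaining uses per tag.
def tag_counts(taggings)
  taggings.each_with_object(Hash.new(0)) { |tagging, counts| counts[tagging.tag] += 1 }
end

# Made-up sample data.
SAMPLE_TAGGINGS = [
  Tagging.new('purelab', 101, 'shan-he', 1),
  Tagging.new('purelab', 202, 'moderated-user', MODERATED),
  Tagging.new('purelab', 203, 'moderated-user', MODERATED),
  Tagging.new('aries', 202, 'moderated-user', MODERATED)
].freeze

puts tag_counts(visible_taggings(SAMPLE_TAGGINGS)).inspect
```

Here "purelab" stays in the graph with a single use, while "aries" disappears because only the moderated author used it.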

Personally, I think the former is okay when the user has been moderated because there is no chance that the content will ever return to the site. However, this might be different if a user chooses to delete their account based upon whether or not there is any functionality where they can reactivate it. I think we can leave that situation for another time but for the record I just wanted to say my judicial opinion is limited in scope.