Word Clouds (P3) - Githubissues

tafsiri commented 8 years ago

Requirements
- Brief Description
- Show words scaled by some metric in a word-cloud-like layout.
- Clarifications Required
- [ ] What metrics will be supported on the analysis side for input into this? (irrespective of how that is actually pushed to the vis—e.g. we may normalize all scores for this vis into a particular range).
- [ ] What differences from classic word clouds will we incorporate?
- [ ] What, if any, interactive controls are provided?
- [ ] What, if any, multi-document requirements are there?
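
On the normalization mentioned in the first clarification, here is a minimal sketch of mapping arbitrary analysis scores into one display range, assuming plain min-max scaling. The name `scale_scores` and the 10 to 48 px range are illustrative assumptions, not anything decided in this thread.

```python
def scale_scores(scores, lo=10.0, hi=48.0):
    """Linearly map raw scores onto [lo, hi], e.g. font sizes in px.

    `scores` is a dict of word -> numeric score (counts, tf-idf, ...).
    """
    if not scores:
        return {}
    smin, smax = min(scores.values()), max(scores.values())
    span = smax - smin
    if span == 0:
        # Every word has the same score: place them all mid-range.
        return {word: (lo + hi) / 2 for word in scores}
    return {
        word: lo + (score - smin) / span * (hi - lo)
        for word, score in scores.items()
    }
```

Because the output range is fixed, the vis would not need to know which metric produced the scores; a log or square-root scale could be swapped in if linear sizes exaggerate the top word.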

arnicas commented 8 years ago

Background So Far...

The most famous criticism of word clouds: http://www.niemanlab.org/2011/10/word-clouds-considered-harmful/

But it doesn't mention some of the issues I see too:

- arbitrary ordering (makes it hard to see diffs)
- size differences hard to make out (which one is second biggest?)
- random colors that alter your reading of what's "important" (could be a useful signifier in a statsy context)
- ngrams could be used for phrases to reduce the broken entities problem
- stop words often need tuning (but need to be reported in the UI imo - at least the "important" ones)
- lemmatizing makes for better count bins, but lemmas look bad in word clouds
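
A rough sketch of the ngram and stop word points above, assuming a naive regex tokenizer and a tiny made-up stop list (`STOP_WORDS` is hypothetical). It counts unigrams and bigrams and also returns what the stop list removed, so a UI could report it.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration; a real one needs tuning per corpus.
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase and split on anything that is not a letter or apostrophe."""
    return re.findall(r"[a-z']+", text.lower())


def count_terms(text, stop_words=STOP_WORDS):
    """Return (unigram counts, bigram counts, removed stop word counts)."""
    tokens = tokenize(text)
    removed = Counter(t for t in tokens if t in stop_words)
    unigrams = Counter(t for t in tokens if t not in stop_words)
    # Bigrams come from adjacent tokens in the original text; a pair is
    # dropped if either side is a stop word.
    bigrams = Counter(
        f"{a} {b}"
        for a, b in zip(tokens, tokens[1:])
        if a not in stop_words and b not in stop_words
    )
    return unigrams, bigrams, removed
```

Keeping the `removed` counter is the part that addresses the reporting concern: the most frequent dropped words can be listed next to the cloud.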

It is true that the word cloud doesn't replace having a meta-analysis of what the words signify, like categories/sentiment etc. But they can still suck less as an exploratory tool, IMO, and they're not really going away. And in the multi-document case, they can be quite useful, I think.

Trial 1, in html_js/wc_clouds_bars.html:

In both cases, the bar charts show that "think" is the second biggest word. You can see from the distributions in the bar charts that Democrats are more eager to think and Republicans are more eager to do than anything else (hah). Essentially, the bar chart info is useful, but it also takes up a lot of room like this. One idea I had was to make tiny distribution barcharts next to wordclouds to make it easier to see the different sizes, but then you need two visuals, the cloud and the counts. (Ideas? Tooltip on wordcloud with a small graph indicating where the word is in the distribution? Plus maybe context around it in concordance wordlist style?)

This is a possible other design that's easy in CSS - size words and order them, with superscript to give the actual count (superscripts here are colored by a scale too). The alphabetical ordering is ugly:
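
A minimal sketch of that sized-and-ordered layout, generating the HTML from Python instead of the CSS/JS prototype. The function name, pixel range, and `order` parameter are assumptions for illustration, and the color scale on the superscripts is left out.

```python
from html import escape


def analytic_cloud_html(counts, order="count", min_px=12, max_px=40):
    """Render word -> count as sized <span>s with the count in a <sup>.

    order="count" sorts by descending count (ties alphabetical);
    order="alpha" sorts alphabetically.
    """
    if not counts:
        return ""
    low, top = min(counts.values()), max(counts.values())
    span = (top - low) or 1  # avoid dividing by zero when all counts match
    if order == "alpha":
        items = sorted(counts.items())
    else:
        items = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    parts = []
    for word, n in items:
        px = min_px + (n - low) / span * (max_px - min_px)
        parts.append(
            f'<span style="font-size:{px:.0f}px">'
            f"{escape(word)}<sup>{n}</sup></span>"
        )
    return " ".join(parts)
```

Switching `order` makes it easy to compare the alphabetical version against a count-ordered one on the same data.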

The combined list alphabetically:

Back to regular layout word clouds...

Crappy-looking word cloud prototype of set behaviors - see html_js/animate_wclouds.html.

Side by side: Set operations mockup (except scales are wrong so it looks unpacked) showing the words NOT in the wordcloud that are in the other one:
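
The set comparison behind a mockup like this can be sketched in a few lines, assuming each cloud starts from a word -> count dict. `compare_vocab` is a hypothetical helper, not code from the prototype.

```python
def compare_vocab(counts_a, counts_b):
    """Split two vocabularies into words unique to each and words shared."""
    a, b = set(counts_a), set(counts_b)
    return {
        "only_a": a - b,   # in cloud A, missing from cloud B
        "only_b": b - a,   # in cloud B, missing from cloud A
        "shared": a & b,   # in both clouds
    }
```

The `only_b` set is what the A-side cloud would display as "words NOT in this cloud that are in the other one".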

An example from Drew Conway years ago that's like Nick's thing, I think http://drewconway.com/zia/2013/3/26/building-a-better-word-cloud:

Yeah, this is from Nick's work on CompareCloud (http://vialab.science.uoit.ca/textvis2015/papers/Diakopoulos-textvis2015.pdf) - it does have a lot of stuff I like:

This is very interesting on the set concept, but might be hard for people to get as is... http://bl.ocks.org/nitaku/8579a28a78ddd3391d6b

You can see why I suggested maybe a force layout approach.

In a multi-document context, tfidf word clouds are interesting, but need different stopwording - these clouds got much more interesting after doing tfidf to raise the importance of the very different words in each speech, instead of focusing attention on the common ones. (Candidate names and moderators had to go out in the stop words lists.) This is from wordclouds_with_shiffman_tfidf.html:
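
For reference, a minimal pure-Python sketch of the tf-idf weighting described here, assuming the unsmoothed textbook form (term frequency times log of N over document frequency). This is not necessarily the variant used in wordclouds_with_shiffman_tfidf.html; libraries often use smoothed versions. The function and argument names are illustrative.

```python
import math
from collections import Counter


def tfidf(docs, stop_words=frozenset()):
    """Score words per document.

    docs: dict of doc name -> list of tokens.
    Returns dict of doc name -> {word: tf-idf score}.
    """
    counts = {
        name: Counter(t for t in tokens if t not in stop_words)
        for name, tokens in docs.items()
    }
    n_docs = len(counts)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    scores = {}
    for name, c in counts.items():
        total = sum(c.values()) or 1
        scores[name] = {
            word: (n / total) * math.log(n_docs / df[word])
            for word, n in c.items()
        }
    return scores
```

With this form, a word that appears in every document scores zero, which is why the common vocabulary drops out of the clouds. Names that are frequent in only some speeches still score highly, hence the extra stop words.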

Btw, argument for small multiple word clouds in this paper/site: http://groups.inf.ed.ac.uk/cup/wordstorm/wordstorm.html but they use Wordle layouts with some important tweaks:

FYI: An experimental thingy by Marian - wordwanderer.org

That's the brain dump for now on what my thinking was. I guess some experiments and design ideas are needed here!

arnicas commented 8 years ago

Here's the one I left off from NYT, which is a good example of how to handle comparisons with overlap in vocab in a method that doesn't rely only on word size (bubble + color split):

tafsiri commented 8 years ago

In taking word clouds in a more analytical direction, I like the ordered (by score/frequency) word cloud with the score as a superscript (or subscript), and I think the technique extends well to multiple documents. I find it harder to read when they are combined, but maybe that was the word ordering. I also think this layout is closer to what Many Eyes had (pre-Wordle).

The look of this reminded me a bit of wordcount.org by Jonathan Harris

A technique from the 'word storm' work that I think could be interesting is using opacity/color to indicate relative scores of words in the collection. We could start by just showing the unique tokens in each doc in a different way (bold, maybe).
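
One possible reading of that idea, as a sketch: opacity is a word's count in a doc relative to its highest count anywhere in the collection, and a word is bold when it appears in only one doc. Both rules are assumptions for illustration, and `style_tokens` is a hypothetical name.

```python
def style_tokens(doc_counts):
    """doc_counts: dict of doc name -> {word: positive count}.

    Returns doc name -> {word: {"opacity": float in (0, 1], "bold": bool}}.
    """
    peak = {}     # highest count of each word in any single doc
    seen_in = {}  # number of docs containing each word
    for counts in doc_counts.values():
        for word, n in counts.items():
            peak[word] = max(peak.get(word, 0), n)
            seen_in[word] = seen_in.get(word, 0) + 1
    return {
        doc: {
            word: {"opacity": n / peak[word], "bold": seen_in[word] == 1}
            for word, n in counts.items()
        }
        for doc, counts in doc_counts.items()
    }
```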

arnicas commented 8 years ago

Analytic style with bigrams - I didn't filter out the right specific stopwords, but you can see how they display here, separated by the superscript only:

learntextvis / code-samples

Word Clouds (P3) #6