learntextvis / code-samples

draft code to communicate ideas

Word Clouds (P3) #6

Open iros opened 8 years ago

tafsiri commented 8 years ago
arnicas commented 8 years ago

Background So Far...

The most famous criticism of word clouds: http://www.niemanlab.org/2011/10/word-clouds-considered-harmful/

But it doesn't mention some of the issues I see too:

It is true that the word cloud doesn't replace having a meta-analysis of what the words signify, like categories/sentiment etc. But they can still suck less as an exploratory tool, IMO, and they're not really going away. And in the multi-document case, they can be quite useful, I think.

Trial 1, in html_js/wc_clouds_bars.html: image

In both cases, the bar charts show that "think" is the second biggest word. You can see from the distributions in the bar charts that Democrats are more eager to think and Republicans are more eager to do than anything else (hah). Essentially, the bar chart info is useful, but it also takes up a lot of room like this. One idea I had was to put tiny distribution bar charts next to the word clouds to make the different sizes easier to see, but then you need two visuals, the cloud and the counts. (Ideas? A tooltip on the word cloud with a small graph indicating where the word sits in the distribution? Plus maybe context around it, concordance word-list style?)

Another possible design that's easy in CSS: size the words and order them, with a superscript giving the actual count (the superscripts here are colored by a scale too). image

The alphabetical ordering is ugly: image
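A rough sketch of how the sized-and-ordered list could be generated, in case it helps the discussion. This is not the code behind the screenshots: the function name `sized_word_list`, the whitespace tokenizer, and the pixel range are all assumptions made up for illustration, and the color scale on the superscripts is left out.

```python
from collections import Counter
from html import escape

def sized_word_list(text, top_n=10, min_px=12, max_px=48):
    """Return HTML spans for the top_n words, ordered by count.

    Each word gets a font size scaled linearly between min_px and max_px,
    plus a <sup> element carrying the raw count.
    Hypothetical helper: naive whitespace tokenizing, no stop words.
    """
    counts = Counter(text.lower().split())
    top = counts.most_common(top_n)  # sorted by count, biggest first
    if not top:
        return ""
    hi, lo = top[0][1], top[-1][1]
    span = max(hi - lo, 1)  # avoid dividing by zero when all counts match
    parts = []
    for word, n in top:
        size = min_px + (max_px - min_px) * (n - lo) / span
        parts.append(
            f'<span style="font-size:{size:.0f}px">{escape(word)}<sup>{n}</sup></span>'
        )
    return " ".join(parts)

html = sized_word_list("think think think do do want", top_n=3)
```

Sorting `top` alphabetically before the loop would give the alphabetical variant, with sizes unchanged.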

The combined list alphabetically: image

Back to regular layout word clouds...

Crappy-looking prototype of set behaviors for word clouds: see html_js/animate_wclouds.html.

Side by side: image

Set operations mockup (except the scales are wrong, so it looks unpacked) showing the words in the other word cloud that are NOT in this one: image
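For reference, the set operations behind a mockup like this are simple once each cloud is reduced to its vocabulary. A minimal sketch, assuming each cloud is just a list of words; `cloud_sets` and its keys are hypothetical names, not anything in animate_wclouds.html.

```python
def cloud_sets(words_a, words_b):
    """Split two vocabularies into the groups a set-style view needs."""
    a, b = set(words_a), set(words_b)
    return {
        "only_a": a - b,   # words in cloud A that cloud B lacks
        "only_b": b - a,   # words in cloud B that cloud A lacks
        "shared": a & b,   # words that appear in both clouds
    }

groups = cloud_sets(["think", "do", "taxes"], ["think", "do", "jobs"])
```

Each group could then be drawn in its own region or style, which is where the layout question (force layout or otherwise) comes in.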

An example from Drew Conway years ago that's like Nick's thing, I think http://drewconway.com/zia/2013/3/26/building-a-better-word-cloud: image

Yeah, this is from Nick's work on CompareCloud http://vialab.science.uoit.ca/textvis2015/papers/Diakopoulos-textvis2015.pdf - it does have a lot of stuff I like: image

This is very interesting on the set concept, but might be hard for people to get as is... http://bl.ocks.org/nitaku/8579a28a78ddd3391d6b image

You can see why I suggested maybe a force layout approach.

In a multi-document context, tf-idf word clouds are interesting, but they need different stop wording. These clouds got much more interesting after applying tf-idf to raise the importance of the words that differ most between speeches, instead of focusing attention on the common ones. (Candidate names and moderators had to be added to the stop word lists.) This is from wordclouds_with_shiffman_tfidf.html:

image
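To make the tf-idf step concrete, here is a small sketch of one common formulation (term frequency times log of inverse document frequency). It is an assumption that this matches what wordclouds_with_shiffman_tfidf.html does; the function name `tfidf_scores`, the whitespace tokenizer, and the sample stop word are made up for illustration.

```python
import math
from collections import Counter

def tfidf_scores(docs, stop_words=()):
    """Return one {word: tf-idf score} dict per document.

    tf  = count of the word / number of kept tokens in the document
    idf = log(number of documents / number of documents containing the word)
    Words in stop_words are dropped before scoring.
    """
    stop = set(stop_words)
    tokenized = [
        [w for w in doc.lower().split() if w not in stop] for doc in docs
    ]
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    n_docs = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        scores.append(
            {w: (c / total) * math.log(n_docs / df[w]) for w, c in tf.items()}
        )
    return scores

scores = tfidf_scores(
    ["we think taxes clinton", "we think jobs clinton"],
    stop_words=["clinton"],
)
```

With this formulation a word that appears in every document scores zero, which is why the shared vocabulary drops out of the clouds and the distinctive words take over.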

Btw, argument for small multiple word clouds in this paper/site: http://groups.inf.ed.ac.uk/cup/wordstorm/wordstorm.html but they use Wordle layouts with some important tweaks: image

FYI: An experimental thingy by Marian - wordwanderer.org image

That's the brain dump for now on what my thinking was. I guess some experiments and design ideas are needed here!

arnicas commented 8 years ago

Here's the one I left off from NYT, which is a good example of how to handle comparisons with overlap in vocab in a method that doesn't rely only on word size (bubble + color split): image

tafsiri commented 8 years ago

In taking word clouds in a more analytical direction, I like the ordered (by score/frequency) word cloud with the score as a superscript (or subscript), and I think the technique extends well to multiple documents. I find it harder to read when they are combined, but maybe that was the word ordering. I also think this layout is closer to what Many Eyes had (pre-Wordle).

image

The look of this reminded me a bit of wordcount.org by Jonathan Harris

image

A technique from the 'word storm' that I think could be interesting is using opacity/color to indicate the relative scores of words in the collection. We could start by just showing the tokens unique to each doc in a different way (bold, maybe).

arnicas commented 8 years ago

Analytic style with bigrams - I didn't filter out the right specific stopwords, but you can see how they display here, separated only by the superscript: image
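A minimal sketch of counting bigrams with a stop word filter, in case it's useful for comparing approaches. The function name `bigram_counts` and the rule of dropping any bigram that contains a stop word are assumptions, not necessarily how the screenshot was produced.

```python
from collections import Counter

def bigram_counts(text, stop_words=()):
    """Count adjacent word pairs, skipping pairs that contain a stop word."""
    stop = set(stop_words)
    tokens = text.lower().split()  # naive whitespace tokenizer
    pairs = zip(tokens, tokens[1:])
    return Counter(
        f"{a} {b}" for a, b in pairs if a not in stop and b not in stop
    )

counts = bigram_counts(
    "middle class families and middle class jobs", stop_words=["and"]
)
```

The resulting counts could feed the same sized list with superscripts, with each bigram treated as a single display item.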