jsoma / playfair-projects

Common repository of projects for Playfair

Analysing Movie Dialogues from the Cornell Corpus [Project3] #259

Closed HarshaDevulapalli closed 7 years ago

HarshaDevulapalli commented 8 years ago

draft1

The Cornell Corpus contains a large, metadata-rich collection of fictional conversations extracted from raw movie scripts. My pitch number is #109.

Data Source : https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

Story issue checklist

My pitch was (use the number): _

HarshaDevulapalli commented 8 years ago

Massaging the data took its own sweet time. This data did not have too much that could be visualised fancily, so I put together a lot of relatively interesting things as an infographic. It's still pretty chaotic, and I would like to organise it a little better. Comments please!

playfairbot commented 8 years ago

Hi there, I'm the Playfair Bot!

Thanks for posting your story issue, but would you mind editing the original issue to add the first draft of your image? You have my sincere apologies, but it's easier for dumb robots like me when the comments are only used for updates.

Thanks! :pray:

ghost commented 8 years ago

Yay this is very, very cool! The fruit of your labor is finally visualized. My comments/suggestions below.

  1. I agree with @jsoma on the color vs. size hierarchy for the horizontal bars. Pick an order of colours and stick with it to make it more readable.
  2. Dante having 537 lines means nothing if we don't know the context. Is that a LOT more than average? Wildly more? How can we tell? I would make that a visual, actually. Maybe a scatter plot with x = hours of movie and y = total lines, and show his dot as an outlier. You have a lot of words already.
  3. The grids on your histograms look awful, so I would either make them a lot tinier and really go for the grid, if that is your intention, or scrap them entirely.
  4. The titles of your histograms are inaccurate. Both the histograms just show the frequency of movies within given percentage ranges. Maybe change the titles to "Movies with low percentage of female dialogue more abundant" or something.
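On point 2, here is a rough sketch of how the context for one character's line count could be computed before charting it. The function name and the placeholder counts are made up for illustration and are not taken from the corpus; only Dante's 537 comes from the draft.

```python
from statistics import mean, median


def line_count_context(counts, name):
    """Compare one character's line count with everyone else's."""
    target = counts[name]
    others = [n for who, n in counts.items() if who != name]
    avg = mean(others)
    return {
        "lines": target,
        "mean_of_others": avg,
        "median_of_others": median(others),
        # How many times larger than the average of the other characters.
        "times_the_mean": target / avg,
        # Fraction of the other characters who have fewer lines.
        "share_with_fewer_lines": sum(n < target for n in others) / len(others),
    }


# Placeholder counts, NOT real corpus values (only 537 is from the draft).
example_counts = {"DANTE": 537, "A": 40, "B": 120, "C": 15, "D": 65}
context = line_count_context(example_counts, "DANTE")
print(f"{context['times_the_mean']:.1f}x the mean of the other characters")
```

With the real per-character counts plugged in, a figure like "N times the average" could go straight into the annotation next to his dot.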

In general, I'm not sure I would opt for the "one big cluster of infographics on a page" approach. Breaking them up into detailed visualizations that could be inserted into an article or viewed separately on a page might make more sense. I'd certainly like to see more detail on all of the charts!

Very cool work. Also post your code.

gcgruen commented 8 years ago

I would also think about the overall visual hierarchy of the whole composition. Every time you talk about this project, there are two aspects that you highlight to everyone (men have twice as much speech time, and women's most frequent terms are "Yes Sir"). So I would suggest making these bigger in relation to the other graphics.

Also (I like your fonts!) I would suggest putting in a thousands separator to make numbers more digestible (i.e. 6,020 instead of 6020). It visually highlights the difference from those which are "only" hundreds.
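If the labels are generated in Python (an assumption on my part, since the code has not been posted yet), the separator can come from the format spec rather than being typed by hand. The helper name here is just for illustration.

```python
def with_thousands(n):
    """Format an integer with a comma as the thousands separator."""
    return f"{n:,}"


# The two numbers mentioned in this thread.
labels = [with_thousands(n) for n in (6020, 537)]
print(labels)  # ['6,020', '537']
```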

Last but not least, I would also suggest putting your name on the whole thing ;)