code-samples repo and project notes
draft code to communicate ideas.
Please read the proposal doc for audience and intent: https://docs.google.com/document/d/1aLM0y56zCDUqUd6NRyiL0Nl9HPbKognBvK0aczcq3_4/edit?usp=sharing
(this is a copy that can be commented on)
Would recommend playing with Lexos if you haven't.
Notes & ToDo List
Goal: single doc analysis/vis but also comparative multi-document in as many cases as possible.
"Analysis" to offer
Counting Things
- characters (maybe; for punctuation in particular)
- words / lemmas
- sentences
- POS
- bigrams -- needed as part of the tokenization step
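A minimal sketch of the counts above, using only the Python standard library. The tokenizer and sentence splitter here are deliberately naive assumptions (a regex, not NLTK or SpaCy), and the function names are made up for illustration. POS counts are left out because they need one of the libraries discussed further down.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a very simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def count_things(text):
    """Return basic counts for a single document."""
    words = tokenize(text)
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "characters": Counter(text),
        "punctuation": Counter(ch for ch in text if ch in ".,;:!?\"'()-"),
        "words": Counter(words),
        # Adjacent word pairs; note these cross sentence boundaries.
        "bigrams": Counter(zip(words, words[1:])),
        "sentence_count": len(sentences),
    }

sample = "It is a truth universally acknowledged. It is a truth!"
counts = count_things(sample)
print(counts["words"].most_common(3))
print(counts["bigrams"][("it", "is")], counts["sentence_count"])
```

The same dictionary could be built per document and compared side by side for the multi-document case.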
StopWord Handling!
- We should always display the stop words used in the UI -- because they are a choice and carry consequences.
- Start simple with an inline variable, then file, then tool in browser to update them?
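A sketch of the "inline variable first, then file" idea. The starter list, the one-word-per-line file format, and the function names are all assumptions for illustration. The point is that the counting step hands back the stop words it used, so the UI can always display them.

```python
from collections import Counter

# Hypothetical starter list; a real one would be longer and editable.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "to"}

def load_stop_words(path):
    """Read one stop word per line from a plain-text file (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def count_without_stop_words(words, stop_words=STOP_WORDS):
    """Count words, skipping stop words, and report which list was applied."""
    kept = Counter(w for w in words if w not in stop_words)
    # Return the stop words with the counts so they can be shown in the UI.
    return {"counts": kept, "stop_words_used": sorted(stop_words)}

result = count_without_stop_words("it is the best of times".split())
print(result["counts"])
print(result["stop_words_used"])
```

Swapping in a file later only means passing `load_stop_words(path)` as the second argument.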
Vis using the analysis/counts above
-
structure of page/doc? -- is this interesting? might be a good newbie view.
- (does this make sense in multi-document cases?)
- option to highlight items "inline"
- Examples:
- Ben Fry's Darwin thing
- Art things on my pinterest
- My example in html_js/book_shape.html - which doesn't yet handle leading whitespace on lines (e.g., tabs, indents, poetry, etc)
-
count totals, count as percentage of whole
- bars
- word clouds and variants
- tf-idf for multi-document cases
- see examples like html_js/shiffman_tfidf.html, wordclouds_with_shiffman_tfidf.html
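For reference, a small tf-idf sketch for the multi-document case. It assumes one common variant (term frequency as a share of the document, idf as log(N / document frequency)); other variants exist, and this is not necessarily the formula the Shiffman JS examples use.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict mapping a document name to its list of tokens.
    Returns a dict mapping each name to {word: tf-idf score}."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores

docs = {
    "emma": "emma woodhouse handsome clever and rich".split(),
    "darwin": "species and varieties and natural selection".split(),
}
scores = tf_idf(docs)
print(sorted(scores["darwin"].items(), key=lambda kv: -kv[1])[:3])
```

A word that appears in every document (like "and" here) scores zero, which is why tf-idf sizing is more useful than raw counts when comparing documents.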
-
Word networks
-
Word Clouds: Let's investigate ways to make them less sucky/more tolerable.
- Get rid of random coloring of words; use color only as an indicator
- Idea: little bar charts beside wordclouds to show distribution of counts?
- Idea: ordered words with counts (mocked up)
- Extremely useful in small-multiple/multi-doc situations, but design issues.
- Tf-idf sizing is most interesting in that case. See:
- Show dynamic side-by-side and merged, with difference. Examples:
- merged: maybe a network style? also, that NYT bubble thing with circles.
- need a good design for combining overlapping words when their counts differ between documents
- Single doc cases can become small-multiple if we allow word-clouds of POS, chars, etc.
- Ideally perform bigram analysis first!
- choosing stop words is an iterative process alongside the word cloud displays
-
Timeseries
- show location of word in doc over time (concordance view)
- use a user-settable window size, count within each window, and show the counts over time
- examples are wordclouds per chapter in a book, in order
- who talks when in a debate / play
- the Tarantino obscenity chart in 538. We should be able to make that.
- Simple example in timeseries.html shows just words per chapter in order in Emma.
- Vis types over time - bar, line, even word cloud??
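The window-size idea above could be sketched like this. It assumes non-overlapping windows measured in words; chapters or speaker turns would work the same way with a different way of splitting. The function name and output shape are made up for illustration.

```python
def windowed_counts(words, targets, window_size=1000):
    """Count target words in consecutive, non-overlapping windows of words."""
    targets = set(targets)
    series = []
    for start in range(0, len(words), window_size):
        window = words[start:start + window_size]
        series.append({
            "start": start,  # index of the first word in this window
            "count": sum(1 for w in window if w in targets),
        })
    return series

words = "a b a c a b b b c a".split()
series = windowed_counts(words, {"a"}, window_size=5)
print(series)
```

The resulting list of start/count records is easy to dump as JSON and draw as bars or a line.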
-
Clustering docs
- see hclust code - actually, this is pretty bad in Python in that it requires a bunch of libs. R's is easier. (See my notebook code in the python dir using Pattern. It's easier to use scikit-learn but harder to explain to newbies. However, outputting the data from Pattern is awkward. Maybe NLTK is the best approach; I think you can save out numpy arrays easily?)
- R examples: https://eight2late.wordpress.com/2015/07/22/a-gentle-introduction-to-cluster-analysis-using-r/, mine in: R/tm_clustering_example.R
- tree view for first version? see output from R networkD3 dendrogram in html_js/networkD3_hclust_output_from_R.html
- Ideally the tree clustering allows collapsible nodes for large trees, and customizable labels on the edges.
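To make the cluster-output-to-tree step concrete, here is a naive pure-Python sketch. It is a stand-in, not the hclust code or the R/networkD3 output mentioned above: it assumes cosine distance on raw word counts and average linkage, and it emits a nested name/children dictionary (a shape assumed here because d3's hierarchy layouts read `children` by default). It is quadratic or worse, so only suitable for a handful of documents.

```python
import json
import math
from collections import Counter

def cosine_distance(a, b):
    """1 - cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0

def cluster_to_tree(docs):
    """docs: dict name -> list of tokens (at least one doc). Returns a nested dict."""
    vectors = {name: Counter(tokens) for name, tokens in docs.items()}
    # Each cluster is (tree node, list of member document names).
    clusters = [({"name": name}, [name]) for name in docs]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i][1] for b in clusters[j][1]]
                # Average linkage: mean distance over all cross-cluster pairs.
                dist = sum(cosine_distance(vectors[a], vectors[b])
                           for a, b in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        node = {"name": "", "distance": round(dist, 3),
                "children": [clusters[i][0], clusters[j][0]]}
        members = clusters[i][1] + clusters[j][1]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((node, members))
    return clusters[0][0]

docs = {
    "a": "cat dog cat".split(),
    "b": "cat dog dog".split(),
    "c": "stock market bond".split(),
}
tree = cluster_to_tree(docs)
print(json.dumps(tree, indent=2))
```

The `distance` field on each merge node is one candidate for the customizable edge labels.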
How about a Structure Like This for the Site...
1) Getting Setup:
- Options for R, Python, Pure JS
- (Explain cons of just in-browser JS)
2) Shapes of texts
3) Concordance-views (simple search) -- keywords in context
4) Tokenization, Simple Counts, stop words
[We explain that fancy tokenization happens in the Python/R scripts; simple space-separated splitting can be done in JS.]
Word networks come here - and bigrams/n-grams.
5) Parts of speech, lemmatization - show how counts change, what POS gets you.
6) Word clouds -- this basically sets us up for doing tf-idf, because simple counts are poor for comparing documents and tf-idf is better
- includes a variety of word cloud types -- bubble/networks, regular, maybe my ordered count css version
7) Time Series - breaking a text into sections, using multiple texts that have time ordering
- Maybe simple sentiment too, via the polarity word lists I added to the repo.
8) Clustering (if we get to it, or I can add it later, if you help with the cluster-output-to-tree structure for js)
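As a taste of how small item 3 (keywords in context) can be, here is a rough sketch. It assumes the text is already split into words and matches case-insensitively; the function name and the tuple output are made up for illustration.

```python
def keyword_in_context(words, keyword, width=3):
    """Return (left context, matched word, right context) for each occurrence."""
    hits = []
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, w, right))
    return hits

words = "Emma Woodhouse was handsome and Emma was clever".split()
hits = keyword_in_context(words, "emma", width=2)
for left, word, right in hits:
    print(f"{left:>20} | {word} | {right}")
```

The loop index `i` is also the word's position in the document, which is what the time-series views need.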
Languages for Discussion
Discuss: Scripts for pre-processing (python/r) and/or in-page analysis with js.
JS
- RiTa for POS: http://www.ghostweather.com/files/image_replacement/; I haven't experimented with all of its capabilities yet. Not sure I believe it can be as good as NLTK/SpaCy and R's tm/NLP/SnowballC etc.
- shiffman's tf-idf in js (used in some of my examples)
- wordcount.js
- that Natural node/js lib had issues last time I used it, but Shiffman was issuing merge requests against it...
- I included some dude's textAnalysisSuite in the html_js/js dir, but I don't understand its tf-idf. The bigram thing looks interesting/useful.
- I don't really think the js tools are as good yet or as complete; and for sizable projects that would be slow in browser, would we give node instructions? (Argh)
Python
- Note that python stuff generally needs downloads. SpaCy and NLTK both do.
- I can make command-line scripts that prep data as JSON if we want that... My example script in preprocess_files.py was for another class, and I used it here to get POS for the word cloud experiments
- Need to document and test the install-and-run instructions well for a newbie
- Requires command line expertise
- SpaCy might be a good lib to use; I have used it for word2vec-related things but haven't carefully compared it with NLTK/Pattern. It won't do tf-idf for us.
R
- Is this simplest actually? scripts that can be run from RStudio or command line?
- Still requires certain packages
- My R is rusty but Jim's is good I bet
Discuss: Should we have some kind of consistent format like JSON for output from the scripts? Or csv, which is easier for newbies. Configuration settings could be in a simple JSON file separate from the data files.
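One way the CSV-for-data, JSON-for-settings split could look, as a sketch only. The config keys and the file name in it are hypothetical, not settings any script in the repo actually reads, and the output goes to an in-memory buffer here rather than a real file.

```python
import csv
import io
import json
from collections import Counter

def counts_to_csv(counts, out):
    """Write word counts as two-column CSV, most frequent first."""
    writer = csv.writer(out)
    writer.writerow(["word", "count"])
    for word, count in counts.most_common():
        writer.writerow([word, count])

# Hypothetical settings, kept apart from the data files.
config = {"stop_words_file": "stopwords.txt", "lowercase": True, "window_size": 1000}

counts = Counter("the cat and the hat".split())
buf = io.StringIO()  # stands in for an open file
counts_to_csv(counts, buf)
csv_text = buf.getvalue()
config_text = json.dumps(config, indent=2)

print(csv_text)
print(config_text)
```

The CSV opens directly in a spreadsheet for newbies, while nested results such as trees or per-window series would still need JSON.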
ToDo: Compare the accuracy and quality of the results in the 3 languages and multiple libraries in Python/R. There are a lot. (I can do that.)
Data Sets ToDo's
- Need to add a play and/or script
- Poetry examples
- Maybe recipes? some very different genre...