Text mining activity (day 3)

owikle commented 4 years ago

Choose one of the tools we talked about today and spend some time exploring a text of your choice. Maybe you'd like to analyze your own writing, one of the corpora provided today, or a particular novel that you find on one of the text repository sites that we've talked about.

After you've finished exploring, reply to this GitHub Issue with your thoughts on the following:

What was your rationale for choosing the text analysis tool that you did?
What judgment calls did you have to make (such as choosing stopwords, number of topics produced, scope of collection, etc.), and how did they affect your use of your results as evidence?
Did you discover any insights? Limitations? Frustrations?

clamb89 commented 4 years ago

I used gutenberg.org to download a plain text file of an Emily Dickinson poetry collection and then uploaded the text (after doing some very basic cleaning) to Voyant.

I used Gutenberg because it was easier for me to access a plain text file of the work. I originally looked for books on archive but (because of my inexperience) was unable to figure out how to download a file that was as easy to work with. It seemed like access to the various corpora was more limited? I chose the text because I spent all of the fall semester reading ED poetry and reading about her life and work, so I thought it would be fun to see how the topics generated in Voyant might align with the topics/themes that surfaced in the class.
The work had a lengthy preface that I decided to remove because it was written by someone else and would detract from the results (I originally uploaded the entire text and the results were of course drastically different). For poetry, this was a pretty large collection, which I figured was necessary in order to draw accurate conclusions about the work.
My biggest frustration is lack of experience. As mentioned above, I couldn't figure out how to download plain text from archive... But the exercise overall provided lots of insight into the process/concept of text mining and its possibilities, especially because I chose text that I was familiar with. Poetry is fun to explore here (when there's a large enough corpus) because of its direct language. "Little," "night," "day," "face," "time," "heaven," "soul," and "death" are among the top words in this Emily Dickinson collection.
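For anyone who wants to repeat the two judgment calls above (dropping the preface, then counting top words) outside Voyant, here is a minimal Python sketch. The stopword list, the `POEMS` marker, and the sample text are all made up for illustration. This is not how Voyant does it, and a real run would need a fuller stopword list and whatever line actually ends the preface in your file.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption, not Voyant's list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "by"}


def drop_front_matter(text, marker):
    """Keep only the text after the first occurrence of marker.

    If the marker is not found, the text is returned unchanged.
    """
    _, found, rest = text.partition(marker)
    return rest if found else text


def top_words(text, stopwords=STOPWORDS, n=10):
    """Return the n most frequent lowercase words, skipping stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


# Made-up sample standing in for a downloaded plain text file.
sample = "PREFACE written by an editor. POEMS The soul and the night. The soul is little."
body = drop_front_matter(sample, "POEMS")
print(top_words(body, n=3))
```

Running `top_words` on `sample` and then on `body` shows how much the front matter can shift the list, which is the same effect described above.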

jak487 commented 4 years ago

I also used Project Gutenberg to copy and then paste the full text of Marcel Proust's Swann's Way into Voyant. I chose Voyant in part because of the application's tagline and its promise to "see through your text." I wanted to pair this notion of textual transparency via quantification with a text that is notoriously difficult to quantify humanistically by virtue of its unusual syntax, complex use of metaphor, and nonlinear narrative: Monty Python All England Summarize Proust Competition

I was drawn to Voyant by the number of data visualization tools. I thought it would be interesting to see simple graphic representations of such an involuted plot/narrative.
I ran it using Voyant's auto-detect option for stopwords though I did find this nifty stopword list on GitHub. One of the words I was interested in mapping is 'like' which seems like a fairly common stopword. Because Proust uses simile/metaphor almost continuously, I wanted to know the frequency of that word.
The word 'like' turned out to be the second most common word in Swann's Way behind 'Swann'. I knew Proust's sentences were long but 38 words per sentence is pretty astonishing, especially when compared to Thoreau's 29 per sentence. Another insight I found interesting and somewhat beautiful was how the trend lines of Swann and his love interest, Odette, rise and fall together, whereas the line for the word 'time' remains at a constant through the document. This suggests a formal commitment by the author to exploring character through the concept of time, rather than simply writing about character.
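The two numbers above (words per sentence and the count of 'like') can be roughly approximated in a few lines of Python. This is only a sketch under stated assumptions: the sentence splitter is naive, the sample text is made up, and Voyant's own tokenizer will likely give somewhat different figures.

```python
import re


def mean_sentence_length(text):
    """Average number of whitespace-separated words per sentence.

    Sentences are split naively after ., ! or ? followed by whitespace,
    so abbreviations such as "M. Swann" would be miscounted.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of word."""
    return len(re.findall(rf"\b{re.escape(word.lower())}\b", text.lower()))


# Made-up sample text, not a quotation from the novel.
sample = "It was like a dream. She smiled like the sun, and he looked away!"
print(mean_sentence_length(sample))
print(count_word(sample, "like"))
```

Note that 'like' only shows up in a frequency list if it is not on the stopword list in use, so it is worth checking the list before trusting a ranking.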

dcnb commented 4 years ago

There's definitely beauty in that trend line. How strange and wonderful that is.

I'm struck too by how these tools so often reflect back to us what we already know through some means (intuition, the form and its relation to others) but can't quite articulate. Both in the topic modelling examples and in Kindred Britain, we can often reflect, after the analysis or lineal line is presented, on the likelihood of that pattern emerging, but the DH tools help articulate and reinforce those discoveries. I'm fond of thinking of these tools, and DH generally, as an expansion of traditional humanistic inquiry, not a disruption or revolution -- the questions remain the same--what is this about? what structures are at play here? what is it 'like' to be human?--but the means and scale by which we ask them are changed.

Also thank you for the Monty Python. @owikle!

I also wonder what you would see if you compared whatever translation you used here against a different translation of Proust. The newer translations, of course, wouldn't be available. I think I read the Moncrieff/Kilmartin translation oh so many years ago (and just Swann's Way; I've never gotten past that book, although I started the second, I think). I've wanted to read the Lydia Davis translation, as I've heard it's good and I'm a fan of her stories.

Deep lunchtime thoughts! Thank you very much @jak487

Oh and the stopword list you found is for NLTK, which is a Python library of techniques, programs, etc. for dealing with language. It's very dominant in most advanced applications analyzing text/language.

dcnb commented 4 years ago

@clamb89 Dickinson!

I want to see that in Voyant, and I'm wondering how the different presentations of her work would appear differently. I'm assuming anything you got off of Gutenberg would be the Higginson/Loomis Todd presentations. It'd also be interesting to see if the different line breaks in the more recent presentations (Franklin) would change the way the tool saw the poems. Probably not, I'm guessing, which is interesting in its own right.

I don't know if it was listed this week, but the Dickinson Archive -- here's My life had stood a loaded gun -- is one of the best presentations of poetry/archives online. To be able to see all the editions (you have to hit the text button on the right) as well as the manuscript -- it just makes so much sense in terms of design. There's another interface trying to present a ton of information on one page. Seems to be common with these.

I was also wondering if the hyphen (or em dash) would be interesting to examine in Voyant. You'd have to remove it from the stopword list. It might be particularly interesting to see that in the "contexts" visualization.
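If the dash is hard to surface in Voyant, a rough keyword-in-context listing can be sketched in Python. The assumptions here are mine: that dashes appear either as an em/en dash character or as two or more hyphens in the plain text file, and the sample line is made up rather than quoted from a poem.

```python
import re

# Assumption: a "dash" is an en dash, an em dash, or a run of 2+ hyphens.
# Single hyphens inside words (e.g. "well-known") are deliberately ignored.
DASH = re.compile(r"--+|[\u2013\u2014]")


def dash_contexts(text, width=20):
    """Return (left, dash, right) snippets around every dash in text."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    return [
        (flat[max(0, m.start() - width):m.start()],
         m.group(),
         flat[m.end():m.end() + width])
        for m in DASH.finditer(flat)
    ]


# Made-up sample line for illustration only.
sample = "A bird -- a stone --\nand then silence."
for left, dash, right in dash_contexts(sample):
    print(f"{left!r} {dash} {right!r}")
```

Because line breaks are collapsed before matching, this sketch cannot tell a dash at the end of a line from one in the middle, so it would not capture the differences between editions' lineation.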

And sorry you had an issue with the Internet Archive. Finding those formats can be challenging, especially because they put so many up there. Another case of TMI in the interface perhaps.

dcnb commented 4 years ago

Here's an article from the New Yorker a few years back on the Internet Archive. I consider them one of the most important cultural organizations on earth. That sounds crazy when I write it, but I stand by it. I'm guessing @owikle and @evanwill feel similarly. But we're librarians ...

samsonmatthews commented 4 years ago

I did the follow-up activity and then forgot to respond to this issue, sorry! I used archive.org to download a text file of Mary Shelley's Frankenstein. I used Visual Studio to get rid of chapter titles (sometimes referred to as 'letters') because I felt that these words would interfere with analysis of the text. I also decided to add 'letter' to the stop word list that Olivia provided, because that word is mentioned frequently in the section written as letters, and I felt it would skew my data (although looking back, maybe the inclusion of 'letters' could provide insight into the choice of syntax in structuring the first section as a collection of letters). This choice definitely had a heavy impact on the data that I received.

I actually decided to explore both Voyant and JsLDA. I really like Voyant because it has a lot of visual learning things that are really cool to look at, but also help me understand the relationships better (as a visual learner). Voyant proved to be quite the useful tool; there are so many different ways that analysis is conducted. I particularly liked the 'Dreamscape' tool, which connected locational words used in the text to geospatial mapping. Of course, it wasn't completely accurate, but it was an interesting visualization of the setting of the book, and where these fictional relationships took place. I think that Voyant is so vast in modes of visualization that it got a little overwhelming for me to use, and I had to switch gears for a while in order to keep myself on track. I'm not sure I love the user interface of Voyant. I wish I could get rid of screens and add others, and make it a little bit more customizable (maybe you can do this, and I'm just inexperienced).

JsLDA, and topic modeling in general, I find fascinating because it offers perspectives that you really wouldn't be able to consider otherwise. I ran 600 iterations of the modeling, and got some really interesting results. My favorite set of words was as follows: "sun room found eyes lay body great upon ground covered". I don't know why I never considered this before, but it could be interesting to examine Frankenstein not only as a text hailing the origin of science fiction and of canon, but also as one examining environmental humanities, our own relationships to the environment around us (and our own bodies), and the way we cultivate communities. Of course these are themes of Frankenstein, but I never really extended these themes beyond conventional analysis. Thinking about and using text in this way definitely pushed my thought processes in really interesting directions. I think the most exciting thing about these tools for me is that they really are here for your own discovery, and offer new perspectives and ways of approaching text and analysis.
