thecdil / mini-symposium

mini symposium for CDIL fellows, held online May 2020

Text mining activity (day 3) #3

Open owikle opened 4 years ago

owikle commented 4 years ago

Choose one of the tools we talked about today and spend some time exploring a text of your choice. Maybe you'd like to analyze your own writing, one of the corpora provided today, or a particular novel that you find on one of the text repository sites that we've talked about.

After you've finished exploring, reply to this GitHub Issue with your thoughts on the following:

clamb89 commented 4 years ago

I used gutenberg.org to download a plain text file of an Emily Dickinson poetry collection and then uploaded the text (after doing some very basic cleaning) to Voyant.

jak487 commented 4 years ago

I also used Project Gutenberg to copy and then paste the full text of Marcel Proust's Swann's Way into Voyant. I chose Voyant in part because of the application's tagline and its promise to "see through your text." I wanted to pair this notion of textual transparency via quantification with a text that is notoriously difficult to quantify humanistically by virtue of its unusual syntax, complex use of metaphor, and nonlinear narrative (see Monty Python's "All-England Summarize Proust Competition").

dcnb commented 4 years ago

There's definitely beauty in that trend line. How strange and wonderful that is. 

I'm struck too by how these tools so often reflect back to us what we already know through some means (intuition, the form and its relation to others) but can't quite articulate. Both in the topic modelling examples and in Kindred Britain, we can often reflect, after the analysis or lineal line is presented, on the likelihood of that pattern emerging, but the DH tools help articulate and reinforce those discoveries. I'm fond of thinking of these tools, and DH generally, as an expansion of traditional humanistic inquiry, not a disruption or revolution -- the questions remain the same -- what is this about? what structures are at play here? what is it 'like' to be human? -- but the means and scale by which we ask them are changed.

Also thank you for the Monty Python. @owikle!

I also wonder what you would see if you compared whatever translation you used here against a different translation of Proust. The newer translations, of course, wouldn't be available. I think I read the Moncrieff/Kilmartin translation oh so many years ago (and just Swann's Way; I've never gotten past that book, although I started the second, I think). I've wanted to read the Lydia Davis translation, as I've heard it's good and I'm a fan of her stories.

Deep lunchtime thoughts! Thank you very much @jak487

Oh, and the stopword list you found is for NLTK, which is a Python library of techniques, programs, etc. for dealing with language. It's very widely used in more advanced applications that analyze text and language.
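To make the stopword idea concrete, here is a minimal sketch of what a stopword list does before words are counted. It does not call NLTK or Voyant; the tiny `STOPWORDS` set, the `tokenize` and `top_words` helpers, and the sample sentence are all made up for illustration, and a real list (such as NLTK's English one) is much longer.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption for this sketch);
# real lists contain well over a hundred words.
STOPWORDS = {"the", "a", "and", "of", "to", "in", "i", "is", "it", "that"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(text, stopwords=STOPWORDS, n=5):
    """Count the tokens that are not stopwords and return the n most frequent."""
    counts = Counter(t for t in tokenize(text) if t not in stopwords)
    return counts.most_common(n)

sample = "The sea and the sky and the sea again"
print(top_words(sample, n=2))
```

Without the filter, "the" and "and" would top the list; with it, the content words surface. Adding or removing a word from the set is the same move as editing the stopword list in Voyant.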

dcnb commented 4 years ago

@clamb89 Dickinson!

I want to see that in Voyant. And I'm wondering how the different presentations of her work would appear. I'm assuming anything you got off of Gutenberg would be the Higginson/Loomis Todd presentations. It'd also be interesting to see if the different line breaks in the more recent presentations (Franklin) would change the way the tool saw the poems. Probably not, I'm guessing, which is interesting in its own right.

I don't know if it was listed this week, but the Dickinson Archive -- here's "My life had stood a loaded gun" -- is one of the best presentations of poetry/archives online. To be able to see all the editions (you have to hit the text button on the right) as well as the manuscript -- it just makes so much sense in terms of design. There's another interface trying to present a ton of information on one page. Seems to be common with these.

I was also wondering if the hyphen (or em dash) would be interesting to examine in Voyant. You'd have to remove it from the stopword list. It might be particularly interesting to see that in the "contexts" visualization.
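Outside Voyant, the same question can be roughed out in a few lines. The sketch below collects a keyword-in-context style view of every dash. It assumes dashes appear as an em dash, an en dash, or a run of two or more hyphens (a common plain-text stand-in); `dash_contexts`, the `width` parameter, and the sample line are hypothetical, and the sample is invented rather than quoted from any poem.

```python
import re

def dash_contexts(text, width=20):
    """Return (left, dash, right) context triples for every dash in the text."""
    # Match an em dash, an en dash, or two or more hyphens in a row.
    # Single hyphens inside words (e.g. "made-up") are deliberately skipped.
    pattern = re.compile(r"\u2014|\u2013|-{2,}")
    flat = " ".join(text.split())  # collapse line breaks so contexts read cleanly
    results = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        results.append((left.strip(), m.group(), right.strip()))
    return results

sample = "A made-up line -- with a pause -- and an ending"
for left, dash, right in dash_contexts(sample):
    print(f"{left:>20} {dash} {right}")
```

Running this on two different editions of the same poems would show whether the editors kept, dropped, or replaced the dashes.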

And sorry you had an issue with the Internet Archive. Finding those formats can be challenging, especially because they put so many up there. Another case of TMI in the interface perhaps.

dcnb commented 4 years ago

Here's an article from the New Yorker a few years back on the Internet Archive. I consider them one of the most important cultural organizations on earth. That sounds crazy when I write it, but I stand by it. I'm guessing @owikle and @evanwill feel similarly. But we're librarians ...

samsonmatthews commented 4 years ago

I did the follow-up activity and then forgot to respond to this issue, sorry! I used archive.org to download a text file of Mary Shelley's Frankenstein. I used Visual Studio to get rid of chapter titles (sometimes referred to as 'letters') because I felt that these words would interfere with analysis of the text. I also decided to add 'letter' to the stop word list that Olivia provided, because that word is mentioned frequently in the section written as letters, and I felt it would skew my data (although looking back, maybe the inclusion of 'letters' could provide insight into the choice of syntax in structuring the first section as a collection of letters). This particular choice definitely had a heavy impact on the data that I received.
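For anyone who would rather script that cleanup than do it by hand in an editor, here is a minimal sketch. It assumes a heading sits alone on its line as "Chapter" or "Letter" followed by an Arabic or Roman numeral; real files vary, so the `HEADING` pattern, the `strip_headings` helper, and the sample text are illustrative assumptions, not a description of any particular edition.

```python
import re

# Assumed heading shape: a line holding only "Chapter" or "Letter" plus a number.
# Check this pattern against the actual file before trusting it.
HEADING = re.compile(r"^\s*(chapter|letter)\s+([0-9]+|[ivxlc]+)\.?\s*$", re.IGNORECASE)

def strip_headings(text):
    """Drop lines that look like chapter or letter headings; keep everything else."""
    kept = [line for line in text.splitlines() if not HEADING.match(line)]
    return "\n".join(kept)

sample = "Letter 1\nA first line of prose.\nChapter IV\nThe letter arrived late."
cleaned = strip_headings(sample)
print(cleaned)
```

Note that "letter" inside running prose survives, because the pattern only matches whole heading lines. Removing those remaining mentions is a separate decision, which is what adding the word to the stop word list does.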

I actually decided to explore both Voyant and JsLDA. I really like Voyant because it has a lot of visual learning things that are really cool to look at, but that also help me understand the relationships better (as a visual learner). Voyant proved to be quite the useful tool; there are so many different ways that analysis is conducted. I particularly liked the 'Dreamscape' tool, which connected locational words used in the text to geospatial mapping. Of course, it wasn't completely accurate, but it was an interesting visualization of the setting of the book, and where these fictional relationships took place. I think that Voyant is so vast in modes of visualization that it got a little overwhelming for me to use, and I had to switch gears for a while in order to keep myself on track. I'm not sure I love the user interface of Voyant; I wish I could get rid of screens and add others, to make it a little bit more customizable (maybe you can do this, and I'm just inexperienced).

JsLDA, and topic modeling in general, I find fascinating because it offers perspectives that you really wouldn't be able to consider otherwise. I ran 600 iterations of the modeling, and got some really interesting results. My favorite set of words was as follows: "sun room found eyes lay body great upon ground covered". I don't know why I never considered this before, but it could be interesting to examine Frankenstein not only as a work marking the origin of science fiction and a fixture of the canon, but also as one examining the environmental humanities, our own relationships to the environment around us (and our own bodies), and the way we cultivate communities. Of course these are themes of Frankenstein, but I never really extended these themes beyond conventional analysis. Thinking about and using text in this way definitely pushed my thought processes in really interesting directions. I think the most exciting thing about these tools for me is that they really are here for your own discovery, and offer new perspectives and ways of approaching text and analysis.
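For a sense of what an "iteration" means in topic modeling, here is a toy sketch of LDA fitted with collapsed Gibbs sampling, one common way to train such a model. This is not JsLDA's code: the function names, the `alpha` and `beta` values, and the four miniature "documents" are assumptions made up for illustration, and real runs use whole texts split into many chunks.

```python
import random

def lda_gibbs(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over a list of tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]      # tokens per topic in each doc
    topic_word = [dict() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                    # tokens per topic overall
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] = topic_word[z].get(w, 0) + 1
            topic_total[z] += 1
        assignments.append(zs)

    # One iteration = one pass that re-samples the topic of every token.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Take the token out of the counts before re-sampling it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Favor topics this document uses a lot and that use this word a lot.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] = topic_word[z].get(w, 0) + 1
                topic_total[z] += 1
    return topic_word, assignments

def top_words(topic_word, n=3):
    """Return up to n highest-count words for each topic."""
    return [
        [w for w, c in sorted(counts.items(), key=lambda kv: -kv[1])[:n] if c > 0]
        for counts in topic_word
    ]

docs = [
    "sun ground body eyes sun ground".split(),
    "letter ship ice letter ship voyage".split(),
    "body eyes ground sun".split(),
    "ice voyage ship letter".split(),
]
topic_word, assignments = lda_gibbs(docs)
print(top_words(topic_word))
```

Each topic is just a ranked list of words that tend to share documents, which is why a set like the one quoted above still needs a human reader to decide what, if anything, it means.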