newtfire / introDH-Hub

shared repo for DIGIT 100: Introduction to Digital Humanities class at Penn State Erie, The Behrend College
https://newtfire.github.io/introDH-Hub/
Creative Commons Zero v1.0 Universal

Mystery Text Discussion: web.txt #100

Open ebeshero opened 5 months ago

ebeshero commented 5 months ago

Post your screenshots and discuss your findings about web.txt here!

ImStin commented 5 months ago

What ngram sizes give you frequency counts above 5?

When I search for ngrams at size 5 and lower, I start to get frequencies above 5.

What ngram sizes give you frequency counts in the double digits?

My first double-digit frequency count appears at size 4, and it is for a single phrase: "i don t like".
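For anyone who wants to reproduce these counts outside the class tools, here is a minimal sketch of ngram counting in Python. It assumes the text is lowercased and split on anything that is not a letter, which is a guess based on "don't" showing up as "don t" in the results; the function names and the sample sentence are made up for illustration and are not from web.txt.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase, then split on every non-letter,
    # so "don't" becomes the two tokens "don" and "t".
    return re.findall(r"[a-z]+", text.lower())


def ngram_counts(tokens, n):
    # Count every run of n consecutive tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def max_frequency_by_size(tokens, sizes):
    # For each ngram size, report the highest frequency found.
    result = {}
    for n in sizes:
        counts = ngram_counts(tokens, n)
        result[n] = counts.most_common(1)[0][1] if counts else 0
    return result


# Hypothetical sample text, not taken from web.txt.
sample = "I don't like it. I don't like him. I don't think so."
tokens = tokenize(sample)
print(ngram_counts(tokens, 3).most_common(1))
print(max_frequency_by_size(tokens, [3, 4]))
```

Running `max_frequency_by_size` over sizes 1 through 6 on the real file would show at which size the top count first drops below 10 or below 5.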

What phrases get repeated a lot in this text?

What intrigues me most is that the more distinctive phrases that get repeated a lot in this text contain the word "not" in one form or another. To see this, I used an ngram size of 3 to filter out obvious grams such as "of the".

[image]

The pattern becomes more apparent when I move up to size 4, where "not" also shows up in the top 5 ngrams. My first thought about the text is that the author expresses a good deal of negative thinking about ideas or situations. There also seems to be a lot of disagreement or rejection within the text.

[image]

I then moved into KWIC for the size-3 ngram "i don t" to see how the author uses it throughout the text. The phrase "I don't like", which was the most frequent size-4 ngram, shows up here, but now we can see more. Phrases like "I don't think", "I don't care", and "I don't want" also appear frequently, which seems to support the negative tone that this phrase gives the text.

[image]
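A KWIC view like this one can be approximated in a few lines of Python. This is only a sketch under the same tokenization assumption as the ngram results (lowercase, split on non-letters); `kwic`, its `window` parameter, and the sample sentence are hypothetical and are not how the class tool is implemented.

```python
import re


def tokenize(text):
    # Assumption: lowercase and split on non-letters ("don't" -> "don", "t").
    return re.findall(r"[a-z]+", text.lower())


def kwic(tokens, phrase, window=3):
    # Return (left context, keyword phrase, right context) for every match.
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return []
    rows = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + n:i + n + window])
            rows.append((left, " ".join(target), right))
    return rows


# Hypothetical sample text, not taken from web.txt.
sample = "He said I don't like it, and I don't care what you think."
rows = kwic(tokenize(sample), "I don't", window=2)
for left, keyword, right in rows:
    print(f"{left:>12} | {keyword} | {right}")
```

Reading down the right-hand column is what surfaces follow-on words such as "like", "think", "care", and "want".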

The last tool I used was Voyant, because I was curious what the word map would show for this text. Character names like "heathcliff" and "catherine" were the most commonly used words, which suggests the story revolves around these two people. After some digging on the internet, I believe this text is from Wuthering Heights, a novel from 1847. This conclusion really intrigued me, as my guess is that negative tones were more common in writing from that period than they are now.

[image]

Tdaley05 commented 4 months ago

What ngram sizes give you frequency counts above 5?

N-grams of sizes 1 to 5 yield results.

[Screenshot 2024-04-20 210710]

What ngram sizes give you frequency counts in the double digits?

Sizes 1-4.

[Screenshot 2024-04-20 210631]

What phrases get repeated a lot in this text?

“I don’t”, “I can’t”, “I could not”, “Of the”. Whether they are contractions or fully written forms, they tend to have “not” in them, and I was surprised by how often they come up.

[Screenshot 2024-04-20 210556]

Choose some ngram clusters of interest and explore them in their KWIC (Keyword in Context) view to scope the words before and after.

After “I could not,” the most common word to follow was “help,” so the phrase often showed up as “I could not help.”

[Screenshot 2024-04-20 210905] [Screenshot 2024-04-20 210352]
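The "what comes next" question can also be checked by counting the word that immediately follows each occurrence of a phrase. The sketch below assumes the same lowercase, letters-only tokenization as the ngram results; `next_words` and the sample sentences are hypothetical and are not taken from web.txt.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on non-letters.
    return re.findall(r"[a-z]+", text.lower())


def next_words(tokens, phrase):
    # Count the token that directly follows each occurrence of the phrase.
    target = tokenize(phrase)
    n = len(target)
    counts = Counter()
    if n == 0:
        return counts
    for i in range(len(tokens) - n):
        if tokens[i:i + n] == target:
            counts[tokens[i + n]] += 1
    return counts


# Hypothetical sample text, not taken from web.txt.
sample = "I could not help it. I could not help laughing. I could not speak."
followers = next_words(tokenize(sample), "I could not")
print(followers.most_common())
```

On the real text, the top entry of `followers.most_common()` would be the word that most often comes after the phrase.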

Voyant word map:

[Screenshot 2024-04-20 210942]