newtfire / introDH-Hub

shared repo for DIGIT 100: Introduction to Digital Humanities class at Penn State Erie, The Behrend College
https://newtfire.github.io/introDH-Hub/
Creative Commons Zero v1.0 Universal

Mystery Text discussion of cpi.txt #79

Closed by ebeshero 11 months ago

ebeshero commented 1 year ago

Post your screenshots and discuss your findings about cpi.txt here!

nhammer514 commented 1 year ago

I was looking at The Adventure of “The Western Star” and its contents. Experimenting with the N-Gram size, I discovered that filler words and nouns have the highest frequencies, reaching into the double digits. The smaller the N-Gram size, the more everyday phrases show up with high frequencies. Aside from irrelevant text, the phrase "I don't know" was used very often. Since it is a common thing to say when someone lacks knowledge, this should not be a surprise.
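
For anyone who wants to see how N-Gram counts like these come about, here is a minimal Python sketch. It is not the tool used for the screenshots, and the function name `ngram_counts` and the sample sentence are made up for illustration. It assumes a simple tokenizer that lowercases the text and keeps only letters and straight apostrophes, so a real tool may count things a little differently.

```python
from collections import Counter
import re

def ngram_counts(text, n):
    """Count every run of n consecutive words in text (hypothetical helper)."""
    # Assumption: words are runs of letters and straight apostrophes, lowercased.
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words across the list and join each window into a phrase.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Made-up sample text, not taken from cpi.txt.
sample = "I don't know what he meant. I don't know why he left. I know the answer."

print(ngram_counts(sample, 3).most_common(3))  # most frequent 3-word phrases
print(ngram_counts(sample, 1).most_common(3))  # most frequent single words
```

With an N-Gram size of 1 the top results are single common words, while a size of 3 surfaces repeated phrases such as "i don't know", which matches the pattern described above.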