leios / SoME_Topics

Collaboration / Topic requests for SoME

Types, Tokens, Hapaxes, Legomena, and other Dr. Seuss-sounding Linguistic thingamajigs #174

Closed VictorDavis closed 2 years ago

VictorDavis commented 2 years ago

About the author

Hello, producers! I am a Data Scientist @ Toptal, programming computers and doing analytics professionally by day; as a side interest, I do math-related reading and research at night. A few years ago I published a paper about a phenomenon in computational linguistics known as Heaps' Law, a less well-known cousin of Zipf's Law.

Quick Summary

What's so beautiful about computational linguistics is that when we speak or write, we are conveying very detailed and arbitrary information from brain to brain, so it seems counter-intuitive that statistics measuring various properties of text should fit any kind of physics-like equations. For example, the statement "common words tend to be shorter, rare words tend to be longer" makes sense when you think about it, yet it still produces a very satisfying "aha" moment for anyone grasping this little linguistic truth for the first time. But if you keep thinking about it, or scribble out some information-theoretic description of what exactly is being optimized here, it becomes deeply mysterious: how can little push-and-pull mechanisms like "parsimony" and "specificity" massage a language into such a regular shape, with curves and edges describable by some physics-like theory? How can any equation describe the rough features of the sentences I speak or type, when those features are totally unrelated to the actual content? I would like to take a lay reader or viewer on this particular journey, using a technical example to evoke this mystical feeling. I would love to have my paper adapted into a video explainer, but of course I am biased, and I can just as easily point you toward any number of other papers exploring this area.

Target medium

I am a long-time fan of 3b1b and Numberphile, and I would love to see more of these subtle and beautiful ideas from computational linguistics explored via this format. Vsauce has an excellent video about Zipf's Law, for example. I think such videos are hard to find because this is a relative mathematical backwater: precise fields like geometry and number theory get quite a lot of airtime, but as soon as you inject any randomness or stochasticity into a process, it loses that feel of mathematical "absoluteness" and gets pigeonholed into a less snazzy category like "numerical experiments." But importantly, more and more modern mathematicians are thinking about their fields and problems probabilistically; even results in number theory are often motivated by, or even proved via, "probabilistic arguments," as James Maynard explains in this video about a result he obtained related to the twin prime conjecture.

More details

Take any bit of text you like; Moby Dick, for example. The opening line, "Call me Ishmael", contains three unrepeated words. Let's plot this as the point $(3,3)$. But keep going and you're bound to repeat a word eventually. Take the first paragraph: "Call me Ishmael. Some years ago—never mind how long precisely [...] almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me." It contains $201$ words, but many of these repeat, leaving only $134$ distinct words. Let's plot this as the point $(201, 134)$. In the biz, we call each instance of a word a "token", and the word itself (the dictionary entry, if you like) a "type". What does this plot look like if we keep computing this statistic and plotting points? Well, from Wikipedia, a typical type-token relation looks like this:

[Figure: a typical type-token curve, rising steeply at first and then gradually flattening]

NOTE: There's a little bit of subjectivity in what you count as a "word" or a word break. For example, "ago—never" is obviously two words, but what about "sea-faring"? Although the exact counts will vary with such methodological choices, the overall pattern is the same.
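To make the counting concrete, here's a minimal sketch in Python (my choice of language; the setup above doesn't prescribe one). The tokenizer is deliberately naive, lowercasing and splitting on anything that isn't a letter or apostrophe, which is exactly the kind of methodological choice the note above flags:

```python
import re

def type_token_points(text: str) -> list[tuple[int, int]]:
    """For each token, record the running (tokens seen, types seen) pair."""
    # Naive tokenization: lowercase, keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    seen: set[str] = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)                 # a "type" counts only once
        points.append((i, len(seen)))  # (tokens so far, types so far)
    return points

print(type_token_points("Call me Ishmael.")[-1])  # -> (3, 3)
```

Feeding in the whole first paragraph would land the final point at roughly $(201, 134)$, give or take the tokenization rule.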

This looks an awful lot like a "natural" function, like $f(x) = \ln(x)$ or $f(x) = \sqrt{x}$. Harold Stanley Heaps (building on a 1960 observation by Gustav Herdan) made a simple educated guess that this plot was "log-linear": that the vocabulary size $V$ grows as a function of the total number of words like $V(n) = Kn^B$, or equivalently, that $\log V = B \log n + \log K$. Meaning, if you plot the logarithm of the vocabulary size against the logarithm of the total amount of text, you can fit a straight line through the data. However, from the moment this "law" was written down, it was known to be a first approximation at best and outright wrong at worst. The data never exactly fits: there's always a kind of "bend" in the line suggesting it's not a line at all, and there's no particular linguistic mechanism that would justify this pattern. So researcher after researcher has crunched the data in different ways, putting forward model after model tweaking Heaps' original equation, and there the science stands: a known-to-be-flawed initial conjecture and a menagerie of competing proposals to fix it. I am but one in a long line of researchers to throw my hat into the ring, claiming that my equation fits the data more accurately and cleanly than the rest.
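For illustration, here's how that log-log fit looks in code. The first two points come from the Moby Dick example above; the remaining ones are invented placeholders standing in for counts you'd compute from the full text:

```python
import numpy as np

# (total tokens, distinct types): first two points from the example above,
# the rest hypothetical stand-ins for counts taken from a full text.
n = np.array([3, 201, 1_000, 10_000, 100_000])
v = np.array([3, 134, 530, 2_900, 14_000])

# Heaps' law: V(n) = K * n^B, i.e. log(V) = B*log(n) + log(K).
# Ordinary least squares on the log-log data recovers B and K.
B, logK = np.polyfit(np.log(n), np.log(v), deg=1)
print(f"V(n) ~ {np.exp(logK):.2f} * n^{B:.2f}")

# The residuals expose the "bend": real type-token data never sits
# exactly on a straight line in log-log space.
print(np.log(v) - (B * np.log(n) + logK))
```

On real corpora those residuals show systematic curvature rather than random scatter, which is exactly the misfit the competing models are trying to explain.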

Controversy and ego aside, this is actually an interesting part of the story, if not the central kernel of it. In the world of data science, there is often a tradeoff between accuracy and simplicity. When we're able to thread the needle and get a lot more of both than the tradeoff usually allows, we call that "beauty." All these papers, all these models, compete over who has the more "beautiful" derivation or mechanism explaining why it has to be true. And while I (biased model daddy that I may be) believe mine is the one that should win favor in this competition on the basis of beauty alone, there is a meta-argument at play here, best popularized by Sabine Hossenfelder: how relevant is beauty in nature, and does chasing it as a heuristic guide us toward the truth or lead us astray?

Contact details

The email address on my GitHub profile.

Additional context

I can provide plenty of references, papers, and background material to any interested reader, depending on the direction this storyline may take.

VictorDavis commented 2 years ago

Found a collaborator.