rtapiaoregui / collocater

Spacy integrable pipeline component to identify collocations in text
MIT License

Theory Question: Collocater vs. nltk Collocation #3

Open ChasNelson1990 opened 4 years ago

ChasNelson1990 commented 4 years ago

Hi there,

First up, apologies if this is a stupid question - I'm not an NLP person and some of the language and ideas are brand new to me.

So, as I understand it, collocation is the idea of commonly occurring sequences of words. Prior to actually looking into NLP this week, I would have called these n-grams, and I think NLTK agrees with me. The NLTK collocation functions primarily look for n-grams, do some filtering and return those (see https://github.com/nltk/nltk/blob/develop/nltk/collocations.py).

So, if I run nltk's collocation on (as an example) The Hound of the Baskervilles, I get phrases like 'Mr. Holmes', 'Grimpen Mire', 'escaped convict' and 'missing boot' - these all seem pretty reasonable given the plot.
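For concreteness, this is roughly the nltk workflow I ran (a sketch only; the filename is a placeholder and nltk's punkt tokenizer data is assumed to be installed):

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Sketch: read the book, tokenise it, and score bigrams.
with open("hound_of_the_baskervilles.txt") as f:
    words = nltk.word_tokenize(f.read())

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)                           # ignore pairs seen fewer than 3 times
finder.apply_word_filter(lambda w: not w.isalpha())   # drop punctuation and numbers
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 20))          # the 20 highest-scoring bigrams
```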

But if I run your collocater pipeline I get very different results (and it takes significantly longer to process). The key differences I can see are: no proper nouns, duplicate entries, and the fact that those duplicates aren't equal to each other, e.g. I get several 'different' 'look at's returned.

So, I think the lack of proper nouns is caused by the fact that you're determining collocations from a collocation dictionary so words like 'Sherlock' will never be processed.

The duplicate entries, I think, roughly correspond to the number of times that collocation occurs, and the fact that the duplicates aren't equal is presumably down to the SpaCy vectors on those tokens being non-equal.

So, my first question is: what are you actually doing to determine these collocations? Why do you need to refer to a dictionary source in order to extract these?

I have a series of follow-up questions that are more about implementation than linguistics algorithms but I think I need to understand the linguistic rationale before I start suggesting technical changes.

Hope you don't mind me reaching out like this.

rtapiaoregui commented 4 years ago

Hi, Chas,

Thanks for asking! It's by no means a stupid question. Not only that, I think you raise a very valid point that introduces an extremely pertinent topic of discussion.

I believe the meaning of the term "collocation" intersects with that of "word n-gram" or "multi-word expression" in that it describes the habitual juxtaposition of a particular word with another word or words with a frequency greater than chance. Contrary to "word n-gram" or "multi-word expression", though, "collocation" is a linguistic term, i.e. a term coined by linguists to describe a pragmatic phenomenon that lets us trace the evolution of language over time: what becomes a "collocation" are those commonly juxtaposed words whose juxtaposition has stood the test of time.

That’s also why this package only retrieves the “word n-grams” that can be found in the Online Oxford Collocation Dictionary for all the nouns and verbs SpaCy identifies in a piece of text, each time they appear in it.

I’ve built it this way because I think it may prove useful, for some purposes, to know whether a collocation such as “look at” appears only once or several times in the text unit being inspected.

The whole purpose of this package is to let people determine the extent to which writers choose to employ language in a way that can convey something other than what’s explicitly stated, which I believe tells us something about the reliability of said writers.

For more information on what I mean by this, check out my other project on Github: https://github.com/rtapiaoregui/pragmatrix

Best regards, Rita

ChasNelson1990 commented 4 years ago

Theory response: So... if I'm understanding correctly... the collocations being detected here could be described as 'language norm collocations', i.e. collocations that occur in the analysed text and are common/accepted in English as a whole?

So, if one took the list of n-gram-based collocations (from nltk) and compared it to the list of OOCD collocations, then the intersection of the two sets would be 'standard' English, where, in your thinking (if I've understood correctly), the meaning is more likely to be explicit, and the difference of the two sets would be 'non-standard' English, where the meaning is more likely to be 'artistic' in nature, e.g. metaphorical.
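In code terms, the comparison I'm imagining is just a set intersection/difference over the two lists (a sketch; the two sets below are made-up placeholders rather than real output):

```python
# Placeholder sets of lowercased collocation strings extracted beforehand.
nltk_collocations = {"mr. holmes", "grimpen mire", "escaped convict", "look at"}
oocd_collocations = {"look at", "escaped convict", "stand the test"}

standard_usage = nltk_collocations & oocd_collocations      # 'standard' English
non_standard_usage = nltk_collocations - oocd_collocations  # possibly 'artistic'/metaphorical
print(standard_usage, non_standard_usage)
```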

ChasNelson1990 commented 4 years ago

Implementation response: So, at the moment, to actually get a count of how many times 'look at' or similar was used in a text, one would have to loop through the return from your Collocater() object, extract token.text.lower() for each token, and add that to a dataframe with a count of 1 or, if it already exists in the dataframe, increase its count by 1? Would it not be more useful for the output of Collocater() to be this? E.g. a list of unique collocations (case insensitive) with a count of occurrences (and sorted by that count, so that if you show the first 20 entries, they're the 20 most common collocations)?
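To spell out the counting I mean (a sketch only; `found` stands in for whatever iterable of tokens/spans Collocater() actually returns, mocked here with plain strings so the snippet runs):

```python
import pandas as pd

# Mocked stand-in for the Collocater() output: one entry per occurrence.
found = ["Look at", "look at", "Look at", "missing boot"]

counts = pd.Series([s.lower() for s in found]).value_counts()
print(counts.head(20))   # unique, case-insensitive collocations, most common first
```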

Also, is there a way to parallelise this? I note it was only using a single CPU for the whole processing time.
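For context, what I had in mind was something like spaCy's own batching (a sketch; in spaCy 2.2.2+ nlp.pipe() accepts n_process, but whether your component can be shipped to worker processes is an open question to me):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# nlp.add_pipe(...)  # the collocater component would be added here

texts = ["He looked at the moor.", "The convict escaped.", "Mr. Holmes smiled."]

# Fan the documents out over two worker processes, a batch at a time.
for doc in nlp.pipe(texts, n_process=2, batch_size=16):
    print([token.text for token in doc])
```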

rtapiaoregui commented 4 years ago

Hi, Chas,

I don't think that collocations which appear in the OOCD but not in a list of word n-grams are more metaphoric in nature. What I believe is that, by employing them, an author shows that he or she understands and values the important role the passing of time plays in the capacity of a given message to mean something other than what it could mean the first time it was received. That, in my opinion, makes him or her more likely to render a message that can be understood as metaphorical.

Regarding the question as to whether the output of Collocater() should be different, I chose the current one because it seemed the most useful for the way I intended the package to be used. You can check the website I built to find out more about that: https://collocations-finder.appspot.com/

I think you can use collections.Counter() on the keys of the dictionary output by Collocater() to retrieve the collocations ranked by their frequency of occurrence.
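Something along these lines should do it (just a sketch; I'm assuming here that iterating over the Collocater() output yields one token or span per occurrence, which is the structure you describe above):

```python
from collections import Counter

def ranked_collocations(found_collocations, top_n=20):
    # Count occurrences by their lowercased surface string and
    # return the top_n most common ones with their counts.
    counts = Counter(item.text.lower() for item in found_collocations)
    return counts.most_common(top_n)
```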

As for parallelisation techniques, I'm, unfortunately, not the biggest expert on the topic. There is probably a way to do it, but I would have to look further into it and didn't feel the need to include that functionality for the use I intended for the package.

Best, Rita

ChasNelson1990 commented 4 years ago

You could use collections, but you'd still have to go through and parse everything much as you would for a dataframe: collections relies on Python equality (==), and because of the way SpaCy tokens hold additional information, the token 'look' at one place in the document and 'look' at another place in the document are not equal. So you'd have to go through and build a list of just the token.text values before collections will work (I believe).
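A quick illustration of what I mean (not using Collocater(), just a plain SpaCy doc):

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
doc = nlp("Look at the moor, then look at the house.")

first, second = [t for t in doc if t.text.lower() == "look"]
print(first == second)   # False: same word, but two different tokens in the doc

# Counting therefore has to go via the underlying strings, not the tokens.
counts = Counter(t.text.lower() for t in doc if t.is_alpha)
print(counts["look"])    # 2
```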