juanshishido opened this issue 9 years ago
We might also consider using Word2Vec for this reduction step. Nice StitchFix post.
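For reference, a minimal sketch of how Word2Vec could be used for that kind of reduction with gensim; `tokenized_essays` (a list of token lists, one per essay) is an assumed name, and the nearest neighbors would stand in for a rough category rather than a WordNet hypernym:

```python
# Hypothetical sketch: train word vectors on the essay tokens and use nearest
# neighbors as a rough stand-in for higher-level categories.
from gensim.models import Word2Vec

# `tokenized_essays` is assumed to be a list of token lists, one per essay.
model = Word2Vec(tokenized_essays, min_count=5, workers=4)

# Words close to "baseball" in the embedding space (e.g., other sports terms)
# could be grouped together for the reduction step.
print(model.wv.most_similar("baseball", topn=10))
```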
I just pushed a notebook to my branch that combines frequent unigrams and 4-grams with hypernyms from WordNet (summarize_essays.ipynb). I kept it as a notebook because it still needs to be messed with.
It takes the 1000 most frequent unigrams and extracts hypernyms from these unigrams (using WordNet and code from class). After the hypernyms are calculated, it uses examples of these hypernyms as seeds to find contextual 4-grams. Finally, it filters the 4-grams to keep only those that occur more than 20 times.
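A rough sketch of those steps (this is not the notebook's actual code; `tokens`, a flat list of cleaned essay tokens, is an assumed name, and this is one possible reading of the hypernym-seeding step):

```python
from collections import Counter
from nltk import ngrams
from nltk.corpus import wordnet as wn

# 1000 most frequent unigrams
top_unigrams = [w for w, _ in Counter(tokens).most_common(1000)]

# hypernym lemmas for each unigram (first sense only, for simplicity)
hypernym_terms = set()
for word in top_unigrams:
    synsets = wn.synsets(word)
    if synsets:
        for hyper in synsets[0].hypernyms():
            hypernym_terms.update(lemma.name() for lemma in hyper.lemmas())

# 4-grams that contain one of those terms, kept only if they occur > 20 times
fourgram_counts = Counter(g for g in ngrams(tokens, 4)
                          if any(w in hypernym_terms for w in g))
frequent_fourgrams = {g: c for g, c in fourgram_counts.items() if c > 20}
```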
The code isn't super pretty and is a bit confusing; it can be cleaned up a lot. I tried to load in functions from calculate_pmi but failed, so I redefined them.
Great, thanks! I'll check it out tonight. I'll plan to move the notebook over to master to make sure it works with calculate_pmi_features.py.
I was going to merge your branch with master to get the notebook, but since I changed a few files, I did not want to break any uncommitted changes you had locally. The unfortunate side effect is that the commit of your notebook is attributed to me.
That notebook is slick, @matarhaller! I particularly like the "back to the future" 4-gram. Looks like this code will be useful for the topic descriptions.
Thanks @juanshishido! It's a little hacky, but it should be easy to build on for fancier topic descriptions.
As another visualization, in cases where there are distinct differences, maybe we can show the distribution of word types (e.g., pronouns, verbs, etc.) by group.
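If we go that route, something like this could produce the counts (a sketch; `group_tokens`, mapping a group name to its token list, is an assumed name):

```python
from collections import Counter
import nltk

def pos_distribution(tokens):
    """Proportion of each part-of-speech tag in a list of tokens."""
    tags = Counter(tag for _, tag in nltk.pos_tag(tokens))
    total = sum(tags.values())
    return {tag: count / total for tag, count in tags.items()}

# `group_tokens` is assumed to map a group/cluster name to its tokens.
distributions = {group: pos_distribution(toks)
                 for group, toks in group_tokens.items()}
```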
Maybe! That's a nice idea as well!
Thinking about ways to describe text topics.
The "topics" we'll try to explain are those defined by text in a given cluster. Text could be an individual or a group of (even all) essay responses—based on how the clusters were established.
This analysis may look at the original text or some cleaned version obtained in a previous step (such as when calculating the PMI). The basic approach I'm thinking of is:
Tokens will be lemmatized and stopwords will be removed.
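A minimal sketch of that cleaning step with NLTK (the function name is just illustrative):

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean(text):
    """Lowercase, tokenize, drop stop words/punctuation, and lemmatize."""
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]
```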
Key(word)phrase Extraction
This could be done in several ways.
One option is tf-idf, treating all of the text in a cluster combined as a single document for the tf portion. For the idf, use this single document as well as all of the other individual documents (essay responses).

I'm not sure how any of these methods will perform.
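A sketch of the tf-idf option with scikit-learn (`cluster_text` and `essay_responses` are assumed names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Row 0 is the combined cluster text; the rest are individual essay responses.
documents = [cluster_text] + essay_responses
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# Highest-scoring terms for the cluster "document"
terms = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn
scores = tfidf[0].toarray().ravel()
top_terms = sorted(zip(terms, scores), key=lambda pair: -pair[1])[:20]
print(top_terms)
```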
We could also use "standard" keyphrase extraction techniques that look at noun phrases along with other tokens. This might be more difficult to reduce, though. Still, it should be explored.
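For the noun-phrase idea, a rough sketch with an NLTK chunker (the grammar here is just illustrative):

```python
import nltk

# Simple noun-phrase grammar: optional adjectives followed by one or more nouns.
chunker = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")

def noun_phrases(tokens):
    tree = chunker.parse(nltk.pos_tag(tokens))
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "NP"]
```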
Hypernyms
Based on the first part, we could reduce the words to their higher-level categories. WordNet might be the way to go here (the hypernym_paths() method?). An example could be with keywords such as baseball, basketball, football, hockey, etc. that would map to "sports."
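A small example of what that lookup might look like:

```python
from nltk.corpus import wordnet as wn

for word in ["baseball", "basketball", "football", "hockey"]:
    synset = wn.synsets(word, pos=wn.NOUN)[0]
    # hypernym_paths() returns the chains from the root down to this synset;
    # for these words the chains pass through sport.n.01.
    path = synset.hypernym_paths()[0]
    print(word, "->", [s.name() for s in path])
```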
Other

There are other ways to summarize documents, including Luhn and TextRank, both of which are implemented in sumy.
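A minimal sketch of trying both through sumy (assuming a cluster's text is in `cluster_text`):

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer

parser = PlaintextParser.from_string(cluster_text, Tokenizer("english"))

# Print a 3-sentence summary from each summarizer.
for summarizer in (LuhnSummarizer(), TextRankSummarizer()):
    print(type(summarizer).__name__)
    for sentence in summarizer(parser.document, 3):
        print(" ", sentence)
```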