juanshishido opened this issue 9 years ago
We might also consider using Word2Vec for this reduction step. Nice StitchFix post.
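For reference, a minimal sketch of how Word2Vec could be used for that kind of reduction with gensim; `tokenized_essays` (a list of token lists, one per essay) is an assumed name, and the nearest neighbors would stand in for a rough category rather than a WordNet hypernym:

```python
# Hypothetical sketch: train word vectors on the essay tokens and use nearest
# neighbors as a rough stand-in for higher-level categories.
from gensim.models import Word2Vec

# `tokenized_essays` is assumed to be a list of token lists, one per essay.
model = Word2Vec(tokenized_essays, min_count=5, workers=4)

# Words close to "baseball" in the embedding space (e.g., other sports terms)
# could be grouped together for the reduction step.
print(model.wv.most_similar("baseball", topn=10))
```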
I just pushed a notebook to my branch that combines frequent unigrams and 4-grams with hypernyms from WordNet (summarize_essays.ipynb). I kept it as a notebook because it still needs to be messed with.
It takes the 1000 most frequent unigrams and extracts hypernyms from these unigrams (using WordNet and code from class). After the hypernyms are calculated, it uses examples of these hypernyms as seeds to find contextual 4-grams. Finally, it filters the 4-grams to keep only those that occur more than 20 times.
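A rough sketch of those steps (this is not the notebook's actual code; `tokens`, a flat list of cleaned essay tokens, is an assumed name, and this is one possible reading of the hypernym-seeding step):

```python
from collections import Counter
from nltk import ngrams
from nltk.corpus import wordnet as wn

# 1000 most frequent unigrams
top_unigrams = [w for w, _ in Counter(tokens).most_common(1000)]

# hypernym lemmas for each unigram (first sense only, for simplicity)
hypernym_terms = set()
for word in top_unigrams:
    synsets = wn.synsets(word)
    if synsets:
        for hyper in synsets[0].hypernyms():
            hypernym_terms.update(lemma.name() for lemma in hyper.lemmas())

# 4-grams that contain one of those terms, kept only if they occur > 20 times
fourgram_counts = Counter(g for g in ngrams(tokens, 4)
                          if any(w in hypernym_terms for w in g))
frequent_fourgrams = {g: c for g, c in fourgram_counts.items() if c > 20}
```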
The code isn't super pretty and is a bit confusing; it can be cleaned up a lot. I tried to load in functions from calculate_pmi but failed, so I redefined them.
Great, thanks! I'll check it out tonight. I'll plan to move the notebook over to master to make sure it works with calculate_pmi_features.py.
I was going to merge your branch with master to get the notebook, but since I changed a few files, I did not want to break any uncommitted changes you had locally. The unfortunate side effect is that the commit of your notebook is attributed to me.
That notebook is slick, @matarhaller! I particularly like the "back to the future" 4-gram. Looks like this code will be useful for the topic descriptions.
Thanks @juanshishido! It's a little hacky, but it should be easy to build on for fancier topic descriptions.
As another visualization, in cases where there are distinct differences, maybe we can show the distribution of word types (e.g., pronouns, verbs, etc.) by group.
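If we go that route, something like this could produce the counts (a sketch; `group_tokens`, mapping a group name to its token list, is an assumed name):

```python
from collections import Counter
import nltk

def pos_distribution(tokens):
    """Proportion of each part-of-speech tag in a list of tokens."""
    tags = Counter(tag for _, tag in nltk.pos_tag(tokens))
    total = sum(tags.values())
    return {tag: count / total for tag, count in tags.items()}

# `group_tokens` is assumed to map a group/cluster name to its tokens.
distributions = {group: pos_distribution(toks)
                 for group, toks in group_tokens.items()}
```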
Maybe! That's a nice idea as well!
Thinking about ways to describe text topics.
The "topics" we'll try to explain are those defined by text in a given cluster. Text could be an individual or a group of (even all) essay responses—based on how the clusters were established.
This analysis may look at the original text or some cleaned version obtained in a previous step (such as when calculating the PMI). The basic approach I'm thinking of is:
Tokens will be lemmatized and stopwords will be removed.
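A minimal sketch of that cleaning step with NLTK (the function name is just illustrative):

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean(text):
    """Lowercase, tokenize, drop stop words/punctuation, and lemmatize."""
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]
```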
Key(word)phrase Extraction
This could be done in several ways.
One option is tf-idf, treating all of the text in a cluster combined as a single document for the tf portion. For the idf, use this single document as well as all of the other individual documents (essay responses).

I'm not sure how any of these methods will perform.
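A sketch of the tf-idf option with scikit-learn (`cluster_text` and `essay_responses` are assumed names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Row 0 is the combined cluster text; the rest are individual essay responses.
documents = [cluster_text] + essay_responses
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# Highest-scoring terms for the cluster "document"
terms = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn
scores = tfidf[0].toarray().ravel()
top_terms = sorted(zip(terms, scores), key=lambda pair: -pair[1])[:20]
print(top_terms)
```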
We could also use "standard" keyphrase extraction techniques that look at noun phrases along with other tokens. This might be more difficult to reduce, though. Still, it should be explored.
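For the noun-phrase idea, a rough sketch with an NLTK chunker (the grammar here is just illustrative):

```python
import nltk

# Simple noun-phrase grammar: optional adjectives followed by one or more nouns.
chunker = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")

def noun_phrases(tokens):
    tree = chunker.parse(nltk.pos_tag(tokens))
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "NP"]
```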
Hypernyms
Based on the first part, we could reduce the words to their higher-level categories. WordNet might be the way to go here (the hypernym_paths() method?). An example could be with keywords such as baseball, basketball, football, hockey, etc. that would map to "sports."
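A small example of what that lookup might look like:

```python
from nltk.corpus import wordnet as wn

for word in ["baseball", "basketball", "football", "hockey"]:
    synset = wn.synsets(word, pos=wn.NOUN)[0]
    # hypernym_paths() returns the chains from the root down to this synset;
    # for these words the chains pass through sport.n.01.
    path = synset.hypernym_paths()[0]
    print(word, "->", [s.name() for s in path])
```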
Other

There are other ways to summarize documents, including Luhn and TextRank, both of which are implemented in sumy.
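A minimal sketch of trying both through sumy (assuming a cluster's text is in `cluster_text`):

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer

parser = PlaintextParser.from_string(cluster_text, Tokenizer("english"))

# Print a 3-sentence summary from each summarizer.
for summarizer in (LuhnSummarizer(), TextRankSummarizer()):
    print(type(summarizer).__name__)
    for sentence in summarizer(parser.document, 3):
        print(" ", sentence)
```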