juanshishido / okcupid

Analyzing online self-presentation
MIT License

describing topics #10

Open juanshishido opened 8 years ago

juanshishido commented 8 years ago

Thinking about ways to describe text topics.

The "topics" we'll try to explain are those defined by the text in a given cluster. That text could be a single essay response or a group of (even all) essay responses, depending on how the clusters were established.

This analysis may look at the original text or some cleaned version obtained in a previous step (such as when calculating the PMI). The basic approach I'm thinking of is:

Tokens will be lemmatized and stopwords will be removed.
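A minimal sketch of that preprocessing step, in pure Python so it runs without any corpus downloads. The stopword set and the `LEMMAS` lookup table are toy assumptions made up for illustration; a real run would likely swap in a proper stopword list and lemmatizer.

```python
import re

# Assumption: a tiny illustrative stopword list, not the project's actual list.
STOPWORDS = {"a", "an", "and", "the", "i", "to", "of", "in", "is", "are", "my"}

# Assumption: a toy lookup table standing in for a real lemmatizer.
LEMMAS = {
    "movies": "movie",
    "friends": "friend",
    "loved": "love",
    "loves": "love",
    "running": "run",
}


def preprocess(text):
    """Lowercase, tokenize, lemmatize, then drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    return [token for token in lemmas if token not in STOPWORDS]


print(preprocess("I love the movies and my friends"))
```

Lemmatizing before filtering means inflected forms are normalized first, so the stopword check only ever sees base forms.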

Key(word)phrase Extraction

This could be done in several ways.

I'm not sure how any of these methods will perform.

We could also use "standard" keyphrase extraction techniques that look at noun phrases along with other tokens. Those phrases might be harder to reduce to higher-level categories, though. Still, it should be explored.

Hypernyms

Based on the first part, we could reduce the words to their higher-level categories. WordNet might be the way to go here (the hypernym_paths() method?). For example, keywords such as baseball, basketball, football, and hockey would all map to "sports."
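To make the reduction idea concrete, here is a rough sketch that finds the deepest category shared by a set of keywords. The `HYPERNYM_PATHS` table is hand-written and hypothetical; in practice the root-to-word paths would come from WordNet rather than a dictionary like this one.

```python
# Hypothetical hand-written hypernym paths, listed from root to word.
# Assumption: real paths would be looked up in WordNet instead.
HYPERNYM_PATHS = {
    "baseball": ["entity", "activity", "sport", "baseball"],
    "basketball": ["entity", "activity", "sport", "basketball"],
    "football": ["entity", "activity", "sport", "football"],
    "hockey": ["entity", "activity", "sport", "hockey"],
    "guitar": ["entity", "artifact", "instrument", "guitar"],
}


def lowest_common_hypernym(words):
    """Return the deepest category shared by every known word, or None."""
    paths = [HYPERNYM_PATHS[w] for w in words if w in HYPERNYM_PATHS]
    if not paths:
        return None
    common = None
    # Walk the paths level by level from the root until they diverge.
    for level in zip(*paths):
        if len(set(level)) == 1:
            common = level[0]
        else:
            break
    return common


print(lowest_common_hypernym(["baseball", "basketball", "football", "hockey"]))
```

Adding an unrelated keyword such as "guitar" pushes the shared category all the way up to "entity", which hints at why the keyword set probably needs to be grouped before it is reduced.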

Other

There are other ways to summarize documents, including Luhn and TextRank, both of which are implemented in sumy.

juanshishido commented 8 years ago

We might also consider using Word2Vec for this reduction step. Nice StitchFix post.

matarhaller commented 8 years ago

I just pushed a notebook to my branch that combines frequent unigrams and 4-grams with hypernyms from WordNet (summarize_essays.ipynb). I kept it as a notebook because it still needs to be messed with.

It takes the 1000 most frequent unigrams and extracts hypernyms for them (using WordNet and code from class). Once the hypernyms are calculated, it uses examples of these hypernyms as seeds to find contextual 4-grams. Finally, it filters the 4-grams to keep only those that occur more than 20 times.
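The seeding and filtering steps could look roughly like the sketch below. This is not the notebook's code: the function names and the choice to restrict seeds to the frequent unigrams are assumptions, and the seed words are passed in directly instead of being derived from WordNet.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_seeded_ngrams(docs, seeds, n=4, top_k=1000, min_count=20):
    """Count n-grams containing a seed word; keep those seen more than min_count times."""
    unigram_counts = Counter(token for doc in docs for token in doc)
    frequent = {token for token, _ in unigram_counts.most_common(top_k)}
    # Assumption: only seeds that are also frequent unigrams are used.
    active_seeds = set(seeds) & frequent
    gram_counts = Counter(
        gram
        for doc in docs
        for gram in ngrams(doc, n)
        if active_seeds.intersection(gram)
    )
    return {gram: count for gram, count in gram_counts.items() if count > min_count}


# Toy corpus: each document is an already-tokenized essay.
docs = [["back", "to", "the", "future", "rocks"]] * 21
docs += [["i", "like", "the", "future", "too"]] * 3
result = frequent_seeded_ngrams(docs, seeds={"future"})
print(sorted(result))
```

On this toy corpus only the two 4-grams from the repeated document clear the threshold; the ones that appear three times are dropped.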

The code isn't super pretty and is a bit confusing; it can be cleaned up a lot. I tried to load in functions from calculate_pmi but failed, so I redefined them.

juanshishido commented 8 years ago

Great, thanks! I'll check it out tonight. I'll plan to move the notebook over to master to make sure it works with calculate_pmi_features.py.

juanshishido commented 8 years ago

I was going to merge your branch with master to get the notebook, but since I changed a few files, I did not want to break any uncommitted changes you had locally. The unfortunate side effect is that the commit of your notebook is attributed to me.

juanshishido commented 8 years ago

That notebook is slick, @matarhaller! I particularly like the back to the future 4-gram. Looks like this code will be useful for the topic descriptions.

matarhaller commented 8 years ago

Thanks @juanshishido! It's a little hacky, but should be easy to build on for more fancy topic descriptions.

juanshishido commented 8 years ago

Maybe, as another visualization, we can show the distribution of word types (e.g., pronouns, verbs, etc.) by group in cases where there are distinct differences.
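A quick sketch of the numbers such a plot would need, assuming the tokens in each group have already been part-of-speech tagged. The tag names and the two example groups are made up for illustration.

```python
from collections import Counter


def pos_distribution(tagged_tokens):
    """Return the share of each tag among (token, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in counts.items()}


# Hypothetical pre-tagged tokens for two groups.
groups = {
    "group_a": [("i", "PRON"), ("run", "VERB"), ("we", "PRON"), ("hike", "VERB")],
    "group_b": [("music", "NOUN"), ("i", "PRON"), ("books", "NOUN"), ("food", "NOUN")],
}

for name, tagged in groups.items():
    print(name, pos_distribution(tagged))
```

Using proportions instead of raw counts keeps groups of different sizes comparable.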

jnaras commented 8 years ago

Maybe! That's a nice idea as well!