juanshishido / okcupid

Analyzing online self-presentation
MIT License

lemmatizing #8

Closed matarhaller closed 8 years ago

matarhaller commented 8 years ago

I'm not sure whether this is an issue so much as a question. The top trigrams (after lemmatizing, removing stopwords, etc.) are the following:

[(('making', 'people', 'laugh'), 3282), (('http', ':/', 'www'), 2616), (('spend', 'lot', 'time'), 2526), (('meeting', 'new', 'people'), 2468), (("i'm", 'really', 'good'), 2159), (('trying', 'new', 'thing'), 2036), (('meet', 'new', 'people'), 1880), (('pretty', 'much', 'anything'), 1771), (('www', 'youtube', 'com'), 1582), (('typical', 'friday', 'night'), 1574)]

Both (('meeting', 'new', 'people'), 2468) and (('meet', 'new', 'people'), 1880) are included.

If we lemmatized "meeting", shouldn't they be counted as the same trigram (i.e., ('meet', 'new', 'people'))?

matarhaller commented 8 years ago

Never mind, I guess it's because it thinks "meeting" is a noun and not a verb.
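
To illustrate the explanation above, here is a minimal sketch of how the part-of-speech choice affects whether the two trigrams collapse into one. The thread does not show the project's lemmatizing code, so `TOY_LEMMAS`, `toy_lemmatize`, and `merge_counts` are hypothetical stand-ins, not the repository's implementation. The lookup table mimics a lemmatizer that treats tokens as nouns unless told otherwise; NLTK's `WordNetLemmatizer.lemmatize(word, pos='n')`, for example, defaults to noun.

```python
from collections import Counter

# Hypothetical lookup standing in for a real POS-aware lemmatizer.
# Keys are (token, part of speech): 'n' = noun, 'v' = verb.
TOY_LEMMAS = {
    ("meeting", "n"): "meeting",  # as a noun, "meeting" is already a lemma
    ("meeting", "v"): "meet",     # as a verb, it reduces to "meet"
}


def toy_lemmatize(token, pos="n"):
    """Return the lemma of token under the given POS, or token unchanged."""
    return TOY_LEMMAS.get((token, pos), token)


# Two of the trigram counts reported in the issue.
trigram_counts = {
    ("meeting", "new", "people"): 2468,
    ("meet", "new", "people"): 1880,
}


def merge_counts(counts, pos):
    """Re-lemmatize every token with a fixed POS and sum counts that collide."""
    merged = Counter()
    for trigram, n in counts.items():
        merged[tuple(toy_lemmatize(tok, pos) for tok in trigram)] += n
    return merged


as_noun = merge_counts(trigram_counts, pos="n")  # both trigrams stay distinct
as_verb = merge_counts(trigram_counts, pos="v")  # both collapse onto "meet"

print(as_noun)
print(as_verb)
```

Under the noun assumption the two trigrams remain separate, which matches the output in the issue; under the verb assumption their counts are summed. Forcing a single POS for every token is only for illustration. A real pipeline would more likely tag each token first and pass the tag to the lemmatizer.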