juanshishido / okcupid

Analyzing online self-presentation
MIT License

variance explained #7

Open matarhaller opened 8 years ago

matarhaller commented 8 years ago

Looks like PCA will be harder than we thought. When I take the first 10 principal components (after whitening), they collectively explain only about 18% of the variance. I'm not sure if the problem is in the data we're putting in (maybe imputing isn't helping our cause?), but we might need to think about this a little bit.
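
Roughly, the check looks like this (just a sketch; data_matrix is a stand-in for whatever we end up feeding in):

import numpy as np
from sklearn.decomposition import PCA

# Fit a whitened PCA and see how much variance the first 10 components cover.
pca = PCA(n_components=10, whiten=True)
reduced = pca.fit_transform(data_matrix)
print(np.cumsum(pca.explained_variance_ratio_)[-1])  # ~0.18 for us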

jnaras commented 8 years ago

Okay, let's try lemmatizing the words before we create the data matrix. That may help.
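
Something along these lines (a sketch using NLTK; assumes the punkt and WordNet data are downloaded):

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

def lemmatize_essay(text):
    # Lowercase, tokenize, and lemmatize an essay before it goes into the data matrix.
    tokens = word_tokenize(text.lower())
    return ' '.join(lemmatizer.lemmatize(token) for token in tokens)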

jnaras commented 8 years ago

I just added a bit of code in. I'll push in an hour, got a meeting.

juanshishido commented 8 years ago

@jnaras: I'm going to push the notebook with the fixes for the NaNs (from #4). I'm pushing it separately rather than updating the original notebook since you mentioned you added some code; I want to avoid merge conflicts.

What do you both think about moving functions to .py files? This would help if two or more people are working on the same code, and the notebooks could then just show output or how to use the code.

matarhaller commented 8 years ago

That works for me.

juanshishido commented 8 years ago

Cool, thanks.

juanshishido commented 8 years ago

Back on topic. @semerj mentioned that 18% is pretty good for text!

matarhaller commented 8 years ago

Really? For Marti's paper, they used 3 factors that accounted for 48% of the variance. Ten components for 18% seems really low, but we can try it and see if it works... Maybe it will be cleaner after lemmatizing.

juanshishido commented 8 years ago

Hmm. Did Marti and her co-author reduce the text prior to the factor analysis (like we did with the 1%)? Or was that their dimensionality reduction approach? (I don't know much about factor analysis.)

Yeah, we'll see what happens after lemmatizing.

semerj commented 8 years ago

PCA isn't a panacea. Your performance is going to depend heavily on your features/preprocessing.

juanshishido commented 8 years ago

Thanks, @semerj. I think we were surprised by our value, especially in comparison to Marti's previous work, which I hadn't read too closely. We have a few other things to try, so we'll see how that works.

Also wondering whether SVD or Non-negative matrix factorization might work well with text. I think both of these were mentioned in class.
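
For instance (just a sketch, and it swaps in TF-IDF for illustration; our actual features come from the PMI notebook, and essays stands in for the cleaned essay strings):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF

tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(essays)      # sparse matrix; no densifying needed

svd = TruncatedSVD(n_components=10)  # LSA-style SVD, works directly on sparse input
X_svd = svd.fit_transform(X)
print(svd.explained_variance_ratio_.sum())

nmf = NMF(n_components=10)           # non-negative, parts-based factors
X_nmf = nmf.fit_transform(X)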

jnaras commented 8 years ago

PCA is basically SVD, but we could try it. If 18% works well, that's fine. Marti's previous work used factor analysis to explain 40% of the variance, but it was also a smaller dataset. I think we should really use the lemmatizing, then. I'll push that code into the PMI notebook.

As far as code organization goes, I'm hoping the PMI notebook can just be used as a black box to create the data matrix. Unless you'd both rather I push a .py conversion of it? Let me know.

matarhaller commented 8 years ago

I'm okay with it being used as a black box for now to create the data matrix, but when we need to submit the project it might be cleaner if they were .py files. We can cross that bridge when we get there.

juanshishido commented 8 years ago

I agree, and I'd prefer having .py files. But, as Matar mentioned, that's a lower priority right now.

jnaras commented 8 years ago

Okay, I pushed it. I didn't have any 'nan' entries in the data matrix at the end. If that doesn't really help, we can try stemming as well.

juanshishido commented 8 years ago

Thanks, @jnaras!

Also, thanks for mentioning that PCA is basically SVD! Link with more info (mostly for myself).
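
A quick way to convince myself (a toy sketch on the Iris data): PCA on mean-centered data gives the same components, up to sign, as the right singular vectors from an SVD.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
Xc = X - X.mean(axis=0)                            # center the columns

pca = PCA(n_components=2).fit(Xc)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are the principal axes

print(np.allclose(np.abs(pca.components_), np.abs(Vt[:2])))  # True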

juanshishido commented 8 years ago

@jnaras Did you remove blank TotalEssays? If not, how are there no NaNs? I thought that was what was causing data_matrix to have NaNs. See this.

matarhaller commented 8 years ago

@jnaras I ran your notebook and the data matrix I got still had NaNs. Not sure if I'm doing something wrong?

juanshishido commented 8 years ago

The last two lines of get_data should be:

df['TotalEssays'] = df['TotalEssays'].apply(lambda x: BeautifulSoup(x).getText().replace('\n', ' '))\
                                     .apply(lambda x: re.sub(r'\s+', ' ', x).strip())
return df[df.TotalEssays.str.len() > 0]

jnaras commented 8 years ago

The last line of my notebook asks if there are any NaNs in the data. Without Juan's add-on, I didn't get any. I don't really understand why. But maybe try it?

jnaras commented 8 years ago

Okay, I pushed a new fix to the notebook with the additional lines from Juan's edit. Sorry, I should have added it earlier; I was just confused about where the NaNs came from.

Hopefully that helps. I wonder if the pickle module is messing up the data... I'll look into it.

juanshishido commented 8 years ago

@jnaras It's totally fine. I should have communicated better about the update. I wasn't explicit about the fix—I wrote it as a comment on an issue and pushed to master.

Anyway, thanks for updating the notebook! I will try to run it now. Also, I'm going to remove Calculate PMI features (NaNs).ipynb since everything is in your version of Calculate PMI features.ipynb.

One additional note: the change to get_data doesn't keep only samples where all the essays were filled out (as we talked about here); it keeps any row where at least one character was written in at least one essay. Instead, we could count NaNs across the essay columns row-wise and keep only rows where that count is zero, as sketched below. (Then the lines we added to get_data would be superfluous.)
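
Something like this, if we go that route (a sketch; the essay column names are just placeholders for however they're named in our frame):

# Keep only users who answered every essay prompt.
essay_cols = ['essay{}'.format(i) for i in range(10)]
complete = df[df[essay_cols].isnull().sum(axis=1) == 0]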

Thanks again!

matarhaller commented 8 years ago

After trying a bunch of things, here's a summary of where we're at with PCA/k-means:

(Plots attached: var_explained and cumsum_var_explained, i.e. the per-component and cumulative variance explained.)

Given that we have ~3k features, I guess 50 components isn't too many. I was thinking of sticking with 50 for now, since 45% of the variance seems respectable, or maybe even dropping to 20.

The code is ugly and PCA takes a long time to run with 50 components. Until I clean up the code, I saved out the components and the reduced data matrix into a pickle file so you can play around with it. I'll email it to you guys (I think it's small enough).
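
(Roughly what the save looks like, in case it helps when you open the file; the variable and file names here are just illustrative.)

import pickle
from sklearn.decomposition import PCA

pca = PCA(n_components=50)
reduced = pca.fit_transform(data_matrix)

with open('pca_50.pkl', 'wb') as f:
    pickle.dump({'components': pca.components_,
                 'explained_variance_ratio': pca.explained_variance_ratio_,
                 'reduced': reduced}, f)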

Thoughts?

juanshishido commented 8 years ago

Thanks for the detailed message, @matarhaller! In terms of the running time for PCA, it looks as if IncrementalPCA might be faster while "almost exactly match[ing] the results of PCA." I'm fine with keeping 50 principal components. Also, thanks for the pickle file!

Are the clusters still very different while using kmeans++? This might suggest that the "natural" clusters are elongated or irregularly shaped, which is bad news for k-means.

It's great that you started using the silhouette score. Three clusters might not be too bad given that we're combining all of the essay responses. I've been thinking a lot about this today. Is it possible that combining makes it more difficult to find different "topics" people are writing about? What I'm thinking of is that a set of users might write about topic A for prompt 0 while another set writes about topic B. However, what if the second set writes about topic A for prompt 1? I'm thinking that across prompts, what's written about might not be too distinctive. Just conjecture, but something to think about.
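
Going back to the silhouette/k-means point, a quick sketch of how we could compare runs and values of k (reduced stands in for the 50-component matrix from your pickle):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in (3, 5, 10):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10)
    labels = km.fit_predict(reduced)
    # sample_size keeps the silhouette computation fast on a big matrix
    print(k, silhouette_score(reduced, labels, sample_size=10000))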

matarhaller commented 8 years ago

Since PCA only needs to be run once, and the data matrix fits in memory, I opted to just use the standard PCA instead of IncrementalPCA. I was reading the documentation and it says that PCA only works with dense arrays. I know we did the .to_dense transformation, but does it matter that the matrix is technically still sparse (many zero entries)? I don't think it matters, but I just want to confirm.

The only indication I have that the clusters are different on different runs of kmeans++ is that the silhouette score changes (and sometimes kmeans fails completely). We might want to try a different clustering algorithm...

As for separating out different essays - I need to think about this a bit more, but we can try just using a single essay and seeing if it separates out better. My inclination is that since we're using ngrams, we aren't really getting at broader topics anyway, so I don't know if it would be sensitive to the example you gave above. Not sure though...

juanshishido commented 8 years ago

They have an example using IncrementalPCA on the Iris data set, which certainly fits into memory. I'm still curious whether it would speed things up on the data we're using. Of course, as you mentioned, a different clustering algorithm might work better.

Regarding the documentation noting that PCA only works with dense arrays, I think they mean the actual data structure or representation of the data and not the values themselves.
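
Concretely, a toy sketch of the distinction (not our actual data):

import numpy as np
from scipy import sparse

# A dense numpy array with many zeros is still a dense data structure,
# which is what sklearn's PCA expects.
dense_with_zeros = np.array([[0., 0., 1.], [0., 2., 0.]])

# A sparse matrix is a different representation entirely; this is analogous
# to what our .to_dense step converts away from.
sparse_version = sparse.csr_matrix(dense_with_zeros)
back_to_dense = sparse_version.toarray()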

I'd like to think more about the last point, too. Even though we're using ngrams, I'm thinking it might still be influenced. Let me try another example in 2d space with just two essays. If the features we were clustering on, for example, are the tokens "good" and "bad," a few possibilities are:

  1. good for both essays
  2. more good for one essay and more bad for the other
  3. bad for both essays

Imagine "good" along the x-axis and "bad" along the y-axis. Regardless of whether we combine essays or not, users in scenarios 1 and 3 would presumably be correctly grouped. We can imagine them on opposite sides of the coordinate system—large x and small y versus small x and large y. If combining essays for users under scenario 2, however, they might be in between (and possibly overlap with) users in 1 and 3, making it would be difficult to tell which cluster they might belong to. If the essays are separate, though, they we know those under scenario 2 will be in the "good" group for one of the essays and the "bad" group for the other.

I'm not sure whether this is even remotely representative of what's happening in our data, but it's what I was thinking about. Thanks for entertaining the idea!

matarhaller commented 8 years ago

Re: IncrementalPCA. Since we're only running PCA one time, I'm okay with just doing the standard one. Also, since IncrementalPCA seems to be an approximation of PCA, I think we're better off using plain PCA if we can.

I agree with your assessment of dense arrays - I just wanted to hear it from someone else :)

And as for splitting up the essays - I think it's an empirical question. We can try! Not sure though if we should just focus on a single essay or if we should do it separately for a few (and then if so, which do we pick?)

juanshishido commented 8 years ago

You're hard to convince, @matarhaller!

jnaras commented 8 years ago

@juanshishido If incremental PCA can be used for matrices that don't all fit into memory, it's going to hit disk a lot. This will actually make it a lot slower.

@matarhaller Okay! I'll try to read up on varimax rotation and implement it in Python for non-square matrices.

@both, I'll convert my data_matrix generation script into a .py file for ease of use and merging.
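
As a possible starting point for the varimax rotation, the standard plain-numpy version looks roughly like this (a sketch; I haven't checked it against our loadings yet):

import numpy as np

def varimax(Phi, gamma=1.0, max_iter=100, tol=1e-6):
    # Rotate a p x k loadings matrix Phi (k <= p, not necessarily square).
    p, k = Phi.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lambda = Phi.dot(R)
        u, s, vt = np.linalg.svd(
            Phi.T.dot(Lambda**3 - (gamma / p) * Lambda.dot(np.diag(np.diag(Lambda.T.dot(Lambda))))))
        R = u.dot(vt)
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:  # converged
            break
    return Phi.dot(R)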

juanshishido commented 8 years ago

@jnaras I think I'll need to %timeit :sweat_smile:

juanshishido commented 8 years ago

Wow, IncrementalPCA is ~73% slower (324 µs vs. 187 µs per loop)!

[1]: from sklearn.datasets import load_iris
     from sklearn.decomposition import PCA
     from sklearn.decomposition import IncrementalPCA

[2]: iris = load_iris()
     X = iris.data

[3]: %%timeit
     pca = PCA(n_components=2)
     pca.fit(X)
10000 loops, best of 3: 187 µs per loop

[4]: %%timeit
     # batch size to control memory usage
     ipca = IncrementalPCA(n_components=2, batch_size=150)
     ipca.fit(X)
1000 loops, best of 3: 324 µs per loop

Thanks, @jnaras!