matarhaller opened this issue 9 years ago
Okay, let's try lemmatizing the words before we create the data matrix. That may help.
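For reference, a minimal sketch of the kind of lemmatization step being discussed, using NLTK's `WordNetLemmatizer` (the tokenizer, downloads, and column name here are assumptions, not the notebook's actual code):

```python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads (assumption: not already available locally)
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    """Lowercase, tokenize, and lemmatize a single essay string."""
    tokens = word_tokenize(text.lower())
    return ' '.join(lemmatizer.lemmatize(tok) for tok in tokens)

# Hypothetical usage on the combined-essay column:
# df['TotalEssays'] = df['TotalEssays'].apply(lemmatize_text)
```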
I just added a bit of code in. I'll push in an hour, got a meeting.
@jnaras: I am going to push the notebook with the fixes for the nans (from #4). I am choosing this instead of updating the original notebook since you mentioned you added some code. I want to avoid merge conflicts.
What do the both of you think about moving functions to `.py` files? This would help if two or more people are working on the same code. Then the notebooks could just show us output or how to use the code.
That works for me.
Cool, thanks.
Back on topic. @semerj mentioned that 18% is pretty good for text!
Really? For Marti's paper they used 3 factors and accounted for 48% of the variance. It seems like 10 components and 18% is really low, but we can try and see if it works... Although maybe after lemmatizing it will be cleaner.
Hmm. Did Marti and her co-author reduce the text prior to the factor analysis (like we did with the 1%)? Or was that their dimensionality reduction approach? (I don't know much about factor analysis.)
Yeah, we'll see what happens after lemmatizing.
PCA isn't a panacea. Your performance is going to depend heavily on your features/preprocessing.
Thanks, @semerj. I think we were surprised by our value, especially in comparison to Marti's previous work, which I hadn't read too closely. We have a few other things to try, so we'll see how that works.
Also wondering whether SVD or Non-negative matrix factorization might work well with text. I think both of these were mentioned in class.
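Both are a few lines in scikit-learn if we want to compare. A sketch only, where `data_matrix` stands in for whatever term-document matrix the PMI notebook produces:

```python
from sklearn.decomposition import TruncatedSVD, NMF

# Truncated SVD (LSA when applied to term-document matrices); unlike PCA it
# does not center the data, so it also works on sparse input.
svd = TruncatedSVD(n_components=10, random_state=42)
X_svd = svd.fit_transform(data_matrix)
print('SVD explained variance:', svd.explained_variance_ratio_.sum())

# Non-negative matrix factorization; note it requires non-negative input,
# which holds for counts, tf-idf, or positive PMI, but not for a PMI matrix
# that keeps negative values.
nmf = NMF(n_components=10, random_state=42)
X_nmf = nmf.fit_transform(data_matrix)
```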
PCA is basically SVD. But we could try it. If 18% works well, that's fine. Marti's previous work used factor analysis to explain 40% of the variance, but it was also a smaller dataset. I think we should really use the lemmatizing, then. I'll push that code into the PMI notebook.
As far as code organization goes, I'm hoping that the PMI notebook can just be used as a black box to create the data matrix. Unless you both would rather I push a .py file conversion of it?? Let me know.
I'm okay with it being used as a black box for now to create the data matrix, but when we need to submit the project it might make it cleaner if they were `.py` files. But we can cross that bridge when we get there.
I agree with and prefer having py files. But, as Matar mentioned, that's a lower priority right now.
Okay, I pushed it. I didn't have any 'nan' entries in the data matrix at the end. If that doesn't really help, we can try stemming as well.
Thanks, @jnaras!
Also, thanks for mentioning that PCA is basically SVD! Link with more info (mostly for myself).
@jnaras Did you remove blank `TotalEssays`? If not, how are there no NaNs? I thought that was what was causing `data_matrix` to have NaNs. See this.
@jnaras I ran your notebook and the datamatrix I got still had NaNs. Not sure if I'm doing something wrong?
The last two lines of `get_data` should be:

```python
# Strip HTML from the combined essays and collapse whitespace
df['TotalEssays'] = df['TotalEssays'].apply(lambda x: BeautifulSoup(x).getText().replace('\n', ' '))\
                                     .apply(lambda x: re.sub('\s+', ' ', x).strip())
# Keep only the rows where the combined essay text is non-empty
return df[df.TotalEssays.str.len() > 0]
```
The last line of my notebook asks if there are any NaNs in the data. Without Juan's add-on, I didn't get any. I don't really understand why. But maybe try it?
Okay, I pushed a new fix to the notebook with the additional lines from Juan's edit. I'm sorry, I should have added it earlier. I was just confused about where the NaNs came from.
Hopefully that helps. I wonder if the pickle module is messing up the data...I'll look into it.
@jnaras It's totally fine. I should have communicated better about the update. I wasn't explicit about the fix—I wrote it as a comment on an issue and pushed to master.
Anyway, thanks for updating the notebook! I will try to run it now. Also, I'm going to remove `Calculate PMI features (NaNs).ipynb` since everything is in your version of `Calculate PMI features.ipynb`.
One additional note. The change to `get_data` doesn't only keep samples where all the essays were filled out (as we talked about here). It simply keeps data where at least one character was written in at least one essay. For this, we can instead count NaNs for essays row-wise and only keep rows where that value is zero (see the sketch below). (Then, the lines we added to `get_data` will be superfluous.)
Thanks again!
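A sketch of that row-wise filter (assuming the essay columns are named `essay0` through `essay9`, as in the OkCupid profiles dump; adjust to the actual column names):

```python
# Hypothetical list of per-prompt essay columns.
essay_cols = ['essay{}'.format(i) for i in range(10)]

# Keep only rows where every essay prompt was answered (zero NaNs row-wise).
complete = df[df[essay_cols].isnull().sum(axis=1) == 0]

# Equivalent, and arguably clearer:
# complete = df[df[essay_cols].notnull().all(axis=1)]
```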
After trying a bunch of things, here is a summary of where we are at with PCA/kmeans:

- PCA: Given that we have 3k features, I guess 50 components isn't too much. I was thinking of sticking with 50 components for now, since 45% of the variance seems respectable. Or maybe even dropping to 20.
- kmeans++: The best score I got was about 0.4 with 3 clusters (which seems too few to me). Adding more clusters consistently made the score go down. The code is ugly and PCA takes a long time to run with 50 components. Until I clean up the code, I saved out the components and the reduced data matrix into a pickle file so you can play around with it (rough sketch of the setup below). I'll email it to you guys (I think it's small enough).
Thoughts?
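Roughly the setup described above, for anyone who wants to reproduce it from the pickled, PCA-reduced matrix (a sketch; the filename and variable names are placeholders):

```python
import pickle

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load the reduced data matrix (50 PCA components) from the emailed pickle.
with open('reduced_data_matrix.pkl', 'rb') as f:  # placeholder filename
    X_reduced = pickle.load(f)

# k-means with k-means++ initialization, scored with the silhouette metric.
for k in range(2, 8):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(X_reduced)
    print(k, silhouette_score(X_reduced, labels))
```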
Thanks for the detailed message, @matarhaller! In terms of the running time for PCA, it looks as if `IncrementalPCA` might be faster while "almost exactly match[ing] the results of PCA." I'm fine with keeping 50 principal components. Also, thanks for the pickle file!
Are the clusters still very different while using `kmeans++`? This might suggest that the "natural" clusters are elongated or irregularly shaped, which is bad news for k-means.
It's great that you started using the silhouette score. Three clusters might not be too bad given that we're combining all of the essay responses. I've been thinking a lot about this today. Is it possible that combining makes it more difficult to find different "topics" people are writing about? What I'm thinking of is that a set of users might write about topic A for prompt 0 while another set writes about topic B. However, what if the second set writes about topic A for prompt 1? I'm thinking that across prompts, what's written about might not be too distinctive. Just conjecture, but something to think about.
Since PCA just needs to be run once, and the data matrix fits in memory, I opted to just use the standard `PCA` instead of `IncrementalPCA`. I was reading the documentation and it says that PCA only works with dense arrays - I know we did the `.to_dense` transformation, but does it matter that the matrix is technically still sparse (many zero entries)? I don't think it matters, but I just want to confirm.
The only indication I have that the clusters are different on different runs of `kmeans++` is that the silhouette score changes (and sometimes kmeans fails completely). We might want to try a different clustering algorithm...
As for separating out different essays - I need to think about this a bit more, but we can try just using a single essay and seeing if it separates out better. My inclination is that since we're using ngrams, we aren't really getting at broader topics anyway, so I don't know if it would be sensitive to the example you gave above. Not sure though...
They have an example on the Iris data set, which, for sure, fits into memory, using `IncrementalPCA`. I'm still curious whether it would speed things up on the data we're using. Of course, as you mentioned, a different clustering algorithm might work better.
Regarding the documentation noting that PCA only works with dense arrays, I think they mean the actual data structure or representation of the data and not the values themselves.
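A quick way to see the distinction (toy data, just for illustration): `PCA` wants a dense array as the container even if most entries are zeros, while `TruncatedSVD` accepts a `scipy.sparse` matrix directly.

```python
from scipy import sparse
from sklearn.decomposition import PCA, TruncatedSVD

# The same mostly-zero data in two containers: sparse matrix vs. dense ndarray.
X_sparse = sparse.random(100, 50, density=0.05, format='csr', random_state=0)
X_dense = X_sparse.toarray()  # same values, dense representation

PCA(n_components=5).fit(X_dense)            # fine: dense data structure
TruncatedSVD(n_components=5).fit(X_sparse)  # fine: accepts sparse input
# PCA(n_components=5).fit(X_sparse)         # raises: PCA needs dense input
```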
I'd like to think more about the last point, too. Even though we're using ngrams, I'm thinking it might still be influenced. Let me try another example in 2d space with just two essays. If the features we were clustering on, for example, are the tokens "good" and "bad," a few possibilities are:

1. Users who write "good" in both essays
2. Users who write "good" in one essay and "bad" in the other
3. Users who write "bad" in both essays

Imagine "good" along the x-axis and "bad" along the y-axis. Regardless of whether we combine essays or not, users in scenarios 1 and 3 would presumably be correctly grouped. We can imagine them on opposite sides of the coordinate system (large x and small y versus small x and large y). If combining essays for users under scenario 2, however, they might be in between (and possibly overlap with) users in 1 and 3, making it difficult to tell which cluster they might belong to. If the essays are separate, though, then we know those under scenario 2 will be in the "good" group for one of the essays and the "bad" group for the other.
I'm not sure if this might even be remotely representative of what's happening in our data, but it's what I was thinking about. Thanks for entertaining the idea, though!
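A toy version of that, just to make the geometry concrete (hypothetical counts of "good" and "bad" per essay for one user in each scenario):

```python
import numpy as np

# Columns: counts of 'good' (x) and 'bad' (y), one array per essay.
scenario_1 = [np.array([5, 0]), np.array([4, 0])]  # 'good' in both essays
scenario_2 = [np.array([5, 0]), np.array([0, 4])]  # 'good' in one, 'bad' in the other
scenario_3 = [np.array([0, 5]), np.array([0, 4])]  # 'bad' in both essays

# Combining essays sums the counts, which puts scenario 2 between 1 and 3;
# keeping essays separate preserves the two distinct points for scenario 2.
for name, essays in [('1', scenario_1), ('2', scenario_2), ('3', scenario_3)]:
    print(name, 'combined:', sum(essays), 'separate:', [e.tolist() for e in essays])
```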
Re: IncrementalPCA - Since we're only running PCA one time, I think I'm okay with just doing the standard one. Also since it seems like the IncrementalPCA is an approximation of PCA, I think we're better off just using PCA if we can.
I agree with your assessment of dense arrays - I just wanted to hear it from someone else :)
And as for splitting up the essays - I think it's an empirical question. We can try! Not sure though if we should just focus on a single essay or if we should do it separately for a few (and then if so, which do we pick?)
You're hard to convince, @matarhaller!
@juanshishido If incremental PCA is used for matrices that don't all fit into memory, it's going to hit disk a lot. That will actually make it a lot slower.
@matarhaller Okay! I'll try to read up on varimax rotation and implement it in python for non-square matrices (sketch below).
@both, I'll convert my `data_matrix` generation script into a `.py` file for ease of use and merging.
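In case it helps, a minimal numpy sketch of the standard varimax algorithm for a (generally non-square) p x k loading matrix; a reference implementation only, not necessarily what will end up in the repo:

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Varimax rotation of a p x k loading matrix (p features, k factors)."""
    p, k = loadings.shape
    R = np.eye(k)  # rotation matrix, updated iteratively
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # SVD of the gradient of the varimax criterion
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        )
        R = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return loadings @ R
```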
@jnaras I think I'll need to `%timeit` :sweat_smile:
Wow, ~73% slower!
```python
[1]: from sklearn.datasets import load_iris
     from sklearn.decomposition import PCA
     from sklearn.decomposition import IncrementalPCA

[2]: iris = load_iris()
     X = iris.data

[3]: %%timeit
     pca = PCA(n_components=2)
     pca.fit(X)

10000 loops, best of 3: 187 µs per loop

[4]: %%timeit
     # batch size to control memory usage
     ipca = IncrementalPCA(n_components=2, batch_size=150)
     ipca.fit(X)

1000 loops, best of 3: 324 µs per loop
```
Thanks, @jnaras!
Looks like PCA will be harder than we thought. When I take the first 10 principal components (after whitening), they collectively only explain about 18% of the variance. I'm not sure if the problem is in the data we're putting in (maybe imputing isn't helping our cause?), but we might need to think about this a little bit.
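Roughly the kind of check being described (a sketch, not the notebook's actual code; `data_matrix` stands in for the PMI notebook's output):

```python
from sklearn.decomposition import PCA

# First 10 principal components with whitening, and their cumulative share
# of the total variance.
pca = PCA(n_components=10, whiten=True)
X_reduced = pca.fit_transform(data_matrix)
print('explained variance:', pca.explained_variance_ratio_.sum())
```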